Bhattarai Guru: Some frequently used scripts in Text Processing

Many small scripts are useful for processing text. Here I list some of the ones I use most often in text processing. This may save you the time of googling each task separately.

Sampling Lines from text:

If your data set has a large number of lines, you may want to sample a good number of them for evaluation purposes. The following Perl one-liner does this for you:

perl -n -e 'print if (rand() < .1)' your_file.txt > newfile.txt

This command keeps each line of your_file.txt with probability 0.1, so it samples roughly 10% of the lines and writes them into newfile.txt.
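If you want the same sample on every run, one option is to fix the random seed. The sketch below does this with awk instead of the Perl one-liner. The function name sample_lines and the seed 42 are placeholders of mine, not part of any standard tool.

```shell
# sample_lines FRACTION SEED FILE
# Hypothetical helper: keeps each line of FILE independently with
# probability FRACTION, seeding the generator with SEED so that
# repeated runs with the same awk give the same sample.
sample_lines() {
    awk -v p="$1" -v seed="$2" 'BEGIN { srand(seed) } rand() < p' "$3"
}

# Example use (commented out so that sourcing this has no side effects):
# sample_lines 0.1 42 your_file.txt > newfile.txt
```

As with the Perl version, the result is only about 10% of the lines, not an exact count. A different awk implementation may pick different lines for the same seed.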

Reading one sentence per line from text:

If you are familiar with Java and use BufferedReader and its readLine() method to read lines from text, you may run into trouble with a file such as:

Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.

Java's readLine() method returns the string up to the next newline, so it is tempting to treat each returned line as a sentence. But a sentence really ends at a full stop (.), and in the text above the sentences run across line breaks. The following script pre-processes the text and prints it with a single sentence per line.

perl -pe 's/\n\Z/ /; s/(\.)\s*/$1\n/g' inputfile.txt

For more information, see:

http://stackoverflow.com/questions/10375031/preparing-single-sentence-per-line-document-from-plain-text
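To check the result on the three-line passage above, here is a rough awk sketch of the same idea: join all lines, then break after every full stop. It is not the Perl command itself, and the name one_sentence_per_line is made up for this example. Unlike the Perl version it holds the whole text in memory, so I would only use it on files of modest size.

```shell
# one_sentence_per_line [FILE...]
# Hypothetical helper: reads the files (or standard input) and prints
# one sentence per line, treating every full stop as a sentence end.
one_sentence_per_line() {
    awk '
        { text = text $0 " " }                 # join the input lines with a space
        END {
            gsub(/\.[ \t]*/, ".\n", text)      # newline after each full stop, spaces dropped
            sub(/[ \t\n]+$/, "", text)         # trim the trailing whitespace
            if (text != "") print text
        }
    ' "$@"
}

# Example use on the passage above, with the output it prints:
# one_sentence_per_line inputfile.txt
# Now the earth was formless and empty.
# Darkness was on the surface of the deep.
# God's Spirit was hovering over the surface of the waters.
```

Like the Perl command, this splits at every full stop, so an abbreviation such as "Dr." is also treated as the end of a sentence.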

Deleting blank lines in a text file:

A text file can contain blank lines, and sometimes a script may hit one of those blank lines and assume it has reached the end of the document. We can delete those lines as a pre-processing task before processing the text.

sed '/^$/d' myFile > tt
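Note that /^$/ matches only lines that are completely empty. If the file may also contain lines holding just spaces, tabs, or a stray carriage return, a variant along these lines may help. The name drop_blank_lines is just one I picked for the example.

```shell
# drop_blank_lines [FILE...]
# Hypothetical helper: deletes lines that are empty or contain only
# whitespace, reading the files given or standard input.
drop_blank_lines() {
    sed '/^[[:space:]]*$/d' "$@"
}

# Example use (commented out so that sourcing this has no side effects):
# drop_blank_lines myFile > tt
```

Lines that contain any visible text are passed through unchanged, including their leading and trailing spaces.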

Monday, April 30, 2012
