A handful of small scripts come up again and again when processing text. This post collects some of the most frequently used ones, which may save you the time of searching for each task separately.
Sampling lines from text:
If your data set has a large number of lines, you may want to sample a subset of them for evaluation purposes. The following Perl command does this:
cat your_file.txt | perl -n -e 'print if (rand() < .1)' > newfile.txt
This command keeps each line of your_file.txt with probability 0.1, so roughly 10% of the lines
are written to newfile.txt.
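If the sample has to be repeatable (for example, to compare two runs on the same evaluation subset), a fixed random seed helps. Below is a minimal sketch of the same idea in awk. The function name sample_lines, the seed value 42, and the 1000-line demo file are assumptions made for illustration; they are not part of the original command.

```shell
# sample_lines RATE SEED FILE  (hypothetical helper name)
# Keeps each line independently with probability RATE. A fixed SEED makes
# the selection repeatable across runs with the same awk implementation.
sample_lines() {
    awk -v rate="$1" -v seed="$2" 'BEGIN { srand(seed) } rand() < rate' "$3"
}

# Example: build a 1000-line demo file and sample roughly 10% of it.
demo=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$demo"
sample_lines 0.1 42 "$demo" > "$demo.sample"
wc -l < "$demo.sample"   # close to 100, but not exactly 100 in general
```

Because every line is decided independently, the size of the sample varies around 10% of the input rather than hitting it exactly.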
Reading one sentence per line:
If you are familiar with Java and use BufferedReader and its readLine() method to read lines from
text, you may run into trouble with a file such as:
Now the earth was formless and empty. Darkness was on the surface
of the deep. God's Spirit was hovering over the surface
of the waters.
Java's readLine() method returns the text up to the next newline, so it treats each physical line
as a unit. A sentence, however, ends with a full stop (.) and may span several lines. The following
script pre-processes the text and writes it out with a single sentence per line.
perl -pe 's/\n\Z/ /; s/(\.)\s*/$1\n/g' inputfile.txt
For more information, see
http://stackoverflow.com/questions/10375031/preparing-single-sentence-per-line-document-from-plain-text
Deleting blank lines in a text file:
A text may contain blank lines, and a script may sometimes take such a line to mean the end of
the document. We can delete those lines in a pre-processing step, before the text is processed
further.
sed '/^$/d' myFile > tt
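The pattern /^$/ matches only lines that are completely empty. A line that holds nothing but spaces or tabs still passes through. A minimal sketch of a variant that removes those lines too is shown below; the function name drop_blank_lines is an assumption made for illustration.

```shell
# drop_blank_lines (hypothetical helper name): removes lines that are empty
# or contain only whitespace. [[:space:]] is a POSIX character class.
drop_blank_lines() {
    sed '/^[[:space:]]*$/d'
}

printf 'first\n\n   \n\t\nsecond\n' | drop_blank_lines
# first
# second
```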