Monday, July 16, 2012

NLTK for text processing

NLTK is a Python-based natural language processing toolkit. It is quite popular
for text processing tasks. Some of these tasks are:

1. Sentence detection:
If you have a large, noisy corpus and need to extract proper sentences from it, you can do:

        from nltk.tokenize import sent_tokenize
        # sent_tokenize relies on NLTK's Punkt tokenizer data being installed
        with open('input.txt', 'r') as f:
            text = f.read()
        sentences = sent_tokenize(text)

Now, to write the sentences to a separate file, one sentence per line, you can do:

        with open('out.txt', 'w') as out:
            for line in sentences:
                # add the newline explicitly, otherwise all sentences run together
                out.write(line + '\n')
      

Monday, April 30, 2012

Some frequently used scripts in Text Processing

Various scripts are useful for processing text. Here I mention some of the ones used most frequently in text processing. This may save you the time of googling for each task separately.


Sampling Lines from text:


If you have a large number of lines in your data set, you may want to sample a good number of them for evaluation purposes. The following Perl command does this for you:

cat your_file.txt | perl -n -e 'print if (rand() < .1)' > newfile.txt 
 
This command keeps each line with probability 0.1, so it samples roughly 10% of the
lines from your_file.txt and writes them to newfile.txt.
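
If you would rather stay in Python, the same idea can be sketched as below. This is only a minimal sketch and not the command above: the function name sample_lines and its rate and seed parameters are names I made up for illustration. Each line is kept independently with probability rate, so the size of the sample is only approximately 10% of the input.

```python
import random

def sample_lines(lines, rate=0.1, seed=None):
    """Keep each line independently with probability `rate` (hypothetical helper)."""
    rng = random.Random(seed)  # passing a fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < rate]
```

It could be used, for example, as sample_lines(open('your_file.txt'), rate=0.1), writing the returned lines to newfile.txt afterwards.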


Reading separate sentences from text:

If you are familiar with Java and use the readLine() method of BufferedReader to read lines from
text, then you may have a problem reading a file such as:
 
 
Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.
 
 
Basically, Java's readLine() method treats the string up to a newline as one unit. But sentences are
those which end with a full stop (.). The following script pre-processes the text and writes it out
with a single sentence per line.
 
perl -pe 's/\n\Z/ /; s/(\.)\s*/$1\n/g' inputfile.txt 
 
For more information, you may look into 
http://stackoverflow.com/questions/10375031/preparing-single-sentence-per-line-document-from-plain-text  
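
The same pre-processing can also be sketched in Python. This is a rough equivalent under my own assumptions, not a translation taken from the link above: the function name one_sentence_per_line is hypothetical, and like the Perl one-liner it splits on every full stop, so abbreviations such as "Dr." or numbers such as "3.14" will be split too.

```python
import re

def one_sentence_per_line(text):
    """Reflow text so that each returned item is one sentence ending in a full stop."""
    # Join the original lines with single spaces, dropping the hard line breaks.
    joined = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Break after every full stop, swallowing any whitespace that follows it.
    pieces = re.sub(r"\.\s*", ".\n", joined).split("\n")
    # Drop the empty piece left after the final full stop.
    return [p for p in pieces if p]
```

Applied to the three-line example above, this returns three strings, one per sentence, which can then be written out one per line.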

 
 Deleting blank lines in a text file:
  
 There can be some blank lines in the text, and sometimes our script may hit such a blank line and
 assume it is the end of the document. We can delete those lines before processing the text, as a
 pre-processing task.
 
 
              sed '/^$/d' myFile > tt
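
For comparison, here is a minimal Python sketch of the same clean-up. The function name drop_blank_lines and the strip_whitespace_only flag are my own hypothetical names. By default it mirrors the sed pattern /^$/, which only matches completely empty lines; with the flag set it also drops lines that contain nothing but spaces or tabs, which the sed command above would keep.

```python
def drop_blank_lines(lines, strip_whitespace_only=False):
    """Remove empty lines, similar to sed '/^$/d' (hypothetical helper)."""
    kept = []
    for line in lines:
        content = line.rstrip("\n")  # ignore the line terminator itself
        if strip_whitespace_only:
            content = content.strip()  # also treat "   " as blank
        if content:
            kept.append(line)
    return kept
```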