Monday, July 16, 2012

NLTK for text processing

NLTK is a Python-based natural language processing toolkit that is quite popular
for text processing tasks. Some of these tasks are:

1. Sentence detection:
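If you have a large, noisy corpus and need to split it into natural-language sentences, you can use sent_tokenize. It relies on NLTK's Punkt models, which have to be downloaded once with nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead):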

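        from nltk.tokenize import sent_tokenize

        # Read the whole corpus; the with-block closes the file afterwards
        with open('input.txt', 'r') as f:
            text = f.read()
        sentences = sent_tokenize(text)  # list of sentence strings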
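
Note that sent_tokenize only splits the text; it does not judge whether a piece is a real sentence. For a noisy corpus, a rough filter can be applied to its output. The sketch below is only an illustration: the function name looks_like_sentence and both thresholds are my own assumptions, not part of NLTK, and they will need tuning for your data.

```python
def looks_like_sentence(s, min_words=3, min_alpha_ratio=0.6):
    """Hypothetical heuristic (not an NLTK function); thresholds are guesses."""
    s = s.strip()
    words = s.split()
    # Very short fragments are usually menu items, captions, etc.
    if len(words) < min_words:
        return False
    # Share of non-space characters that are letters
    chars = [c for c in s if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in chars) / len(chars)
    if alpha_ratio < min_alpha_ratio:
        return False
    # Require terminal punctuation
    return s[-1] in ".!?"


# Stand-in for the list returned by sent_tokenize(text)
candidates = [
    "NLTK is a popular toolkit for text processing.",
    "|| 404 >> home :: login",
    "Click here",
]
clean = [s for s in candidates if looks_like_sentence(s)]
```

Here only the first candidate survives: the second has too few letters and the third has too few words.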

To write the sentences to a separate file, one per line:

        with open('out.txt', 'w') as out:
            for line in sentences:
                # sent_tokenize drops the whitespace between sentences,
                # so add the newline back explicitly
                out.write(line + '\n')
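
One thing to watch for: a sentence returned by sent_tokenize can still contain line breaks from the original text, which breaks a one-sentence-per-line output file. A minimal sketch of a helper that handles this is given below. The name write_sentences and its tokenize parameter are hypothetical, introduced here only for illustration; by default it falls back to NLTK's sent_tokenize.

```python
def write_sentences(text, out_path, tokenize=None):
    """Hypothetical helper: split text and write one sentence per line."""
    if tokenize is None:
        # Imported lazily; needs NLTK and its Punkt data installed
        from nltk.tokenize import sent_tokenize
        tokenize = sent_tokenize

    # Collapse internal newlines and repeated spaces inside each sentence
    sentences = [" ".join(s.split()) for s in tokenize(text)]
    # Drop anything that was only whitespace
    sentences = [s for s in sentences if s]

    with open(out_path, "w", encoding="utf-8") as out:
        for s in sentences:
            out.write(s + "\n")
    return sentences
```

With the variables from above, this would be called as write_sentences(text, 'out.txt'). Passing your own tokenize function is optional and mainly useful for trying the helper without the Punkt data.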