Monday, July 16, 2012

NLTK for text processing

NLTK is a Python-based natural language processing toolkit that is quite popular
for text processing. Some of the tasks it handles are:

1. Sentence detection:
If you have a large, noisy corpus and need to extract the actual sentences from it, you can do:

        from nltk.tokenize import sent_tokenize
        text = open('input.txt','r').read()
        sentences = sent_tokenize(text)

To write the sentences to a separate file, one per line, you can do:

        with open('out.txt', 'w') as out:
            for line in sentences:
                out.write(line + '\n')

Monday, April 30, 2012

Some frequently used scripts in Text Processing

A handful of small scripts come up again and again when processing text. Here are some of the ones I use most often; having them in one place may save you from googling each task separately.

Sampling Lines from text:

If you have a large number of lines in your data set, you may want to sample some of them for evaluation purposes. The following Perl one-liner does this for you:

cat your_file.txt | perl -n -e 'print if (rand() < .1)' > newfile.txt 
This command keeps each line of your_file.txt with probability 0.1, so it samples
roughly 10% of the lines, and writes them to newfile.txt.
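
For anyone who prefers Python, the same idea can be sketched as below. This is only a minimal sketch, not a drop-in replacement: the function name sample_lines and its rate and seed parameters are my own choices for illustration. Like the Perl version, it keeps each line independently, so the sample size is only approximately 10% of the input.

```python
import random


def sample_lines(lines, rate=0.1, seed=None):
    """Keep each line independently with probability `rate`.

    `seed` is optional; passing a fixed value makes the sample repeatable.
    """
    rng = random.Random(seed)
    # rng.random() returns a float in [0.0, 1.0), so the comparison
    # is true for roughly `rate` of the lines.
    return [line for line in lines if rng.random() < rate]
```

To mirror the command above, read the lines of your_file.txt, pass them to sample_lines, and write the returned list to newfile.txt.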

Reading one sentence per line from text:

If you are familiar with Java and use BufferedReader and its readLine() method to read lines from
text, you may run into trouble with a file such as:
Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.
Java's readLine() method returns everything up to the next newline, so it treats each physical line
as one unit. But a sentence ends at a full stop (.), not at a line break. The following script
pre-processes the text so that the output has a single sentence on each line.
perl -pe 's/\n\Z/ /; s/(\.)\s*/$1\n/g' inputfile.txt 
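
The same pre-processing can be sketched in Python. This is a rough equivalent under my own assumptions, not the exact behaviour of the Perl one-liner: the function name one_sentence_per_line is made up for illustration, and the sketch collapses all runs of whitespace into single spaces. Like the Perl version, it splits after every full stop, so abbreviations such as "Dr." will also cause a break.

```python
import re


def one_sentence_per_line(text):
    """Return a list of sentences, splitting after each full stop."""
    # Undo the hard line wrapping: collapse every run of whitespace
    # (including newlines) into a single space.
    joined = " ".join(text.split())
    # Start a new line after each full stop that is followed by whitespace.
    return re.sub(r"\.\s+", ".\n", joined).splitlines()
```

Applied to the three wrapped lines shown above, this returns three strings, one per sentence.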

Deleting blank lines in a text file:
Text files sometimes contain blank lines, and a script may mistake one of them for the end of
the document. We can delete those lines as a pre-processing step:
              sed '/^$/d' myFile > tt  
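
If the rest of your pipeline is already in Python, the blank-line filter can be sketched there as well. The function name drop_blank_lines is hypothetical. To stay close to the sed pattern, this sketch removes only lines that are completely empty and keeps lines that contain spaces or tabs.

```python
def drop_blank_lines(lines):
    """Return the lines that are not completely empty."""
    # Remove only the line terminator before testing, so a line made of
    # spaces or tabs still counts as non-empty, as it does for /^$/.
    return [line for line in lines if line.rstrip("\r\n") != ""]
```

To drop whitespace-only lines too, test line.strip() instead.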

Saturday, September 18, 2010

Big Illusions - Simple Truth

Date: 18 September 2010

Date: 11 September 2010

Date: 4 September 2010

Date: 28 August 2010

Friday, September 3, 2010

Controversial tape recording of Maoist leader Krishna Bahadur Mahara

A controversial tape recording of Maoist leader Krishna Bahadur Mahara, who allegedly sought Rs 500 million from China to buy lawmakers, has been released through the Nepali media. The Maoists have been trying to get back into government for the last 15 months, ever since the Prachanda government was brought down by Prachanda himself, and they have used every possible means to secure a position in it. Given all this, people cannot easily believe Mahara's denial on this issue. Here are the links to the audio:

First conversation
Second Conversation