Bhattarai Guru

Monday, July 16, 2012

NLTK for text processing

NLTK is a Python-based natural language processing toolkit. It is quite popular
for text processing tasks. Some of these tasks are:

1. Sentence detection:
If you have a large, noisy corpus and need to extract the actual sentences from it, you can do:

        from nltk.tokenize import sent_tokenize
        text = open('input.txt','r').read()
        sentences = sent_tokenize(text)

Now, to write the sentences to a separate file, one per line, you can do:

        # 'with' closes the file for us; add the newline that write() does not
        with open('out.txt', 'w') as out:
            for line in sentences:
                out.write(line + '\n')

Monday, April 30, 2012

Some frequently used scripts in Text Processing

A handful of small scripts come up again and again in text processing. Here are some of the ones I use most often. This may save you the time of googling each task separately.

Sampling Lines from text:

If your data set has a large number of lines, you may want to sample a reasonable number of them for evaluation purposes. The following Perl one-liner does this for you:

cat your_file.txt | perl -n -e 'print if (rand() < .1)' > newfile.txt

This command samples about 10% of the lines from your_file.txt and writes them to newfile.txt. Each line is kept independently with probability 0.1, so the exact count varies from run to run.
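For those who would rather stay in Python, the same idea can be sketched as below. The names sample_lines and sample_file are my own, not from any library. Like the Perl command, this keeps each line independently with probability 0.1, so the output is roughly, not exactly, 10% of the input.

```python
import random


def sample_lines(lines, rate=0.1, seed=None):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)  # passing a seed makes the sample reproducible
    return [line for line in lines if rng.random() < rate]


def sample_file(src_path, dst_path, rate=0.1, seed=None):
    """Read src_path, write the sampled lines to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(sample_lines(src, rate, seed))
```

Calling sample_file('your_file.txt', 'newfile.txt') would mirror the command above. If you need an exact number of lines instead, random.sample(lines, k) is the usual choice.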

Reading separate line from text:

If you are familiar with Java and use BufferedReader and its readLine() method to read lines from text, you may run into trouble with a file such as:

Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.

Java's readLine() method treats everything up to a newline as one unit, but a sentence actually ends with a full stop (.), not with a newline. The following script pre-processes the text so that each output line holds a single sentence.

perl -pe 's/\n\Z/ /; s/(\.)\s*/$1\n/g' inputfile.txt

For more information, you may look into

http://stackoverflow.com/questions/10375031/preparing-single-sentence-per-line-document-from-plain-text
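A rough Python version of the same pre-processing is sketched below. The function name one_sentence_per_line is my own. It makes the same simplifying assumption as the Perl command, namely that every full stop ends a sentence, so abbreviations such as "Dr." and decimal numbers will be split too.

```python
import re


def one_sentence_per_line(text):
    """Return the sentences in `text`, one per list item."""
    # Join the wrapped lines into one long line, skipping empty lines.
    joined = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Break after every full stop and swallow the spaces that follow it.
    return re.sub(r"\.\s*", ".\n", joined).splitlines()
```

Applied to the three wrapped lines shown earlier, this gives three items, one per sentence. For real text, a trained tokenizer such as NLTK's sent_tokenize will handle abbreviations better.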

Deleting blank lines in a text file:

There can be blank lines in a text, and sometimes a script will hit one of those blank lines and assume it has reached the end of the document. We can delete those lines as a pre-processing step before working on the text.

              sed '/^$/d' myFile > tt
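The same filter can be sketched in Python. The function name drop_blank_lines and the whitespace_too flag are my own additions. By default it matches the sed command and removes only completely empty lines; the flag also removes lines that contain nothing but spaces or tabs, which sed '/^$/d' leaves in place.

```python
def drop_blank_lines(lines, whitespace_too=False):
    """Return `lines` without the blank ones."""
    kept = []
    for line in lines:
        content = line.rstrip("\n")  # ignore the line terminator itself
        if whitespace_too:
            content = content.strip()  # treat "   " as blank as well
        if content:
            kept.append(line)
    return kept
```

To use it on a file, read the lines with open('myFile').readlines(), pass them through the function, and write the result out with writelines().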

Wednesday, January 5, 2011

Nerdiness

NerdTests.com says I'm a Slightly Dorky Nerd God. Click here to take the Nerd Test, get geeky images and jokes, and talk to others on the nerd forum!

Saturday, September 18, 2010

Thula Bhram - Satya Sadharan (Big Illusions - Simple Truths)

Date: 18 September 2010

Date: 11 September 2010

Date: 4 September 2010

Date: 28 August 2010

Friday, September 17, 2010

Caravan (The Himalayas) - a must watch

PART 1

PART 2

PART 6

PART 7

Friday, September 3, 2010

Controversial tape record of Maoist leader Krishna Bahadur Mahara

A controversial audio tape of Maoist leader Krishna Bahadur Mahara, in which he allegedly seeks Rs 500 million from China to buy lawmakers, has been released through the Nepali media. The Maoists have been trying to enter the government for the last 15 months, ever since Prachanda's government was brought down by Prachanda himself. They have used every possible means to secure a position in the government. Given all those activities, people can't easily believe Mahara's denial on this issue. Here is the link to the audio:

First conversation
Second Conversation