I have a text file produced by a web spider, which extracts all sentences from a given list of URLs. I need to process this file, find all lines that contain more than 65 characters, and then determine the language of each of those lines. I've got it working as a one-liner (my bash scripting skills are non-existent).

sed -n '/^.\{65\}/p' www.mbl.is | langid --line | grep is 

langid is a Python module that identifies the language of a text and returns a score indicating how confident it is in that identification. To install it, just run:

pip install langid 

or visit https://github.com/saffsd/langid.py for more information. What I need now is to print each original line that was piped into langid and whose result contains 'is' (hence the grep), rather than just the langid result. Below is a sample output of my current command:

('is', -288.34235095977783)
('is', -168.52833652496338)
('is', -255.30311250686646)
('is', -254.8700122833252)
('is', -664.7349543571472)
('is', -169.40936374664307)
('is', -315.0590629577637)
('is', -323.49001693725586)
('is', -281.2222490310669)
('is', -198.52733993530273)
('is', -152.1551775932312)
('is', -66.93532514572144)
('is', -231.61306524276733)
('is', -254.00042057037354)
('is', -322.7330708503723)
('is', -151.84487915039062)

EDIT: as per terdon's comment

Command:

sed -n '/^.\{65\}/p' www.mbl.is 

Output:

Eftir stutt stopp i hofudborginni sem okkur heilt yfir leist agaetlega a var kominn timi a ad graeja visa fyrir Vietnam. 1
I gaer, paskadag, eyddum vid thvi deginum i ad koma okkur fyrir a Back Home, gerdum god kaup a Petaling Street (chinatown) og forum i paskaeggjaleit. 1
Vid, temmilega nyvoknud, stigum ut ur rutunni thar sem klassisku leigubilstjornarnir standa fyrir utan ad berjast um folk i bilana sina. 1
Vid forum med Boraj og Tino og leigdum okkur hljodeinangrad einkaherbergi med ollu innifoldu i klukkutima, fyrir taepa 20 dollara (1/4 af manadarlaunum theirra!) - fullt af bjor, starfsmadur med okkur allan timan og steiktar poddur i snakk med idyfum. 1
Vid ludarnir i "Good morning Vietnam" bolunum okkar umkringd moldriku folki klaett i italkst fra toppi og nidur. 1
Vid aetlum tho rett ad vona ad foreldrar okkar sjai ser faert ad geyma eins og eitt alvoru paskaegg handa hvoru okkar? 1
Hinsvegar var okkur bent a tyndu perluna, Mai Chau, sem hefur allt sem Sapa hefur upp a ad bjoda, nema thu dregur turismann fra. 1
Thetta var audvitad allt saman hreinasta lygi en vid letum okkur hafa thad og gistum eina nott a thessu annars agaeta hoteli. 1
Individual truth is constantly evolving, and a truth seeker must be willing to give up last week's major truth for whatever new discovery the innermost self reveals. 1
Um kvoldid forum vid svo oll saman ad borda vid mekong ana og attum mjog gott kvold saman. 1
Tha segja teir enn fremur ad bandarikjamenn hafi i raun verid ad reyna ad hindra frekari utbreidslu kommunisma i SA-Asiu, svo ad stridid var i raun bara einn stor misskilningur. 1

Command:

sed -n '/^.\{65\}/p' www.mbl.is | langid --line 

Output:

('en', -193.52840971946716)
('en', -445.4644522666931)
('en', -158.1918339729309)
('en', -220.16202330589294)
('en', -596.61936211586)
('en', -379.3824007511139)
('en', -150.61454391479492)
('en', -379.3824007511139)
('en', -270.56594038009644)
('en', -446.9800910949707)
('en', -702.9869554042816)
('en', -208.84209847450256)
('en', -345.15056800842285)
('en', -321.2763195037842)
('en', -209.9769265651703)
('en', -144.31591272354126)
('en', -208.40711855888367)
('en', -161.14595460891724)
('en', -180.95807218551636)
('is', -151.84487915039062)
('en', -32.042465686798096)
('no', -73.23809719085693)
('lb', -194.81272649765015)
('et', -80.76274251937866)
('en', -129.17673206329346)
('en', -95.43238878250122)
('da', -30.086124420166016)

Is this possible in a one-liner, or would it be better to write a script? I can do it in Python, but its regex modules are painful, and I need to be able to change the character threshold quickly depending on the input file and switch the grep to a different language code easily. I also thought this would be a good time to start my journey into bash scripting: bash commands are awesome, and I assume bash scripting is too (I've just got to get my head around the semantics and syntax, with all those $ signs).

  • Please edit your question and add an example of the output of sed -n '/^.\{65\}/p' www.mbl.is | langid --line, without the grep. That way we know what we're looking for without having to install the langid program.
    – terdon
    Commented Oct 14, 2015 at 11:02
  • sorry no worries
    Commented Oct 14, 2015 at 11:04

2 Answers

You can do that with a while loop:

while IFS= read -r l; do
  [ ${#l} -gt 65 ] && \
    echo "$l" | langid --line | grep -q "is" && \
    echo "$l"
done <file

  • IFS= read -r l reads the input line by line and stores the current line in the variable $l, keeping leading whitespace and backslashes intact.
  • [ ${#l} -gt 65 ] checks whether the line contains more than 65 characters.
    • echo "$l" | langid --line | grep -q "is" processes the line and greps for the language code. With -q, grep is silent: we only want to check whether the string is there, without any output.
    • echo "$l" prints the original line if the string is there.
  • <file uses the contents of file as input.

Edit: The above runs the langid command once per line, which is very slow. If you want it to run in a single pass (faster), use this:

awk 'FNR==NR{a[NR]=$0} FNR!=NR&&$1~"is"{print a[FNR]}' \
  <(sed -n '/^.\{65\}/p' file) \
  <(sed -n '/^.\{65\}/p' file | langid --line)

  • awk processes two "files":
    • The output of sed -n '/^.\{65\}/p' file: all sentences with 65 or more characters.
    • The output of sed -n '/^.\{65\}/p' file | langid --line, which classifies all lines with 65 or more characters in a single pass.
  • Inside awk:
    • FNR==NR applies to the first "file".
    • a[NR]=$0 fills an array with the line number as the index and the line as the value.
    • FNR!=NR&&$1~"is" applies to the second "file" and checks whether the first field (the language code) contains the string is.
    • print a[FNR] prints, if that is true, the entry with the same line number from the previously created array a, which contains the original sentence.
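
For comparison, the same index-based pairing can be sketched in Python. This is only an illustration under assumptions: the function name pair_by_index is made up here, and it assumes the classifier output has exactly one line per input line, each in the tuple form shown in the question, e.g. ('is', -151.84487915039062). Unlike the awk regex test, it compares the language code for exact equality.

```python
import ast


def pair_by_index(sentences, labels, lang="is"):
    """Return the sentences whose label at the same index has language `lang`.

    `sentences` and `labels` are parallel lists of strings. Each label is
    assumed to be a printed tuple such as "('is', -151.84487915039062)".
    (pair_by_index is a hypothetical helper, not part of langid.)
    """
    if len(sentences) != len(labels):
        raise ValueError("expected exactly one label per sentence")
    matches = []
    for sentence, label in zip(sentences, labels):
        code, _score = ast.literal_eval(label)  # parse the printed tuple
        if code == lang:
            matches.append(sentence)
    return matches
```

As in the awk version, the classifier only has to run once over the whole filtered file; the pairing afterwards is a cheap lookup by position.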
  • thanks, that seems to work, but it's extremely slow. It's printing a line every 5-7 seconds. Each file contains more than 10,000 lines and I need to run it on about 100 files. Let's be generous and say it takes 3 seconds for each line; that would take just over a month to process all that data on my machine.
    Commented Oct 14, 2015 at 11:25
  • @frostware I edited the answer for that case. With the example data in the question it runs 5-7 seconds in total.
    – chaos
    Commented Oct 14, 2015 at 12:00
  • sorry man, but how do I run it? For example, the file that I want to run it on is called www.mbl.is. Do I replace both file variables with my input? When I do that it just prints 392 blank lines with a couple of sentences printed sporadically throughout.
    Commented Oct 14, 2015 at 12:23
  • @frostware I changed it again, please try again with the second one. And yes, replace both occurrences of file with your filename.
    – chaos
    Commented Oct 14, 2015 at 13:04
  • Works perfectly. Although I did go with the Python solution, as I need to add more features now. But in saying that, you best answered the question, as it is in bash and works perfectly. Thanks for all the help.
    Commented Oct 14, 2015 at 13:49

If your shell is bash, you could do something like this:

sed -n '/^.\{65\}/p' www.mbl.is | while read line ; do
  LANGID=$(echo "$line" | langid --line)
  if [[ "$LANGID" =~ is ]] ; then
    echo "$line: $LANGID"
  fi
done

This would be very slow, though, because it runs multiple instances of langid (one per input line). You'd probably be better off writing a Python script that imports langid, as mentioned in the readme file on GitHub. As with the shell loop above, a simple loop reading stdin and passing each line to langid.classify() would do.

My Python is extremely rusty and I don't have langid.py installed, so this is untested, but here's a really primitive Python example:

#! /usr/bin/python
import langid, fileinput, re

for line in fileinput.input():
    if len(line) > 65:
        id = langid.classify(line)
        if re.match(r'is',id):
            print line, ": ", id

It did pass a compilation test with python -m py_compile langtest.py, but that's about all I can say in its favour.


Added by frostware:

A much improved and presumably tested and working version:

#! /usr/bin/python
import sys, codecs, re
from fileinput import input as file
from langid import classify

# Output STDOUT as UTF-8
sys.stdout = codecs.getwriter("utf8")(sys.stdout)
sys.stderr = codecs.getwriter("utf8")(sys.stderr)

# read text as a positional argument and process each line
for line in file():
    # check if line is greater than 65 characters
    if len(line) > 65:
        # determine the language of each line
        id = classify(line)
        # check if language is Icelandic
        if re.search('is', str(id)):
            # print the line and the langid classification
            print line, ": ", id

There is also a more comprehensive Python script that accepts arguments and has some extra features: Gist Code
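
Both scripts above use Python 2 syntax (the print statement). A rough Python 3 sketch of the same idea follows, with the language code and length threshold as parameters, since the question asks for both to be easy to change. The names lines_in_language and main are made up for this illustration. The only langid call assumed is langid.classify(), which returns a (language_code, score) tuple; the sketch has not been run against langid itself.

```python
import fileinput


def lines_in_language(lines, classify, lang="is", min_len=65):
    """Yield (text, result) for lines longer than `min_len` classified as `lang`.

    `classify` is any callable returning a (language_code, score) tuple,
    which is the shape langid.classify() returns.
    """
    for line in lines:
        text = line.rstrip("\n")  # do not count the trailing newline
        if len(text) > min_len:
            result = classify(text)
            if result[0] == lang:  # compare the code directly, no regex needed
                yield text, result


def main(paths=None, lang="is", min_len=65):
    """Read the given files (or stdin if none) and print the matching lines."""
    # Imported here so that lines_in_language() itself does not require langid.
    from langid import classify

    with fileinput.input(files=paths or ("-",)) as stream:
        for text, result in lines_in_language(stream, classify, lang, min_len):
            print(f"{text}: {result}")
```

Calling main(["www.mbl.is"]) would process that file, loading the langid model once rather than once per line.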

  • I think that might be the only option, thanks man
    Commented Oct 14, 2015 at 11:28
  • feel free to do what you want with the above. it's too short to be anything but public domain and i wouldn't be posting any super-sekrit trade secrets here anyway. i could, but then i'd have to killall world
    – cas
    Commented Oct 14, 2015 at 11:48
  • Did you mean re.match(r'is', id)?
    Commented Oct 14, 2015 at 12:09
  • yep, good catch.
    – cas
    Commented Oct 14, 2015 at 12:13
  • The problem is that id is a tuple, which I can change to a string, but if you run print re.match(r'is', "('is', 0.9989814056831943)") in the Python interpreter you get None, meaning the regex didn't match.
    Commented Oct 14, 2015 at 12:28
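
The behaviour described in the last comment can be reproduced with a short sketch. The string is the one quoted in that comment; the variable names are only illustrative. re.match anchors at the start of the string, which here begins with "(", while re.search scans the whole string.

```python
import re

# The stringified tuple quoted in the comment above.
text = "('is', 0.9989814056831943)"

anchored = re.match(r"is", text)   # None: match only tries position 0, which is "("
anywhere = re.search(r"is", text)  # match object: search scans the whole string

# With the tuple itself, comparing the first element avoids regexes entirely.
result = ("is", 0.9989814056831943)
is_icelandic = result[0] == "is"
```

This is why the later version of the script uses re.search on str(id) instead of re.match.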
