I have a text file output by a web spider. The spider extracts all sentences from a given list of URLs. I need to process this file, find all lines that contain more than 65 characters, and then determine the language of each of those lines. I have it working as a one-liner (my bash scripting skills are non-existent):
sed -n '/^.\{65\}/p' www.mbl.is | langid --line | grep is
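To spell out the sed part of that pipeline: `^.\{65\}` matches 65 arbitrary characters from the start of a line, so `-n` together with `p` prints every line that has 65 or more characters (a line of exactly 65 is included). A minimal sketch with the length pulled out into a parameter; the function name `long_lines` is hypothetical and not part of any tool:

```shell
# Hypothetical helper: print every line of FILE that has at least MIN characters.
long_lines() {
    local min="$1"
    local file="$2"
    # ^.\{N\} matches N arbitrary characters from the start of the line,
    # so any line with N or more characters is printed.
    sed -n "/^.\{$min\}/p" "$file"
}
```

With this, `long_lines 65 www.mbl.is` should behave like the sed command above, and the threshold can be changed per input file.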
langid is a Python module that identifies the language of a text and reports a score indicating how likely that language is. To install it, just run:
pip install langid
or visit https://github.com/saffsd/langid.py for more information. What I need now is to print each line that was piped into the langid command and identified as 'is' (Icelandic), hence the grep. Below is a sample of the output of my current command:
('is', -288.34235095977783)
('is', -168.52833652496338)
('is', -255.30311250686646)
('is', -254.8700122833252)
('is', -664.7349543571472)
('is', -169.40936374664307)
('is', -315.0590629577637)
('is', -323.49001693725586)
('is', -281.2222490310669)
('is', -198.52733993530273)
('is', -152.1551775932312)
('is', -66.93532514572144)
('is', -231.61306524276733)
('is', -254.00042057037354)
('is', -322.7330708503723)
('is', -151.84487915039062)
EDIT: as per terdon♦'s comment
Command:
sed -n '/^.\{65\}/p' www.mbl.is
Output:
Eftir stutt stopp i hofudborginni sem okkur heilt yfir leist agaetlega a var kominn timi a ad graeja visa fyrir Vietnam. 1
I gaer, paskadag, eyddum vid thvi deginum i ad koma okkur fyrir a Back Home, gerdum god kaup a Petaling Street (chinatown) og forum i paskaeggjaleit. 1
Vid, temmilega nyvoknud, stigum ut ur rutunni thar sem klassisku leigubilstjornarnir standa fyrir utan ad berjast um folk i bilana sina. 1
Vid forum med Boraj og Tino og leigdum okkur hljodeinangrad einkaherbergi med ollu innifoldu i klukkutima, fyrir taepa 20 dollara (1/4 af manadarlaunum theirra!) - fullt af bjor, starfsmadur med okkur allan timan og steiktar poddur i snakk med idyfum. 1
Vid ludarnir i "Good morning Vietnam" bolunum okkar umkringd moldriku folki klaett i italkst fra toppi og nidur. 1
Vid aetlum tho rett ad vona ad foreldrar okkar sjai ser faert ad geyma eins og eitt alvoru paskaegg handa hvoru okkar? 1
Hinsvegar var okkur bent a tyndu perluna, Mai Chau, sem hefur allt sem Sapa hefur upp a ad bjoda, nema thu dregur turismann fra. 1
Thetta var audvitad allt saman hreinasta lygi en vid letum okkur hafa thad og gistum eina nott a thessu annars agaeta hoteli. 1
Individual truth is constantly evolving, and a truth seeker must be willing to give up last week's major truth for whatever new discovery the innermost self reveals. 1
Um kvoldid forum vid svo oll saman ad borda vid mekong ana og attum mjog gott kvold saman. 1
Tha segja teir enn fremur ad bandarikjamenn hafi i raun verid ad reyna ad hindra frekari utbreidslu kommunisma i SA-Asiu, svo ad stridid var i raun bara einn stor misskilningur. 1
Command:
sed -n '/^.\{65\}/p' www.mbl.is | langid --line
Output:
('en', -193.52840971946716)
('en', -445.4644522666931)
('en', -158.1918339729309)
('en', -220.16202330589294)
('en', -596.61936211586)
('en', -379.3824007511139)
('en', -150.61454391479492)
('en', -379.3824007511139)
('en', -270.56594038009644)
('en', -446.9800910949707)
('en', -702.9869554042816)
('en', -208.84209847450256)
('en', -345.15056800842285)
('en', -321.2763195037842)
('en', -209.9769265651703)
('en', -144.31591272354126)
('en', -208.40711855888367)
('en', -161.14595460891724)
('en', -180.95807218551636)
('is', -151.84487915039062)
('en', -32.042465686798096)
('no', -73.23809719085693)
('lb', -194.81272649765015)
('et', -80.76274251937866)
('en', -129.17673206329346)
('en', -95.43238878250122)
('da', -30.086124420166016)
Is it possible to do this in a one-liner, or would it be better to write a script? I could do it in Python, but its regex module is painful, and I need to be able to change the character count quickly depending on the input file and to change the grep to a different language code easily. I also thought this would be a good time to start my journey into bash scripting: bash commands are awesome, and I assume bash scripting is too (I just have to get my head around the semantics and syntax, with all those $ signs).
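One possible approach, sketched under the assumption that `langid --line` prints exactly one result line per input line and in the same order (the sample output above suggests this, but it is not verified here): save the filtered lines to a temporary file, classify them, glue each result back onto its source line with `paste`, and only then grep for the language code. The function name `lines_in_lang` and its argument order are hypothetical; the optional trailing arguments let a different classifier command be substituted.

```shell
# Hypothetical helper: lines_in_lang MIN LANG FILE [CLASSIFIER COMMAND...]
# Prints "<classifier result><TAB><original line>" for every line of FILE
# with at least MIN characters that the classifier tags as LANG.
lines_in_lang() {
    local min="$1" lang="$2" file="$3"
    shift 3
    # Remaining arguments are the classifier command; default to the
    # command used in the question. Assumption: it reads lines on stdin
    # and emits one result line per input line, in order.
    if [ "$#" -eq 0 ]; then
        set -- langid --line
    fi
    local tmp
    tmp=$(mktemp) || return 1
    # Keep only lines with MIN or more characters.
    sed -n "/^.\{$min\}/p" "$file" > "$tmp"
    # Pair each result with its source line, then filter on the result
    # column only (anchored at line start, so the text itself cannot match).
    "$@" < "$tmp" | paste - "$tmp" | grep "^('$lang',"
    rm -f "$tmp"
}
```

Called as `lines_in_lang 65 is www.mbl.is`, this would print each matching tuple followed by a tab and the sentence it was computed from; appending `| cut -f2-` would leave only the sentences. Anchoring the grep at `^('is',` also avoids matching English lines that merely contain the word "is".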
sed -n '/^.\{65\}/p' www.mbl.is | langid --line, without the grep. That way we know what we're looking for without having to install the langid program.