
I have mixed-language text files, and I would like to count the total number of printable characters belonging to one of the languages. It helps that the languages occupy different Unicode ranges.

My specific use-case involves Hebrew, Polytonic Greek, and English -- but I imagine a solution to this problem could be generalized for other contexts, too.

I would like to count the Hebrew characters only -- that's Unicode [\u0590-\u05ff]. Here's a brief sample input file (which, by my manual count, contains 62 Hebrew characters):

[ Ps117 ]‬
h1: ‫  הללו את יהוה כל גוים שבחוהו כל האמים ‬
r1: Praise the LORD, all nations! Extol him, all peoples!
g1: Αλληλουια. Αἰνεῖτε τὸν κύριον, πάντα τὰ ἔθνη, ἐπαινέσατε αὐτόν, πάντες οἱ λαοί,
b1: Alleluia. Praise the Lord all you nations: praise him all you peoples.
h2: ‫  כי גבר עלינו חסדו ואמת יהוה לעולם הללו יה ‬
r2: For great is his steadfast love toward us; and the faithfulness of the LORD endures for ever. Praise the LORD!
g2: ὅτι ἐκραταιώθη τὸ ἔλεος αὐτοῦ ἐφ' ἡμᾶς, καὶ ἡ ἀλήθεια τοῦ κυρίου μένει εἰς τὸν αἰῶνα.
b2: For his mercy has been abundant toward us: and the truth of the Lord endures for ever.

I'm on Ubuntu 16.04.2 LTS, if that helps. I imagine perl would be a likely option here, or some shell script ... but I don't know these things, which is why I'm asking!


For the curious, the lines in my input are: h = Hebrew; r = Revised Standard Version; g = Greek Septuagint; b = Brenton translation of the Septuagint; in each case followed by a verse number.

  • So what about spaces? Also, it would be pretty straightforward to only count characters on lines starting with h1:, h2:, etc.
    – Stephen Rauch
    Commented Jun 20, 2017 at 14:01
  • I'd use a Perl one-liner to remove all Unicode chars except those in your range (see e.g. here for how to use it as a tr substitute, and man perlre), then count the remaining chars.
    – dirkt
    Commented Jun 20, 2017 at 14:05
  • @StephenRauch - Yes, whitespace would be a bit of a pain. Fortunately, all I'm after is the "printable" Hebrew characters. The h1: prefix is simply a quirk of this input file; hopefully any solution will rely on recognizing the unicode range, not my random file convention. ;)
    – Dɑvïd
    Commented Jun 20, 2017 at 14:44
  • "Count" as in figure out how many distinct characters are, or their relative distribution; or just how many glyphs in this character range the file contains (basically the length of the text after you have removed all characters outside the desired range)?
    – tripleee
    Commented Jun 20, 2017 at 17:35
  • @tripleee - Your third option (appropriately, given your username ;) = "how many glyphs in this character range the file contains". I've now tweaked the question to (hopefully!) make that clear.
    – Dɑvïd
    Commented Jun 20, 2017 at 18:23

3 Answers


There is potentially an issue with determining the length of Unicode strings. See this page from Twitter's developer docs for more details on normalization.
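To make that concrete: the same visible glyph can be stored precomposed or decomposed, and naive length counts differ until you normalize. A minimal Python sketch (the accented-Latin example is purely illustrative, not taken from the sample file):

import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))                 # 1 2 -- same glyph, different code-point counts
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 after NFC normalization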

The character count will depend on the locale you have configured. You can run locale to verify that you have a UTF-8 locale configured. Once this is done, the code from @stephen-rauch should work.

Depending on which regex library you use, you might also be able to use named scripts like \p{Hebrew} and \p{Greek}. Here is an example of using \P{Hebrew} to remove all non-Hebrew characters: Link
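For a rough idea of what that looks like in Python: the standard-library re module does not understand \p{...}, but the third-party regex module does. A minimal sketch (the sample string below is just an illustration, not the question's input file):

import regex  # third-party module, not the standard-library re

text = "הללו את יהוה Praise the LORD Αλληλουια"
hebrew_only = regex.sub(r'\P{Hebrew}', '', text)  # \P{Hebrew} matches every non-Hebrew character
print(len(hebrew_only))                           # 10 -- only the Hebrew letters remain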

Edited: initial results were due to a mis-configured locale.

  • @Dɑvïd It looks like the output of wc will depend on your locale, I will update my answer to reflect this.
    – David Six
    Commented Jun 20, 2017 at 15:30

These seem to come close for me (tested on Ubuntu 16.04):

$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\x{0590}-\x{05ff}//,"\n"' input
62
$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\p{Hebrew}//,"\n"' input
63

I'm not sure what the "right" answer should be.

  • The right answer (if I've counted correctly) is 62 -- I added it to the question. I wonder what \p{Hebrew} picks up that the range itself doesn't? Anyway -- thanks!
    – Dɑvïd
    Commented Jun 20, 2017 at 19:28

Using Python you can do something like this:

Code:

# coding: utf-8
import re
import codecs

#find_hebrew = re.compile(ur'[\u0590-\u05ff]+')  # python 2
find_hebrew = re.compile(r'[\u0590-\u05ff]+')    # python 3

count = 0
with codecs.open('text_file', 'rU', encoding='utf-8') as f:
    for line in f.readlines():
        for n in find_hebrew.findall(line):
            count += len(n)

print(count)

Result:

62 
  • I've done a small tweak (made it possible to pass the input filename as an argument), added a shebang and commented usage notes, and saved it as a gist. If my tweaks could use tweaking, do tell! ;) Thanks again.
    – Dɑvïd
    Commented Jun 21, 2017 at 13:50
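For the next reader, that tweak might look roughly like the sketch below; the script name count_hebrew.py is hypothetical and the actual gist may differ.

#!/usr/bin/env python3
# count_hebrew.py (hypothetical name): count Hebrew-block characters
# in the file named on the command line, e.g.  ./count_hebrew.py text_file
import re
import sys

find_hebrew = re.compile(r'[\u0590-\u05ff]+')

count = 0
with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        for match in find_hebrew.findall(line):
            count += len(match)

print(count)

Run against the sample file above, it should print 62, matching the original answer.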
