
I have mixed-language text files, and I would like to count the total number of printable characters belonging to one of the languages. It helps that the languages occupy different Unicode ranges.

My specific use-case involves Hebrew, Polytonic Greek, and English -- but I imagine a solution to this problem could be generalized for other contexts, too.

I would like to count the Hebrew characters only -- that's Unicode [\u0590-\u05ff]. Here's a brief sample input file (which, by my manual count, contains 62 Hebrew characters):

[ Ps117 ]‬
h1: ‫  הללו את יהוה כל גוים שבחוהו כל האמים ‬
r1: Praise the LORD, all nations! Extol him, all peoples!
g1: Αλληλουια. Αἰνεῖτε τὸν κύριον, πάντα τὰ ἔθνη, ἐπαινέσατε αὐτόν, πάντες οἱ λαοί,
b1: Alleluia. Praise the Lord all you nations: praise him all you peoples.
h2: ‫  כי גבר עלינו חסדו ואמת יהוה לעולם הללו יה ‬
r2: For great is his steadfast love toward us; and the faithfulness of the LORD endures for ever. Praise the LORD!
g2: ὅτι ἐκραταιώθη τὸ ἔλεος αὐτοῦ ἐφ' ἡμᾶς, καὶ ἡ ἀλήθεια τοῦ κυρίου μένει εἰς τὸν αἰῶνα.
b2: For his mercy has been abundant toward us: and the truth of the Lord endures for ever.

I'm on Ubuntu 16.04.2 LTS, if that helps. I imagine perl would be a likely option here, or some shell script ... but I don't know these things, which is why I'm asking!


For the curious, the lines in my input are: h = Hebrew; r = Revised Standard Version; g = Greek Septuagint; b = Brenton translation of the Septuagint; in each case followed by a verse number.

  • So what about spaces? Also, it would be pretty straightforward to only count characters on lines starting with h1:, h2:, etc.
    – Stephen Rauch
    Commented Jun 20, 2017 at 14:01
  • I'd use a Perl one-liner to remove all Unicode chars except those in your range (see e.g. here for how to use it as a tr substitute, and man perlre), then count the remaining chars.
    – dirkt
    Commented Jun 20, 2017 at 14:05
  • @StephenRauch - Yes, whitespace would be a bit of a pain. Fortunately, all I'm after is the "printable" Hebrew characters. The h1: prefix is simply a quirk of this input file; hopefully any solution will rely on recognizing the unicode range, not my random file convention. ;)
    – Dɑvïd
    Commented Jun 20, 2017 at 14:44
  • "Count" as in figure out how many distinct characters are, or their relative distribution; or just how many glyphs in this character range the file contains (basically the length of the text after you have removed all characters outside the desired range)?
    – tripleee
    Commented Jun 20, 2017 at 17:35
  • @tripleee - Your third option (appropriately, given your username ;) = "how many glyphs in this character range the file contains". I've now tweaked the question to (hopefully!) make that clear.
    – Dɑvïd
    Commented Jun 20, 2017 at 18:23

3 Answers


There is potentially an issue with determining the length of Unicode strings. See this page from Twitter's developer docs for more details on normalization.
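To make that concrete: the same visible glyph can be stored precomposed or decomposed, and naive length counts differ until you normalize. A minimal Python sketch (the accented-Latin example is purely illustrative, not taken from the sample file):

import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))                 # 1 2 -- same glyph, different code-point counts
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 after NFC normalization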

The character count will depend on the locale you have configured. You can run locale to verify that you have a UTF-8 locale configured. Once this is done, the code from @stephen-rauch should work.

Depending on which regex library you use, you might also be able to use named scripts like \p{Hebrew} and \p{Greek}. Here is an example of using \P{Hebrew} to remove all non-Hebrew characters: Link
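For a rough idea of what that looks like in Python: the standard-library re module does not understand \p{...}, but the third-party regex module does. A minimal sketch (the sample string below is just an illustration, not the question's input file):

import regex  # third-party module, not the standard-library re

text = "הללו את יהוה Praise the LORD Αλληλουια"
hebrew_only = regex.sub(r'\P{Hebrew}', '', text)  # \P{Hebrew} matches every non-Hebrew character
print(len(hebrew_only))                           # 10 -- only the Hebrew letters remain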

Edited: initial results were due to a mis-configured locale.

  • @Dɑvïd It looks like the output of wc will depend on your locale, I will update my answer to reflect this.
    – David Six
    Commented Jun 20, 2017 at 15:30

These seem to come close for me (tested on Ubuntu 16.04):

$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\x{0590}-\x{05ff}//,"\n"' input
62
$ perl -0777 -MEncode -ne 'print decode("UTF-8",$_) =~ tr/\p{Hebrew}//,"\n"' input
63

I'm not sure what the "right" answer should be.

  • The right answer (if I've counted correctly) is 62 -- I added it to the question. I wonder what \p{Hebrew} picks up that the range itself doesn't? Anyway -- thanks!
    – Dɑvïd
    Commented Jun 20, 2017 at 19:28

Using Python you can do something like this:

Code:

# coding: utf-8
import re
import codecs

#find_hebrew = re.compile(ur'[\u0590-\u05ff]+')  # python 2
find_hebrew = re.compile(r'[\u0590-\u05ff]+')    # python 3

count = 0
with codecs.open('text_file', 'rU', encoding='utf-8') as f:
    for line in f.readlines():
        for n in find_hebrew.findall(line):
            count += len(n)

print(count)

Result:

62 
  • I've done a small tweak (made it possible to pass the input filename as an argument), added a shebang and commented usage notes, and saved it as a gist. If my tweaks could use tweaking, do tell! ;) Thanks again.
    – Dɑvïd
    Commented Jun 21, 2017 at 13:50
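For the next reader, that tweak might look roughly like the sketch below; the script name count_hebrew.py is hypothetical and the actual gist may differ.

#!/usr/bin/env python3
# count_hebrew.py (hypothetical name): count Hebrew-block characters
# in the file named on the command line, e.g.  ./count_hebrew.py text_file
import re
import sys

find_hebrew = re.compile(r'[\u0590-\u05ff]+')

count = 0
with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        for match in find_hebrew.findall(line):
            count += len(match)

print(count)

Run against the sample file above, it should print 62, matching the original answer.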
