I have mixed-language text files and would like to count the total number of printable characters belonging to just one of the languages. It helps that the languages occupy different Unicode ranges.
My specific use-case involves Hebrew, Polytonic Greek, and English -- but I imagine a solution to this problem could be generalized for other contexts, too.
I would like to count the Hebrew characters only -- that's Unicode [\u0590-\u05ff]. Here's a brief sample input file (which, by my manual count, contains 62 Hebrew characters):
[ Ps117 ]
h1: הללו את יהוה כל גוים שבחוהו כל האמים
r1: Praise the LORD, all nations! Extol him, all peoples!
g1: Αλληλουια. Αἰνεῖτε τὸν κύριον, πάντα τὰ ἔθνη, ἐπαινέσατε αὐτόν, πάντες οἱ λαοί,
b1: Alleluia. Praise the Lord all you nations: praise him all you peoples.
h2: כי גבר עלינו חסדו ואמת יהוה לעולם הללו יה
r2: For great is his steadfast love toward us; and the faithfulness of the LORD endures for ever. Praise the LORD!
g2: ὅτι ἐκραταιώθη τὸ ἔλεος αὐτοῦ ἐφ' ἡμᾶς, καὶ ἡ ἀλήθεια τοῦ κυρίου μένει εἰς τὸν αἰῶνα.
b2: For his mercy has been abundant toward us: and the truth of the Lord endures for ever.
I'm on Ubuntu 16.04.2 LTS, if that helps. I imagine perl would be a likely option here, or some shell script ... but I don't know these things, which is why I'm asking!
For the curious, the lines in my input are: h = Hebrew; r = Revised Standard Version; g = Greek Septuagint; b = Brenton translation of Septuagint; in each case followed by a verse number.
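As a rough sketch of the count described above (keep only the characters whose code point falls in the target range, then count them), assuming Python 3 is available and the file is UTF-8 encoded. The names `count_in_range` and `count_in_file` are illustrative helpers, not part of any library, and the defaults simply encode the U+0590 to U+05FF range given in the question.

```python
HEBREW_START = 0x0590  # first code point of the Hebrew block (from the question)
HEBREW_END = 0x05FF    # last code point of the Hebrew block (from the question)


def count_in_range(text, start=HEBREW_START, end=HEBREW_END):
    """Count the characters of text whose code point lies in [start, end]."""
    return sum(1 for ch in text if start <= ord(ch) <= end)


def count_in_file(path, start=HEBREW_START, end=HEBREW_END):
    """Count matching characters in a file, assuming it is UTF-8 encoded."""
    with open(path, encoding="utf-8") as handle:
        return sum(count_in_range(line, start, end) for line in handle)


# Tiny demo on a fragment of the sample: 4 + 2 + 4 Hebrew letters.
sample = "h1: הללו את יהוה r1: Praise the LORD"
print(count_in_range(sample))  # 10
```

Because the test is on code points rather than on the line prefix, it does not depend on the file's h1:/h2: convention, and passing a different start and end should work for other scripts. One caveat: the whole Hebrew block also contains vowel points and cantillation marks, so for pointed text a narrower range such as U+05D0 to U+05EA (the letters alone) may be closer to a "printable character" count.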
h1:, h2:, etc. ... (tr substitute), (man perlre), then count remaining chars.
The h1: prefix is simply a quirk of this input file; hopefully any solution will rely on recognizing the Unicode range, not my random file convention. ;)
... "how many glyphs in this character range the file contains". I've now tweaked the question to (hopefully!) make that clear.