
Traineddata Files for Version 4.00+

We have three sets of official .traineddata files trained at Google, for tesseract versions 4.00 and above. These are made available in three separate repositories.

When using the traineddata files from the tessdata_best and tessdata_fast repositories, only the new LSTM-based OCR engine (`--oem 1`) is supported. The legacy Tesseract engine (`--oem 0`) is NOT supported with these files, so Tesseract's OEM modes 0 and 2 won't work with them.
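For example, an invocation with files from one of those repositories might look like the sketch below (the image name, output base, and tessdata path are placeholders):

```
# LSTM engine only; the data directory is a placeholder for wherever the
# tessdata_fast or tessdata_best files were downloaded.
tesseract image.png out -l eng --oem 1 --tessdata-dir /path/to/tessdata_fast

# --oem 0 and --oem 2 fail with these files, because they contain no legacy models.
```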

Special Data Files

| Lang Code | Description | 4.x/3.0x traineddata |
| --------- | ----------- | -------------------- |
| osd | Orientation and script detection | osd.traineddata |
| equ | Math / equation detection | equ.traineddata |

Note: These two data files are compatible with older versions of Tesseract. osd is compatible with version 3.01 and up, and equ is compatible with version 3.02 and up.
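
For example, orientation and script detection uses osd.traineddata when Tesseract is run in OSD-only page segmentation mode (the file names and tessdata path below are placeholders):

```
# --psm 0 performs orientation and script detection only and requires osd.traineddata.
tesseract page.png stdout --psm 0 --tessdata-dir /path/to/tessdata
```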

Updated Data Files (September 15, 2017)

We have three sets of .traineddata files on GitHub in three separate repositories. These are compatible with Tesseract 4.0x+ and 5.0.0 alpha.

| Repository | Trained models | Speed | Accuracy | Supports legacy | Retrainable |
| ---------- | -------------- | ----- | -------- | --------------- | ----------- |
| tessdata | Legacy + LSTM (integerized tessdata_best) | Faster than tessdata_best | Slightly less accurate than tessdata_best | Yes | No |
| tessdata_best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes |
| tessdata_fast | Integerized LSTM of a smaller network than tessdata_best | Fastest | Least accurate | No | No |

Most users will want tessdata_fast and that is what will be shipped as part of Linux distributions.

tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used for certain retraining scenarios for advanced users.

The third set, in tessdata, is the only one that supports the legacy recognizer. The 4.00 files from November 2016 have both legacy and older LSTM models. The current set of files in tessdata has the legacy models and newer LSTM models (integer versions of the 4.00.00 alpha models in tessdata_best).

Note: When using the new models in the tessdata_best and tessdata_fast repositories, only the new LSTM-based OCR engine is supported. The legacy engine is not supported with these files, so Tesseract's OEM modes 0 and 2 won't work with them.
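
One way to use these repositories is to download just the language files you need and point Tesseract at that directory. The sketch below assumes the tessdata_fast repository's default branch is named main and uses placeholder paths:

```
# Download a single fast English model (branch name is an assumption).
mkdir -p ~/tessdata_fast
wget -P ~/tessdata_fast https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata

# Point Tesseract at that directory, either via TESSDATA_PREFIX or --tessdata-dir.
export TESSDATA_PREFIX=~/tessdata_fast
tesseract image.png out -l eng --oem 1
```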

Data Files for Version 4.00 (November 29, 2016)

tessdata tagged 4.0.0 has the models from September 2017 that have been updated with integer versions of the tessdata_best LSTM models. This set of traineddata files has support for the legacy recognizer with `--oem 0` and for LSTM models with `--oem 1`.

tessdata tagged 4.00 has the models from 2016. The individual language files are linked in the table below.

Note: The kur data file was not updated from 3.04. For Fraktur, use the newer data files from the tessdata_fast or tessdata_best repositories.

| Lang Code | Language | 4.0 traineddata |
| --------- | -------- | --------------- |
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | German Fraktur | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |
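
The codes in this table are what the `-l` option expects; several languages can be combined with `+` when a page mixes languages or scripts (file names below are placeholders):

```
# Both eng.traineddata and deu.traineddata must be present in the tessdata directory.
tesseract scan.png out -l eng+deu
```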

Format of traineddata files

The traineddata file for each language is an archive file in a Tesseract-specific format. It contains several uncompressed component files which are needed by the Tesseract OCR process. The program combine_tessdata is used to create a traineddata file from the component files and can also extract them again, as in the following examples:

Pre-4.0.0 format from November 2016 (with both LSTM and legacy models)

```
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.version
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
11:cube-unicharset:size=1511, offset=2094568
12:cube-word-dawg:size=1062106, offset=2096079
13:shapetable:size=63346, offset=3158185
14:bigram-dawg:size=16109842, offset=3221531
17:lstm:size=5390718, offset=19331373
18:lstm-punc-dawg:size=4322, offset=24722091
19:lstm-word-dawg:size=7143578, offset=24726413
20:lstm-number-dawg:size=3530, offset=31869991
23:version:size=9, offset=31873521
```

4.00.00alpha LSTM-only format

```
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.lstm-unicharset
Wrote eng.lstm-recoder
Wrote eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
```
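
combine_tessdata can also work in the other direction. A minimal sketch, run in the directory containing the extracted eng.* components shown above:

```
# Repack the extracted components into a traineddata file (note the trailing dot).
combine_tessdata eng.

# Overwrite a single component inside an existing traineddata file.
combine_tessdata -o eng.traineddata eng.lstm-word-dawg

# List the components and their sizes without extracting anything.
combine_tessdata -d eng.traineddata
```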

Proposal for compressed traineddata files

There are proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A [discussion on the tesseract-dev forum](https://groups.google.com/forum/?hl=en#!searchin/tesseract-dev/zipsort:date/tesseract-dev/U5HSugUeeeI) proposed the ZIP format as early as 2014. In 2017, an experimental implementation was provided as a pull request.