Natural Language Processing of Rabbinic Texts: Contexts, Challenges, Opportunities

The Talmud Blog is happy to continue our series on the interface of Digital Humanities and the study of Rabbinic Literature with a post by Marton Ribary of the University of Manchester.

I read Michael Satlow’s enthusiastic report on the Classical Philology Goes Digital Workshop with great pleasure. I am delighted to see the study of Rabbinic literature moving towards the use of digital tools, and especially Natural Language Processing (NLP) methods. Below I shall sketch the background of NLP methods applied to Rabbinic literature and what we can learn from projects concentrating on other classical linguistic data, notably Latin. I shall briefly discuss the enormous obstacles Rabbinic literature poses even compared to Latin, which means that our expectations of achieving meaningful results in standard 3-5 year research projects should be very moderate. Nevertheless, I shall argue that we should dream big and aim for courageous projects, accompanied by an outward-looking strategy, in order to attract big corporate money.

Apart from the Digital Mishnah Project mentioned in Satlow’s post, the Italian machine-assisted translation project of the Babylonian Talmud has also been experimenting with NLP methods. Additionally, the French LATTICE Research Unit at the Centre national de la recherche scientifique, led by Professor Thierry Poibeau, has recently launched “a project called LAKME (Linguistically Annotated Corpora Using Machine Learning Techniques) … [which] will explore new techniques for the annotation of textual corpora of morphology rich languages” including Rabbinic Hebrew. I know that Professor Poibeau was recruiting a new team member, though I am unfamiliar with the current situation of LAKME’s Rabbinic Hebrew side-project.

The NLP approach to Rabbinic literature has been informed by NLP methods applied to Modern Hebrew. The linguistic powerhouses of the University of Amsterdam and the Technion in Haifa, and especially scholars like Khalil Sima’an, Yoad Winter and Roy Bar-Haim, are the leading figures in this field. If my understanding is correct, the NLP methods for Modern Hebrew build on those developed for Arabic, and have produced results like the Morph-Tagger (an HMM-based part-of-speech tagger for Hebrew) and the MILA Morphological Disambiguation Tool – both hosted by the Technion. I have not seen much movement on these websites, so I am uncertain whether there have been any recent developments in this area.
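
For readers unfamiliar with how such taggers work: an HMM-based tagger picks, for each sentence, the tag sequence that is most probable given transition probabilities (which tag tends to follow which) and emission probabilities (which words a tag tends to produce), usually via the Viterbi algorithm. The toy sketch below illustrates the idea only; the probabilities are invented, and it has nothing to do with the actual implementation of the Morph-Tagger.

```python
# Toy illustration of HMM-based part-of-speech tagging via the Viterbi
# algorithm. All probabilities are invented for the sake of the example.
from math import log

TAGS = ["NOUN", "VERB"]

# P(tag_i | tag_{i-1}), with "START" as the initial state.
TRANSITIONS = {
    ("START", "NOUN"): 0.6, ("START", "VERB"): 0.4,
    ("NOUN", "NOUN"): 0.3,  ("NOUN", "VERB"): 0.7,
    ("VERB", "NOUN"): 0.8,  ("VERB", "VERB"): 0.2,
}

# P(word | tag) for the two words of our toy sentence.
EMISSIONS = {
    ("NOUN", "ספר"): 0.5, ("VERB", "ספר"): 0.5,   # ambiguous consonantal form
    ("NOUN", "רב"): 0.9,  ("VERB", "רב"): 0.1,
}

def viterbi(words):
    """Return the most probable (word, tag) sequence under the toy HMM."""
    # trellis[i][tag] = (best log-probability of ending in `tag` at word i, backpointer)
    trellis = [{}]
    for tag in TAGS:
        score = log(TRANSITIONS[("START", tag)]) + log(EMISSIONS[(tag, words[0])])
        trellis[0][tag] = (score, None)
    for i in range(1, len(words)):
        trellis.append({})
        for tag in TAGS:
            best_prev, best_score = max(
                ((prev,
                  trellis[i - 1][prev][0]
                  + log(TRANSITIONS[(prev, tag)])
                  + log(EMISSIONS[(tag, words[i])]))
                 for prev in TAGS),
                key=lambda pair: pair[1],
            )
            trellis[i][tag] = (best_score, best_prev)
    # Trace the best path back from the final word.
    best_last = max(TAGS, key=lambda tag: trellis[-1][tag][0])
    path = [best_last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, trellis[i][path[0]][1])
    return list(zip(words, path))

print(viterbi(["רב", "ספר"]))   # -> [('רב', 'NOUN'), ('ספר', 'VERB')]
```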

One thing is certain: NLP for Hebrew (Modern as well as Rabbinic) has much to improve before it can provide any useful results. The above-mentioned tools work relatively well for Modern Hebrew, but they still underperform significantly compared to the analytical success rates achieved for European languages. The tools are developed on contemporary linguistic data, where morphology, syntax and orthography are all fairly consistent. When the current tools are applied to, for example, a piece of Talmud text, the analytical success rate drops so low that it is better to tag and annotate the text manually.

This leads me to the biggest challenge, as well as the biggest opportunity, of NLP methods applied to Rabbinic texts. I entertained the idea of experimenting with NLP methods in my own research, which analyses grammatical and rhetorical structures as manifestations of abstract legal thinking in Latin Roman and Hebrew/Aramaic Rabbinic texts. Barbara McGillivray, currently a fellow at the Cambridge branch of the Alan Turing Institute, is one of the pioneers in developing NLP methods for historical texts. Her Methods in Latin Computational Linguistics (Brill, 2014) is an excellent introduction to the field and its inherent complications. Michael Piotrowski’s Natural Language Processing for Historical Texts (Morgan & Claypool, 2012) presents the field in more general terms. McGillivray’s Latin corpus brings together and harmonises earlier attempts according to the NLP guidelines of Sketch Engine, developed by the late Adam Kilgarriff. As McGillivray explains, “the texts have been collected from the LacusCurtius, Intratext and Musisque Deoque websites. The texts have been lemmatised with Dag Haug’s Latin morphological analyser and Quick Latin; the texts were then part-of-speech tagged with TreeTagger, trained on the Index Thomisticus Treebank, the Latin Dependency Treebank and the Latin treebank of the Proiel Project.”

McGillivray and Piotrowski both emphasise the enormous challenges historical texts pose to NLP analysis. Texts are collected from a vast geographical area, and sometimes from a period stretching over millennia. Consequently, the linguistic data is extremely inconsistent in terms of morphology, syntax and orthography. Due to the limited availability of historical texts, the databanks are also much smaller than those of contemporary languages.
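
The quoted workflow – collect the texts, lemmatise them, then part-of-speech tag them with a treebank-trained tagger – can be sketched schematically as follows. The function names and the toy dictionaries below are hypothetical placeholders invented for illustration; they are not the interfaces of the actual tools (Quick Latin, TreeTagger and so on).

```python
# Schematic collect -> lemmatise -> tag pipeline; every function here is a
# hypothetical stand-in, not a real tool.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    surface: str          # the word form as it appears in the source text
    lemma: Optional[str]  # dictionary head-word proposed by the lemmatiser
    pos: Optional[str]    # part-of-speech tag proposed by the tagger

def fetch_texts(sources):
    """Hypothetical placeholder: collect and clean raw texts from the source sites."""
    return ["arma virumque cano"]          # a one-line toy "corpus"

def lemmatise(word):
    """Hypothetical placeholder for a morphological analyser such as Quick Latin."""
    return {"arma": "arma", "virumque": "vir", "cano": "cano"}.get(word)

def pos_tag(word):
    """Hypothetical placeholder for a treebank-trained tagger such as TreeTagger."""
    return {"arma": "NOUN", "virumque": "NOUN", "cano": "VERB"}.get(word)

corpus = []
for text in fetch_texts(["LacusCurtius", "Intratext", "Musisque Deoque"]):
    corpus.append([Token(w, lemmatise(w), pos_tag(w)) for w in text.split()])

print(corpus[0])
```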

The first steps of NLP analysis can be traced back to Professor Satlow’s home institution, Brown University, where Henry Kučera and W. Nelson Francis put together a manually annotated English-language databank totalling one million words in the 1960s. The Brown Corpus became the gold standard for NLP-related projects in the English language, in syntax (e.g. part-of-speech tagging), semantics (e.g. machine translation), discourse (e.g. automatic summarization) and speech (e.g. speech recognition). The corpus is still widely used, though more modern corpora in excess of 100 million analysed words, such as the Corpus of Contemporary American English, the British National Corpus and the International Corpus of English, are taking over in research. My point is that even though machine-assisted corpus linguistic research into contemporary English is more than 50 years old and has created massive annotated corpora of fairly consistent linguistic data, and even though the field has been extremely well funded thanks to the corporate potential of its research output (automated translation, voice control etc.), results have still fallen well below the expectations of these ambitious research projects, arriving much more slowly and on a much smaller scale.
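
To give a sense of how accessible this gold standard still is: the Brown Corpus ships with NLTK’s data packages, and training a very simple part-of-speech tagger on it takes only a few lines of Python. The sketch below is meant to show the workflow rather than to be a serious experiment; accuracy will vary with the split, and older NLTK releases call the accuracy method evaluate.

```python
# Train and test a minimal tagger on the Brown Corpus (news section) with NLTK.
import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)        # fetch the corpus if not already present

tagged_sents = brown.tagged_sents(categories="news")
split = int(len(tagged_sents) * 0.9)
train, test = tagged_sents[:split], tagged_sents[split:]

# Back off from a unigram tagger to a default "NN" guess for unknown words.
tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))

print(tagger.accuracy(test))              # `tagger.evaluate(test)` in older NLTK releases
print(tagger.tag("The committee met on Friday".split()))
```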

This should serve as a warning for the study of Rabbinic texts, which pose incredible challenges to the researcher even compared to the already very problematic Latin. Rabbinic texts do not only constitute inconsistent linguistic data in which morphology, syntax and orthography are not standardised; these texts also mostly lack punctuation, often use a shorthand “lecture notes” style, mix different dialects and sometimes even different languages, and, worst of all, like other Semitic linguistic data, their visual presentation is more ambiguous due to the lack of vowels. The use of the mater lectionis (אֵם קְרִיאָה) could be helpful, but because its use is also inconsistent, it only complicates the matter further. These are potentially insurmountable problems for the machine-assisted corpus linguistic analysis of Rabbinic texts.
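
To illustrate the vowel problem for non-Hebraists: a single consonantal string can correspond to several entirely different words, and a tagger has to choose between them on the basis of context alone. The candidate readings in the toy sketch below are listed by hand for the sake of the example; they are not the output of any real morphological analyser.

```python
# Hand-written candidate readings for two unvocalised consonantal forms,
# showing the ambiguity a morphological disambiguator has to resolve.
ANALYSES = {
    "ספר": [
        ("sefer", "book", "noun"),
        ("sofer", "scribe / (he) counts", "noun/participle"),
        ("safar", "he counted", "verb, perfect"),
        ("sipper", "he told, recounted", "verb, pi'el perfect"),
        ("sappar", "barber", "noun"),
    ],
    "דבר": [
        ("davar", "word, thing", "noun"),
        ("dever", "plague", "noun"),
        ("dibber", "he spoke", "verb, pi'el perfect"),
    ],
}

def candidate_readings(consonantal_form):
    """Return every reading a disambiguator would have to choose between."""
    return ANALYSES.get(consonantal_form, [])

for form in ("ספר", "דבר"):
    print(form, "->", [reading for reading, _gloss, _category in candidate_readings(form)])
```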

As Satlow correctly points out, the first step is to create our databanks. The field has cutting-edge technology at its disposal, but if we compare ourselves to research into contemporary English, we are roughly where that field stood in the 1960s, when Kučera and Francis started to build the Brown Corpus. Unless there is a very unlikely breakthrough in Optical Character Recognition (OCR) techniques applied to manuscript materials, manuscript data will have to be recorded manually. The same applies to creating a gold standard of Rabbinic texts similar to the Brown Corpus. The corpus needs to be tagged and annotated manually, checked and double-checked, so that it can provide a solid foundation for any further research. At this point, I see very little chance that the job can be done without an enormous labour input, and I am somewhat sceptical about whether the field will be able to secure the funding such an effort requires. The NLP of Rabbinic texts would need the kind of corporate financial support that has been boosting research into contemporary English.
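
What might a single entry of such a manually checked gold standard look like? Below is one possible shape for an annotation record – an entirely hypothetical schema invented for illustration; a real project would define its own, most likely in TEI/XML rather than in Python.

```python
# A hypothetical gold-standard annotation record for a single token.
# Field names and values are illustrative only, not an existing standard.
gold_record = {
    "witness": "MS Kaufmann A50",         # which manuscript the reading comes from
    "location": "Mishnah Berakhot 1:1",   # canonical reference
    "token_index": 0,
    "surface": "מאימתי",                  # the form exactly as written in the witness
    "normalised": "מאימתי",               # orthographically normalised form
    "lemma": "אימתי",                     # dictionary head-word
    "pos": "ADV",                          # part-of-speech tag
    "language": "Rabbinic Hebrew",         # vs. Aramaic, Greek loanword, etc.
    "checked_by": ["annotator_1", "annotator_2"],   # double-checked, as argued above
}

print(gold_record["surface"], gold_record["lemma"], gold_record["pos"])
```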

However, I also see a potential breakthrough. If the field is able to formulate research projects that not only produce results for specialists in Rabbinic literature but also promise to develop new NLP methods capable of tackling inconsistent linguistic data, or even a mixture of non-standardised languages, then it may be able to attract the big corporate money it requires in these early decades. Concentrated and synchronised efforts by researchers working on different historical linguistic data may be able to secure the big money we need to get our projects under way. Researchers working on Hebrew, Aramaic, Arabic, Latin, Greek and other historical corpora need to cooperate in their efforts to find answers to their common questions.

The corporate benefits of such efforts are potentially huge. An example from personal experience: my partner, who is an Australian national of Chinese-Malay origin, uses a mix of “bad” (her word, not mine) Mandarin, Cantonese, Malay and English when she speaks to her family and friends back home. My cricketing friends from India, Pakistan and Bangladesh use a mix of non-standard Urdu, Punjabi, Hindi and English among themselves. I would not be surprised if something similar were true of the majority of the global population. Most of us on this planet mix languages within a single sentence, and potentially none of those languages follows the rules of our language textbooks. Methods developed while working on Rabbinic and other historical texts may provide invaluable tools for the analysis of mixed language use, and that is one place where big corporate money may come to the aid of our own little field.
