Using the Levenshtein algorithm for automatic lemmatization in Old English
Johnson, Bernadette Elaine
This study developed and tested an automatic lemmatization program for Old English that uses the Levenshtein edit distance algorithm, stemming, and other techniques to help overcome issues such as rampant spelling irregularity and the presence of inflectional endings. The primary goal was to create a lemma list for import into text analysis software, so that Old English words could be equated with their variants; this goal was met with limited but promising success. The main lemmatization program is written in Perl. Other Perl and XSLT scripts, Unix command-line tools, and the AntConc 3.2.0 corpus analysis software were used to extract text files from the Dictionary of Old English Corpus XML files, to generate sorted lists of all the words in the available texts for use by the main program, and to manipulate and analyze the data.
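To illustrate the core idea, the following is a minimal sketch of Levenshtein edit distance used to group spelling variants under a headword. It is written in Python for brevity and is not the study's Perl program. The function names, the distance threshold of 1, and the sample word forms are all assumptions chosen for illustration; the study's actual thresholds, stemming rules, and data are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def group_variants(headword, wordlist, max_distance=1):
    """Return words within max_distance edits of headword (hypothetical helper)."""
    return sorted(
        w for w in wordlist
        if w != headword and levenshtein(headword, w) <= max_distance
    )


# Illustrative word forms only; not taken from the study's data.
words = ["cyning", "cining", "cyninge", "cyninges", "scip"]
print(group_variants("cyning", words))  # ['cining', 'cyninge']
```

With a threshold of 1, the spelling variant "cining" and the inflected form "cyninge" are grouped with "cyning", while "cyninges" (distance 2) is missed. This suggests why a pure edit-distance match might be combined with stemming to handle longer inflectional endings.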