This article is based on Tika in Action, to be published in summer 2011. It is reproduced here by permission from Manning Publications.
Sounds Greek to Me—Theory of Language Detection
The ability to consistently name and classify things is essential for fully understanding them. There are thousands of languages in the world, many with multiple dialects or regional variants. Some of the languages are extinct and some are artificial. Some don’t even have names in English! Others, like Chinese, have names whose specific meaning is highly context-sensitive. A standard taxonomy that can name and classify all languages is needed so that information systems can reliably store and process information about languages.
There are a number of different increasingly detailed systems for categorizing and naming languages, their dialects and other variants. For example, according to the RFC 5646: Tags for Identifying Languages standard, you could use de-CH-1996 to identify the form of German used in Switzerland after the spelling reform of 1996. Luckily there aren’t many practical applications where such detail is necessary or even desirable, so we’ll be focusing on just the de part of this identifier.
The RFC 5646 standard mentioned above leverages ISO 639 just like most of the other formal language taxonomies. ISO 639 is a set of standards defined by the International Organization for Standardization (ISO). The ISO 639 standards define a set of two- and three-letter language codes like the de code for German we encountered above. The two-letter codes that are most commonly used are defined in the ISO 639-1 standard. There are currently 184 registered two-letter language codes, and they represent most of the major languages in the world. The three-letter codes defined in the other ISO 639 standards are used mostly for more detailed representation of language variants and also for minor or even extinct languages.
The full list of ISO 639-1 codes is available from http://www.loc.gov/standards/iso639-2/ along with the larger lists of ISO 639-2 codes. Tika can detect 18 of the 184 currently registered ISO 639-1 languages. The codes of these supported languages are listed below.
After detecting the language of a document, Tika will use the above ISO 639-1 codes to identify the detected language. But how do we get to that point? Let’s find out!
Detecting the language of a document typically involves constructing a language profile of the document text and comparing that profile with those of known languages. The structure and contents of a language profile depend heavily on the detection algorithm, but a profile usually consists of a statistical summary of some relevant features of the text. You’ll learn here about the profiling process and different profiling algorithms.
Usually the profile of a known language is constructed in the same way as that of the text whose language is being detected. The only difference is that the language of the text set used to build it, called a corpus, is known in advance. For example, one could use the combined works of Shakespeare to create a profile for detecting his plays, those of his contemporaries or modern works that mimic the Shakespearean style. Should you come across an old-looking book like the one shown in figure 1, you could use the Shakespearean profile to test whether the contents of the book match its looks. Of course, such a profile would be less effective at matching the English language as it is used today.
A key challenge for the developers of language detection and other natural language processing tools is therefore to find a good corpus that accurately and fairly represents the different ways a language is used. Usually, the bigger the corpus, the better. Common sources of such sets of text are books, magazines and newspapers, official documents, and so on. Some corpora are also based on transcripts of spoken language from TV and radio programs. And, of course, the Internet is quickly becoming an important source, even though much of the text there is poorly categorized or labeled.
Once you have profiled the corpus of a language, you can use that profile to detect other texts that exhibit similar features. The better your profiling algorithm is, the better those features match the features of the language in general rather than just those of your corpus. The result of the profile comparison typically is a distance measure that indicates how close or how far the two profiles are from each other.
The language whose profile is closest to that of the candidate text is also most likely the language in which that text is written. The distance can also be a percentage estimate of how likely it is for the text to be written in a given language.
You’re probably already wondering about what these profiling algorithms look like. It’s time to find out!
The most obvious way to detect the language used in a piece of text is to look up its words in dictionaries of different languages. If the majority of words in a given piece of text can be found in the dictionary of some language, it’s quite likely that the text is indeed written in that language. Even a relatively small dictionary of the most commonly used words of a language is often good enough for such language detection. You could even get reasonably accurate results with just the word “the” for detecting English, the words “le” and “la” for French, and “der”, “die” and “das” for German!
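The common-word idea can be sketched in a few lines of Python. This is only an illustration: the word lists are the handful of examples mentioned above, and the names COMMON_WORDS, count_common_words and guess_language are made up for this sketch rather than taken from any library.

```python
import re

# Tiny word lists taken from the examples in the text (an assumption for
# illustration; a usable detector would need far larger lists).
COMMON_WORDS = {
    "en": {"the"},
    "fr": {"le", "la"},
    "de": {"der", "die", "das"},
}

def count_common_words(text):
    """Count how many words of the text appear in each language's list."""
    words = re.findall(r"\w+", text.lower())
    return {lang: sum(1 for word in words if word in vocabulary)
            for lang, vocabulary in COMMON_WORDS.items()}

def guess_language(text):
    """Return the language with the most hits, or None if nothing matched."""
    counts = count_common_words(text)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

With these lists, guess_language("The cat sat on the mat") picks English, while a text that contains none of the listed words yields None, which hints at how fragile the approach is for short inputs.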
Such a list of common words is probably the simplest reasonably effective language profile. It could be further improved by associating each word with its relative frequency and calculating the distance of two profiles as the sum of differences between the frequencies of matching words. Another advantage of this improvement is that it allows the same profiling algorithm to be used to easily generate a language profile from a selected corpus instead of having to use a dictionary or other explicit list of common words.
Alas, the main problem with such an algorithm is that it’s not very effective at matching short texts like single sentences or just a few words. It also depends on how word boundaries are detected, which may be troublesome for languages like German, with lots of compound words, or Chinese and Japanese, where no whitespace or other extra punctuation is typically used to separate words. Finally, it has big problems with agglutinative languages like Finnish or Korean, where most words are formed by composing smaller units of meaning. For example, the Finnish words “kotona” and “kotoa” mean “at home” and “from home” respectively, which makes counting common words like “at”, “from” or even “home” somewhat futile.
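A frequency-based word profile and its distance measure might look like the following Python sketch. The function names are hypothetical, and one detail is an assumption of this sketch: a word missing from one profile is treated as having frequency zero there, so the sum runs over the words of both profiles and ranges from 0 (identical) to 2 (nothing in common).

```python
import re
from collections import Counter

def word_profile(text):
    """Map each word of the text to its relative frequency."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}

def profile_distance(a, b):
    """Sum of absolute frequency differences over the words of both profiles.

    Assumption of this sketch: a word absent from a profile counts as 0.0.
    """
    return sum(abs(a.get(word, 0.0) - b.get(word, 0.0))
               for word in set(a) | set(b))
```

The same word_profile function serves both sides of the comparison: call it on a corpus to get the profile of a known language and on the candidate text to get the profile to be matched.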
Given these difficulties, how about looking at individual characters or character groups instead?
The N-gram algorithm
The profiling algorithm based on word frequencies can just as easily be applied on individual characters. In fact, this even makes the algorithm simpler since, instead of a potentially infinite number of distinct words, you only need to track a finite number of characters. And, it turns out that character frequencies really do depend on the language, as shown in figure 2.
Obviously, this algorithm works even better with the many Asian languages whose characters are used in only one or just a handful of languages. However, it has the same problem as the word-based one in that it needs quite a bit of text for an accurate match. Interestingly enough, the problem here is the opposite of that with words. Where a short sentence may not contain any of the most common words of a language, it’s practically guaranteed to contain plenty of the common characters. Instead, the problem is that there simply isn’t enough material to differentiate between the languages with similar character frequencies.
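Counting characters instead of words takes very little code. The sketch below uses a hypothetical char_profile function and assumes that case and non-letter characters can be ignored, which is a simplification chosen for this example.

```python
from collections import Counter

def char_profile(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {ch: n / total for ch, n in counts.items()} if total else {}
```

A profile like this has at most as many entries as the script has letters, which is why two languages sharing an alphabet can be hard to tell apart from a short sample.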
This detail hints at an interesting approach that turns out to be quite useful in language detection. Instead of looking at individual words or characters, we could look at character sequences of a given length. Such sequences are called 2-, 3-, 4-grams or, more generally, N-grams based on the sequence length. For example, the 3-grams of a word like “hello” would be “hel”, “ell” and “llo”, plus “_he” and “lo_” when counting word boundaries as separate characters.
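Extracting the N-grams of a word can be sketched as follows. The ngrams function and its parameters are names invented for this example; it pads the word with one boundary marker on each side, as in the “hello” example above.

```python
def ngrams(word, n=3, boundary="_"):
    """Return the character n-grams of a word, in order.

    The word is padded with one boundary marker at each end so that
    word starts and ends show up as n-grams of their own.
    """
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For “hello” this yields “_he”, “hel”, “ell”, “llo” and “lo_”, in that order.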
It turns out that N-grams are highly effective at isolating the essential features of at least most European languages. They nicely avoid problems with compound words or the oddities of languages like Finnish. And, they still provide statistically significant matches even for relatively short texts. Tika opts to use 3-grams as that seems to offer the best tradeoff of features in most practical cases.
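To make the idea concrete, here is a toy 3-gram detector in Python. It is not Tika’s implementation: the function names, the distance measure (sum of absolute frequency differences, with missing 3-grams counted as zero) and the one-sentence corpora are all assumptions made for this sketch, and corpora this small are useless for real detection.

```python
import re
from collections import Counter

def trigram_profile(text):
    """Relative frequencies of the 3-grams of every word in the text."""
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):  # runs of letters
        padded = "_" + word + "_"  # mark word boundaries
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}

def distance(a, b):
    """Sum of absolute frequency differences; 0 is identical, 2 is disjoint."""
    return sum(abs(a.get(g, 0.0) - b.get(g, 0.0)) for g in set(a) | set(b))

def detect(text, profiles):
    """Return the language whose profile is closest to that of the text."""
    candidate = trigram_profile(text)
    return min(profiles, key=lambda lang: distance(candidate, profiles[lang]))

# Toy corpora, far too small for real use (illustration only).
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt auf den faulen hund und die katze",
}
PROFILES = {lang: trigram_profile(text) for lang, text in CORPORA.items()}
```

Calling detect("the dog", PROFILES) returns "en" and detect("der hund", PROFILES) returns "de". Note that the second text still shares the 3-grams “er_” and “nd_” with the English corpus, so the decision rests on how much of each profile overlaps rather than on any single feature.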
Advanced profiling algorithms
There are other more advanced language profiling algorithms out there, but few match the simplicity and efficiency of the N-gram method described above. Typically such algorithms target specific features like maximum coverage of different kinds of languages, the ability to accurately detect the language of very short texts or the ability to detect multiple languages within a multi-lingual document.
Tika tracks developments in this area and may incorporate some new algorithms in its language detection features in future releases, but for now N-grams are the main profiling algorithm used by Tika.
Even though natural language processing is a fiendishly complex subject that has been, and probably will remain, a topic of scientific research for decades, there are certain areas that are already useful in practical applications. Language detection is one of the simpler tasks of natural language processing and can, for the most part, be implemented with relatively simple statistical tools. Tika’s N-gram based language detection feature is one such implementation.