This article is based on Lucene in Action, Second Edition, published on October, 2010. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program,) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]
Using the Built-in Analyzers
Analysis, in Lucene, is the process of converting field text into its most fundamental indexed representation, terms. These terms are used to determine what documents match a query during searching. For example, if you indexed this sentence in a field the terms might start with for and example, and so on, as separate terms in sequence. An analyzer is an encapsulation of the analysis process. An analyzer tokenizes text by performing any number of operations on it, which could include extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into the basic form (lemmatization). This process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens. Tokens, combined with their associated field name, are terms.
Lucene‘s primary goal is to facilitate information retrieval. The emphasis on retrieval is important. You want to throw gobs of text at Lucene and have them be richly searchable by the individual words within that text. In order for Lucene to know what “words” are, it analyzes the text during indexing, extracting it into terms. These terms are the primitive building blocks for searching.
Choosing the right analyzer is a crucial development decision with Lucene, and one size definitely doesn’t fit all. Language is one factor, because each has its own unique features. Another factor to consider is the domain of the text being analyzed; different industries have different terminology, acronyms, and abbreviations that may deserve attention. No single analyzer will suffice for all situations. It’s possible that none of the built-in analysis options are adequate for your needs, and you’ll have to invest in creating a custom analysis solution; fortunately, Lucene‘s building blocks make this quite easy.
Lucene includes several built-in analyzers, created by chaining together certain combinations of the built-in Tokenizers and TokenFilters. The primary ones are shown in table 1.
StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes special words called stop words. Stop words are words that are very common, such as the, and thus assumed to carry very little standalone meaning for searching since nearly every document will contain the word.
Embedded in StopAnalyzer is the following set of common English stop words, defined as ENGLISH_STOP_WORDS_SET. This default set is used unless otherwise specified:[code] "a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it", "no", "not", "of", "on",
"or", "such","that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"[/code]
The StopAnalyzer has a second constructor that allows you to pass your own set instead. Under the hood, StopAnalyzer creates a StopFilter to perform the filtering.
StandardAnalyzer holds the honor as the most generally useful built-in analyzer. A JFlex-based grammar underlies it, tokenizing with cleverness for the following lexical types: alphanumerics, acronyms, company names, email addresses, computer hostnames, numbers, words with an interior apostrophe, serial numbers, IP addresses, and Chinese and Japanese characters. StandardAnalyzer also includes stop-word removal, using the same mechanism as the StopAnalyzer (identical default English set, and an optional Set constructor to override). StandardAnalyzer makes a great first choice.
Which core analyzer should you use?
We’ve now seen the substantial differences in how each of the four core Lucene analyzers works. How do you choose the right one for your application? The answer may surprise you: most applications don’t use any of the built-in analyzers, and instead opt to create their own analyzer chain. For those applications that do use a core analyzer, StandardAnalyzer is likely the most common choice. The remaining core analyzers are usually far too simplistic for most applications, except perhaps for specific use cases (for example, a field that contains a list of part numbers might use Whitespace-Analyzer). But these analyzers are great for test cases and are indeed used heavily by Lucene‘s unit tests.
Typically an application has specific needs, such as customizing the stop-words list, performing special tokenization for application-specific tokens like part numbers or for synonym expansion, preserving case for certain tokens, or choosing a specific stemming algorithm. In fact, Solr makes it trivial to create your own analysis chain by expressing the chain directly as XML in solrconfig.xml.
Analysis, while only a single facet of using Lucene, is the aspect that deserves the most attention and effort. The words that can be searched are those emitted during indexing analysis: nothing more, nothing less. Sure, using StandardAnalyzer may do the trick for your needs, and it suffices for many applications. But it’s important to understand the analysis process. Users who take analysis for granted often run into confusion later when they try to understand why searching for “to be or not to be” returns no results (perhaps due to stop-word removal).