lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "MultitermQueryAnalysis" by ErickErickson
Date Thu, 24 Nov 2011 17:08:13 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "MultitermQueryAnalysis" page has been changed by ErickErickson:
http://wiki.apache.org/solr/MultitermQueryAnalysis

Comment:
Writeup for SOLR-2438 (new analysis chain for wildcards)

New page:
<!> [[Solr3.6]] <!> [[Solr4.0]]

One of the surprises for most Solr users is that wildcard queries have not gone through any
analysis. Practically, this means that wildcard (and prefix and range) queries are case sensitive,
which is at odds with expectations. As of [[https://issues.apache.org/jira/browse/SOLR-2438|SOLR-2438]],
this behavior has changed.

What's a multiterm, you ask? Essentially it's any term that may "point to" more than one real
term. For instance, run* could expand to runs, runner, running, runt, etc. Likewise, a range
query is really a "multiterm" query as well. Before Solr 3.6, these were completely unprocessed,
so the application layer usually had to apply any transformations required, for instance lower-casing
the input. Running these types of terms through a "normal" analysis chain leads to all sorts
of ''interesting'' behavior, so it was avoided.

<<TableOfContents>>

== New analyzer chain ==
The [[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup|schema.xml]]
file controls how fields are analyzed at index and query time. Now, you can optionally add
an <analyzer type="multiterm"/> analysis chain. Don't worry! If you don't, Solr tries
to "do the right thing" (see below). This is arguably an expert-level option, and '''''in most
cases the new default behavior should be fine''''', so don't define your own "multiterm" analysis
chain unless you really have a need. You can string together any of the available or custom
elements of a Solr analysis chain when defining this new chain.

=== Defining the chain ===
Just as there are two existing analyzer chains, "index" and "query" (e.g. <analyzer type="index"/>),
there is now a third, optional analyzer, <analyzer type="multiterm"/>. Just put it in
the <fieldType> as you would the "index" or "query" types.
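
As a minimal sketch of what such a definition might look like (the field type name "text_multiterm_example" and the particular tokenizers and filters are illustrative assumptions, not taken from the example schema):

```xml
<!-- Hypothetical field type; the name and the choice of filters are assumptions. -->
<fieldType name="text_multiterm_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Optional third chain, applied to wildcard, prefix and range terms.
       It must emit exactly one token per input term. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

KeywordTokenizerFactory is used here only because it passes the whole term through as a single token; any chain that yields one token per term should serve the same purpose.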

=== Why is this "arguably expert"? ===
Well, the assumption is that 95% of the time, all queries really need is lowercasing and,
perhaps, accent folding. With those cases and any CharFilters automatically applied, it seems
likely that defining your own multiterm analysis chain will not be necessary. If this assumption
is incorrect, feel free to raise a JIRA! All the assumptions in the world aren't worth a few
real-world tests.

If you define your own chain, be aware that Solr will throw an exception if the chain you've
defined evaluates to more than one term. Odd choice of phrase when it's called "multiterm",
but it ''operates'' on multi-term queries, it does not ''produce'' multiple terms. So, for
instance, if you define your own chain and put WordDelimiterFilterFactory in it and then send
camel-case terms with wildcards, Solr will throw an exception. Which is why the default case
(no "multiterm" analyzer defined) picks and chooses "safe" items from the Filters.
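
To make the failure case concrete, here is a sketch of a "multiterm" chain that, going by the description above, should be expected to fail (the exact attributes are illustrative assumptions):

```xml
<!-- Illustrative only: a chain that can split one input term into several tokens. -->
<analyzer type="multiterm">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Splits on case changes, so a camel-case term such as "PowerShot" becomes
       "Power" and "Shot": two tokens from a single multiterm term. -->
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

With a chain like this, a wildcard query such as PowerShot* would presumably trigger the exception, because the analyzed term no longer evaluates to a single token.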


== Auto-detection: A.K.A. "doing the right thing" ==
Solr tries to take the pain out of creating yet ''another'' filter by looking at what has
been defined and picking things out of that chain that "make sense". These are used to build
the "multiterm" analyzer without anything needing to be specified. So you don't have to do
''anything'' to get this capability.

During schema file analysis, if this analyzer type is not present, Solr will construct it from
the first of <analyzer type="query"> or <analyzer type="index"> or <analyzer>,
in that order. The new analyzer will consist of any Char Filters, a WhitespaceTokenizer, a
LowerCaseFilter (if present) and an ASCIIFoldingFilter (if present). This analyzer is then
applied to all multiterm terms. Note that the WhitespaceTokenizer is applied to the individual
term, not the entire input, so it is essentially a no-op.
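
As a rough illustration of these rules (the field type name and filter choices are assumptions; the second fragment is what the description above implies Solr would build internally, not something you need to write):

```xml
<!-- Hypothetical field type with no "multiterm" analyzer defined. -->
<fieldType name="text_autodetect_example" class="solr.TextField">
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Approximate equivalent of the synthesized chain: the char filter is kept,
     the tokenizer is replaced by a whitespace tokenizer, the lowercase filter is
     kept, and the stop filter is dropped. No ASCII folding filter is added
     because none is present in the source chain. -->
<analyzer type="multiterm">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```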

This is only synthesized if the version is >= 3.6, so this won't affect any existing installations
that upgrade.

=== Legacy behavior ===
If you want to use 3.6 or greater, but require the old behavior, you can specify 'legacyMultiTerm="true"'
in your <fieldType> statement (see the example schema). At this point, this is ''not''
something that can be specified in a <field> tag.
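
A minimal sketch of where the attribute goes (the field type name and analyzer contents are hypothetical):

```xml
<!-- legacyMultiTerm is set on the fieldType, not on individual fields. -->
<fieldType name="text_legacy_example" class="solr.TextField" legacyMultiTerm="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```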
