lucene-solr-user mailing list archives

From Daniel Collins <danwcoll...@gmail.com>
Subject Re: About Query Parser
Date Fri, 20 Jun 2014 10:25:57 GMT
Alexandre's response is very thorough, so I confess I'm really simplifying
things, but here's my "query parsers for dummies". :)

In terms of inputs/outputs, a QueryParser takes a string (generally assumed
to be "human generated", i.e. something a user might type in: a sentence, a
set of words, the format can vary) and outputs a Lucene Query object (
http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/search/Query.html),
which is in fact a kind of "tree" (again, I'm simplifying, I know) since a
query can contain nested expressions.
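
To make the "tree" idea concrete, here is a minimal sketch in Python. It is
a toy model only, not Lucene's API: the Term and Bool classes and the
render function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    """Leaf node: a single term in a field (hypothetical toy class)."""
    field: str
    text: str


@dataclass(frozen=True)
class Bool:
    """Inner node: combines child nodes with "AND" or "OR" (hypothetical)."""
    op: str
    clauses: Tuple["Node", ...]


Node = Union[Term, Bool]


def render(node: Node) -> str:
    """Walk the tree recursively and print it with explicit parentheses."""
    if isinstance(node, Term):
        return f"{node.field}:{node.text}"
    joiner = f" {node.op} "
    return "(" + joiner.join(render(c) for c in node.clauses) + ")"


# A nested expression: one Bool node contains another Bool node.
tree = Bool("AND", (
    Term("title", "solr"),
    Bool("OR", (Term("body", "parser"), Term("body", "analyzer"))),
))
print(render(tree))
```

The nesting of one Bool inside another is what makes the structure a tree
rather than a flat list of words.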

So, very loosely, it's a translator from a human-generated query into the
structure that Lucene can handle.  There are several different query
parsers because they use different input syntaxes and different ways of
handling particular constructs (to express A and B, should the user type
"+A +B" or "A and B" or just "A B", for example), and they have different
levels of support for the various Query structures that Lucene can handle:
SpanQuery, FuzzyQuery, PhraseQuery, etc.
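
As a rough illustration of how two syntaxes can map onto the same
structure, here is a toy sketch in Python. Both functions and the
"must"/"should" dictionary are hypothetical and deliberately simplistic;
they do not reproduce the behaviour of any real Solr or Lucene parser.

```python
def parse_plus_syntax(text):
    """Toy parser: a leading '+' marks a term as required."""
    tokens = text.split()
    required = [t[1:] for t in tokens if t.startswith("+") and len(t) > 1]
    optional = [t for t in tokens if not t.startswith("+")]
    return {"must": required, "should": optional}


def parse_keyword_syntax(text):
    """Toy parser: if the word 'and' appears, every other term is required;
    otherwise all terms are optional. (A real parser would treat 'and' as a
    binary operator; this is only a sketch.)"""
    tokens = text.split()
    if any(t.lower() == "and" for t in tokens):
        return {"must": [t for t in tokens if t.lower() != "and"], "should": []}
    return {"must": [], "should": tokens}


# Different input strings, same resulting structure.
print(parse_plus_syntax("+A +B"))
print(parse_keyword_syntax("A and B"))
```

The point is only that the input syntax is a property of the parser, while
the output structure is shared.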

We, for example, use an XML-based query parser.  Why, you might well ask?
Well, we already had a well-used and supported query syntax of our own,
which our users understood, so we couldn't use an off-the-shelf query
parser.  We could have built our own in Java, but for a variety of reasons
we parse our queries in a C++-based front-end system that sits ahead of
Solr, so we needed an interim format to pass queries to Solr that was as
near to a Lucene Query object as we could get (and there was an existing
XML parser to save us starting from square one!).
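
To give a feel for what such an interim format might look like, here is a
small Python sketch. The element names (and, or, term) and the field
attribute are invented for this example; they are not our actual format and
not the schema of Lucene's XML query parser.

```python
import xml.etree.ElementTree as ET


def xml_to_query(elem):
    """Convert a hypothetical XML query element into a nested tuple tree."""
    if elem.tag == "term":
        return ("term", elem.get("field"), (elem.text or "").strip())
    if elem.tag in ("and", "or"):
        # Recurse into child elements; nesting in XML becomes nesting in the tree.
        return (elem.tag, [xml_to_query(child) for child in elem])
    raise ValueError(f"unknown element: {elem.tag}")


doc = """
<and>
  <term field="title">solr</term>
  <or>
    <term field="body">parser</term>
    <term field="body">analyzer</term>
  </or>
</and>
"""

query = xml_to_query(ET.fromstring(doc.strip()))
print(query)
```

Because the XML already mirrors the tree shape, the receiving side only
needs a simple recursive walk rather than a grammar for user-typed text.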

As part of that Query construction (but independent of which QueryParser
you use), Solr will also make use of a set of Tokenizers and Filters (
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters),
but that's more to do with handling the terms in the query (so in my
examples above: is A a real word, does it need stemming, lowercasing,
removing because it's a stopword, etc.).
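
The tokenizer-plus-filters idea can be sketched as a small pipeline in
Python. Everything here is a toy assumption: the stopword list, the
whitespace tokenizer and the "strip a trailing s" stemmer are invented for
illustration and are far cruder than Solr's real analysis components.

```python
# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"a", "an", "and", "the", "of"}


def tokenize(text):
    """Split on whitespace (real tokenizers are much smarter)."""
    return text.split()


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(tokens):
    """Crude stand-in for stemming: drop a trailing 's' on longer words."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stopwords, naive_stem):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Parsers of Solr"))
```

Note that the order of the filters matters: lowercasing has to run before
stopword removal here, otherwise "The" would not match "the" in the list.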
