On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <jeff@globalorange.nl> wrote:
> Hi,
> We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word. For example a
> keyword might be "apple pie" and we only want it to match for a query
> containing that word pair, but not one only containing "apple". Here is
> the relevant piece of the schema.xml, defining the index and query
> pipelines:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" />
>   </analyzer>
> </fieldType>
>
> In the analysis tool this schema looks like it works correctly. Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and
> matched. Unfortunately, when we do the same queries on top of the actual
> index it responds with zero matches. I can see in the index histogram that
> the terms are correctly indexed from our mysql datasource containing the
> keywords, but somehow the shingling doesn't appear to work on this live
> data. Does anyone have experience with shingling that might have some tips
> for us, or otherwise advice for debugging the issue?
>
Query-time shingling probably isn't working with the query parser you are
using. The default Lucene one splits the query on whitespace before sending
it to the analyzer: for example, a query of foo bar is processed as
TokenStream(foo) + TokenStream(bar). Each call to the analyzer sees only a
single word, so the ShingleFilter never has two adjacent tokens to combine,
and query-time shingling like this doesn't work as you expect.
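
To make the difference concrete, here is a toy Python sketch. It is not
Lucene or Solr code: the function names and the sample query are made up
for illustration, and the shingler only assumes the ShingleFilter behavior
of emitting unigrams plus two-word shingles. It compares analyzing the
whole query string (roughly what the analysis tool shows) with analyzing
each whitespace-separated chunk on its own (roughly what happens after the
query parser has split the query).

```python
def shingles(tokens, max_size=2):
    """Emit unigrams plus word n-grams up to max_size, joined by a space."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Toy query analyzer: whitespace tokenize, lowercase, then shingle."""
    return shingles(text.lower().split())


def parse_then_analyze(query):
    """Mimic a parser that splits on whitespace BEFORE calling the analyzer."""
    terms = []
    for chunk in query.split():
        # Each chunk is a single word, so no two-word shingle can be formed.
        terms.extend(analyze(chunk))
    return terms


query = "Apple Pie recipe"  # hypothetical user query
whole = analyze(query)               # analyzer sees the full string
split = parse_then_analyze(query)    # analyzer sees one word at a time

print("whole string:", whole)
print("pre-split:   ", split)
```

In the first case the term "apple pie" is produced and can match the
indexed keyword; in the second case only the single words come out, which
is consistent with the zero matches you see against the live index.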
--
Robert Muir
rcmuir@gmail.com