lucene-solr-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: shingles work in analyzer but not real data
Date Wed, 01 Sep 2010 13:35:19 GMT
On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <jeff@globalorange.nl> wrote:

> Hi,
> We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word.  For example a keyword
> might be "apple pie" and we only want it to match for a query containing
> that word pair, but not one only containing "apple".  Here is the relevant
> piece of the schema.xml, defining the index and query pipelines:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.TrimFilterFactory"/>
>     <filter class="solr.ShingleFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> In the analysis tool this schema looks like it works correctly.  Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and matched.
> Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches.  I can see in the index histogram that the terms
> are correctly indexed from our mysql datasource containing the keywords, but
> somehow the shingling doesn't appear to work on this live data.  Does anyone
> have experience with shingling that might have some tips for us, or
> otherwise advice for debugging the issue?
>

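Query-time shingling probably isn't working with the query parser you are
using. The default Lucene one splits the query on whitespace before sending
anything to the analyzer, so a query of foo bar is processed as
TokenStream(foo) + TokenStream(bar).

The ShingleFilter therefore only ever sees one word at a time and never gets
the chance to build the "foo bar" shingle. The analysis tool passes the whole
string through the analyzer in one go, which is why it looks right there but
does not work as you expect against the real index.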
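
To make the difference concrete, here is a rough Python sketch (not Solr or
Lucene code) that imitates the two paths. The function names are made up for
illustration, and shingles() only approximates ShingleFilterFactory with its
defaults (shingles of up to two words, unigrams kept):

```python
def shingles(tokens, max_size=2):
    """Emit each token plus word n-grams up to max_size, joined by a space."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def analyze(text):
    """Stand-in for the query analyzer: whitespace tokenize, lowercase, shingle."""
    return shingles(text.lower().split())


def analysis_tool(query):
    """The analysis page hands the whole string to the analyzer at once."""
    return analyze(query)


def query_parser(query):
    """The default query parser splits on whitespace first, then analyzes
    each chunk separately, so the shingle filter never sees two words."""
    terms = []
    for chunk in query.split():
        terms.extend(analyze(chunk))
    return terms


print(analysis_tool("Apple Pie"))  # includes the "apple pie" shingle
print(query_parser("Apple Pie"))   # only the single words
```

Only the first path ever produces the term "apple pie", which is the only
term that exists in the index for that keyword.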


-- 
Robert Muir
rcmuir@gmail.com
