lucene-solr-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: shingles work in analyzer but not real data
Date Thu, 02 Sep 2010 19:21:00 GMT
Hi Jeff,

Have you seen PositionFilterFactory?:
 <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory>
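
As a rough, untested sketch of where it would go: this assumes the "text" field type from the schema quoted below, with PositionFilterFactory appended after ShingleFilterFactory in the query analyzer and left at its default settings. The placement is an assumption, not a verified fix.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"/>
    <!-- Assumed addition: places every token after the first at the same
         position as the first one, so the shingles should be treated as
         alternatives instead of being combined into a phrase query. -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this can only help when the analyzer receives the whole phrase at once. Given Robert's point below about the default query parser splitting on whitespace first, that probably means quoting the query as well.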

Steve

> -----Original Message-----
> From: Jeff Rose [mailto:jeff@globalorange.nl]
> Sent: Thursday, September 02, 2010 9:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: shingles work in analyzer but not real data
> 
> On Wed, Sep 1, 2010 at 3:35 PM, Robert Muir <rcmuir@gmail.com> wrote:
> 
> > On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose <jeff@globalorange.nl> wrote:
> >
> > > Hi,
> > > We are using SOLR to match query strings with a keyword database, where
> > > some of the keywords are actually more than one word.  For example a
> > > keyword might be "apple pie" and we only want it to match for a query
> > > containing that word pair, but not one only containing "apple".  Here is
> > > the relevant piece of the schema.xml, defining the index and query
> > > pipelines:
> > >
> > >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >    <analyzer type="index">
> > >      <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> > >      <filter class="solr.LowerCaseFilterFactory"/>
> > >      <filter class="solr.TrimFilterFactory"/>
> > >    </analyzer>
> > >    <analyzer type="query">
> > >      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >      <filter class="solr.LowerCaseFilterFactory"/>
> > >      <filter class="solr.TrimFilterFactory"/>
> > >      <filter class="solr.ShingleFilterFactory"/>
> > >    </analyzer>
> > >  </fieldType>
> > >
> > > In the analysis tool this schema looks like it works correctly.  Our
> > > multi-word keywords are indexed as a single entry, and then when a search
> > > phrase contains one of these multi-word keywords it is shingled and
> > > matched.  Unfortunately, when we do the same queries on top of the actual
> > > index it responds with zero matches.  I can see in the index histogram
> > > that the terms are correctly indexed from our mysql datasource containing
> > > the keywords, but somehow the shingling doesn't appear to work on this
> > > live data.  Does anyone have experience with shingling that might have
> > > some tips for us, or otherwise advice for debugging the issue?
> > >
> >
> > Query-time shingling probably isn't working with the query parser you are
> > using. The default Lucene one first splits on whitespace before sending
> > the text to the analyzer: e.g. a query of foo bar is processed as
> > TokenStream(foo) + TokenStream(bar).
> >
> > So query-time shingling like this doesn't work as you expect.
> 
> 
> Hi Robert, thanks for the response.  I've looked into the query parsers a
> bit and I did find that using the raw parser on a matching multi-word
> keyword works correctly.  I need to have shingling though, in order to
> support query phrases.  It seems odd to have the query parser emitting
> tokens though.  If this is the case why would we ever use the
> WhitespaceTokenizer?  Either way, do you know what the correct
> configuration should be to actually perform shingling as it is documented
> to work: joining adjacent tokens into a single search term?  (e.g. "apple"
> "pie" should become "apple pie")
> 
> Thanks a lot for the help.
> 
> -Jeff
> 
> P.S. Markus, putting double quotes around the query doesn't seem to have
> any effect.  It would be nice to have the analysis debug output on the
> actual queries so that I could see what is being searched for after
> analysis...