lucene-solr-user mailing list archives

From Jeff Rose <j...@globalorange.nl>
Subject Re: shingles work in analyzer but not real data
Date Fri, 03 Sep 2010 09:35:17 GMT
I don't have any fancy links, but from the documentation shingles make
pretty good sense.

You typically tokenize an input string so that "the best apple pie" becomes
"the" "best" "apple" "pie", so that each term can then be filtered to remove
stop words, strip plurals and suffixes like "ing", etc.  The problem comes
when you want to match multi-word phrases like "apple pie": once the input
is split into single words there is no "apple pie" token left to match.
Shingles deal with this.  The shingle filter takes in successive tokens and
emits output tokens built from 1 to n adjacent tokens, where n is a setting.
So with a maximum shingle size of 2, the default, you get "the" "the best"
"best" "best apple" "apple" "apple pie" "pie" from the above string.  Now we
can match "apple pie".
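
To make that concrete, here is a rough Python sketch of the idea.  It is not
the Lucene implementation, just an illustration; the function name `shingles`
and its parameters are made up for this example.

```python
def shingles(tokens, max_size=2, output_unigrams=True):
    """Illustrative only: emit word n-grams starting at each token.

    Loosely mimics what a shingle filter produces; the names and
    parameters here are hypothetical, not Solr's or Lucene's API.
    """
    out = []
    min_size = 1 if output_unigrams else 2
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            end = start + size
            if end > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(" ".join(tokens[start:end]))
    return out


print(shingles("the best apple pie".split()))
# ['the', 'the best', 'best', 'best apple', 'apple', 'apple pie', 'pie']
```

Note that the function can only join tokens it receives together.  Called on
a single token, e.g. `shingles(["apple"])`, it returns just `['apple']`, so
if something upstream hands over the words one at a time, no "apple pie"
shingle is ever produced.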

Besides the shingling there is apparently also some concept of position,
which I don't yet understand.

-Jeff

On Fri, Sep 3, 2010 at 11:05 AM, Dennis Gearon <gearond@sbcglobal.net> wrote:

> Anyone got a definitive, authoritative link to the definition of a
> 'shingle' in search engine results/technology?
>
>
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Fri, 9/3/10, Jeff Rose <jeff@globalorange.nl> wrote:
>
> > From: Jeff Rose <jeff@globalorange.nl>
> > Subject: Re: shingles work in analyzer but not real data
> > To: solr-user@lucene.apache.org
> > Date: Friday, September 3, 2010, 1:48 AM
> > Thanks Steven and Jonathan, we got it working by using a combination of
> > quoting and the PositionFilterFactory, as shown below.  The documentation
> > for the position filter doesn't make much sense without understanding more
> > about how positioning of tokens is taken into account, but it appears to
> > do the trick.  Does anyone know why position would matter here?  It seems
> > like tokens would be emitted by a tokenizer, filtered, joined into
> > pairwise tokens by the shingler, and then matched against the index.  If
> > position information is also important it seems odd that this is not
> > discussed in the documentation.  (Same for the pre-tokenizing done by the
> > query parser, before handing phrases to the tokenizer...)
> >
> > Anyway, here is our final schema that works as long as we put search
> > phrases in double quotes.  Thanks for all the help!
> >
> > -Jeff
> >
> >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >    <analyzer type="index">
> >      <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >      <filter class="solr.LowerCaseFilterFactory"/>
> >      <filter class="solr.TrimFilterFactory" />
> >      <filter class="solr.LowerCaseFilterFactory"/>
> >      <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
> >    </analyzer>
> >    <analyzer type="query">
> >      <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >      <filter class="solr.LowerCaseFilterFactory"/>
> >      <filter class="solr.TrimFilterFactory" />
> >      <filter class="solr.ShingleFilterFactory"/>
> >      <filter class="solr.PositionFilterFactory"/>
> >    </analyzer>
> >  </fieldType>
> >
> >
> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:
> >
> > > I've run into this before too. Both the dismax and solr-lucene _query
> > > parsers_ will tokenize a query on whitespace _before_ they pass the query
> > > to any field analyzers. There are some reasons for this, lots of things
> > > wouldn't work if they didn't do this.
> > >
> > > But it makes your approach kind of hard. Try doing your search as a phrase
> > > search with double quotes, "apple pie", I bet it'll work then -- because
> > > both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> > > the stuff inside there before it gets to the field analyzers.
> > >
> > > So if non-tokenized fields like this are all that are included in your
> > > search, and if you can get your client application to just force phrase
> > > quoting of everything before sending to Solr, that might work. Otherwise....
> > > I don't know of a good solution. If you figure one out, let me know.
> > >
> > > Jonathan
> > >
> > >
> > > Jeff Rose wrote:
> > >
> > >> Hi,
> > >>  We are using SOLR to match query strings with a keyword database, where
> > >> some of the keywords are actually more than one word.  For example a
> > >> keyword might be "apple pie" and we only want it to match for a query
> > >> containing that word pair, but not one only containing "apple".  Here is
> > >> the relevant piece of the schema.xml, defining the index and query
> > >> pipelines:
> > >>
> > >>  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> > >>    <analyzer type="index">
> > >>      <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> > >>      <filter class="solr.LowerCaseFilterFactory"/>
> > >>      <filter class="solr.TrimFilterFactory" />
> > >>    </analyzer>
> > >>    <analyzer type="query">
> > >>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >>      <filter class="solr.LowerCaseFilterFactory"/>
> > >>      <filter class="solr.TrimFilterFactory" />
> > >>      <filter class="solr.ShingleFilterFactory" />
> > >>    </analyzer>
> > >>  </fieldType>
> > >>
> > >> In the analysis tool this schema looks like it works correctly.  Our
> > >> multi-word keywords are indexed as a single entry, and then when a search
> > >> phrase contains one of these multi-word keywords it is shingled and
> > >> matched.
> > >>  Unfortunately, when we do the same queries on top of the actual index it
> > >> responds with zero matches.  I can see in the index histogram that the
> > >> terms are correctly indexed from our mysql datasource containing the
> > >> keywords, but somehow the shingling doesn't appear to work on this live
> > >> data.  Does anyone have experience with shingling that might have some
> > >> tips for us, or otherwise advice for debugging the issue?
> > >>
> > >> Thanks,
> > >> Jeff
> > >>
> > >>
> > >>
> > >
> >
>
