lucene-solr-user mailing list archives

From Dennis Gearon <gear...@sbcglobal.net>
Subject Re: shingles work in analyzer but not real data
Date Sat, 04 Sep 2010 05:25:29 GMT
Thank you mucho much, Lance.


Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/3/10, Lance Norskog <goksron@gmail.com> wrote:

> From: Lance Norskog <goksron@gmail.com>
> Subject: Re: shingles work in analyzer but not real data
> To: solr-user@lucene.apache.org
> Date: Friday, September 3, 2010, 9:55 PM
> http://en.wikipedia.org/wiki/W-shingling
> 
> On Fri, Sep 3, 2010 at 6:19 AM, Steven A Rowe <sarowe@syr.edu> wrote:
> > Hi Dennis,
> >
> > I took a stab at answering this question in the following java-user
> > mailing list post:
> >
> > http://www.lucidimagination.com/search/document/6cb7b54cce6872b3/lucene_indexes
> >
> > Steve
> >
> >> -----Original Message-----
> >> From: Dennis Gearon [mailto:gearond@sbcglobal.net]
> >> Sent: Friday, September 03, 2010 5:06 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: shingles work in analyzer but not real data
> >>
> >> Anyone got a definitive, authoritative link to the definition of a
> >> 'shingle' in search engine results/technology?
> >>
> >>
> >> Dennis Gearon
> >>
> >> Signature Warning
> >> ----------------
> >> EARTH has a Right To Life,
> >>   otherwise we all die.
> >>
> >> Read 'Hot, Flat, and Crowded'
> >> Laugh at http://www.yert.com/film.php
> >>
> >>
> >> --- On Fri, 9/3/10, Jeff Rose <jeff@globalorange.nl> wrote:
> >>
> >> > From: Jeff Rose <jeff@globalorange.nl>
> >> > Subject: Re: shingles work in analyzer but not real data
> >> > To: solr-user@lucene.apache.org
> >> > Date: Friday, September 3, 2010, 1:48 AM
> >> > Thanks Steven and Jonathan, we got it working by using a combination of
> >> > quoting and the PositionFilterFactory, as shown below. The documentation
> >> > for the position filter doesn't make much sense without understanding
> >> > more about how the positioning of tokens is taken into account, but it
> >> > appears to do the trick. Does anyone know why position would matter
> >> > here? It seems like tokens would be emitted by a tokenizer, filtered,
> >> > joined into pairwise tokens by the shingler, and then matched against
> >> > the index. If position information is also important, it seems odd that
> >> > this is not discussed in the documentation. (The same goes for the
> >> > pre-tokenizing done by the query parser before handing phrases to the
> >> > tokenizer...)
> >> >
> >> > Anyway, here is our final schema, which works as long as we put search
> >> > phrases in double quotes. Thanks for all the help!
> >> >
> >> > -Jeff
> >> >
> >> >   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >> >     <analyzer type="index">
> >> >       <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >> >       <filter class="solr.LowerCaseFilterFactory"/>
> >> >       <filter class="solr.TrimFilterFactory"/>
> >> >       <filter class="solr.LowerCaseFilterFactory"/>
> >> >       <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> >> >            outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
> >> >     </analyzer>
> >> >     <analyzer type="query">
> >> >       <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
> >> >       <filter class="solr.LowerCaseFilterFactory"/>
> >> >       <filter class="solr.TrimFilterFactory"/>
> >> >       <filter class="solr.ShingleFilterFactory"/>
> >> >       <filter class="solr.PositionFilterFactory"/>
> >> >     </analyzer>
> >> >   </fieldType>
> >> >
> >> >
> >> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:
> >> >
> >> > > I've run into this before too. Both the dismax and solr-lucene _query
> >> > > parsers_ will tokenize a query on whitespace _before_ they pass the
> >> > > query to any field analyzers.
> >> > > There are some reasons for this; lots of things wouldn't work if they
> >> > > didn't do this.
> >> > >
> >> > > But it makes your approach kind of hard. Try doing your search as a
> >> > > phrase search with double quotes, "apple pie". I bet it'll work then,
> >> > > because both dismax and solr-lucene will respect the phrase quotes and
> >> > > NOT tokenize the stuff inside there before it gets to the field
> >> > > analyzers.
> >> > >
> >> > > So if non-tokenized fields like this are all that are included in your
> >> > > search, and if you can get your client application to just force phrase
> >> > > quoting of everything before sending to Solr, that might work.
> >> > > Otherwise.... I don't know of a good solution. If you figure one out,
> >> > > let me know.
> >> > >
> >> > > Jonathan
> >> > >
> >> > >
> >> > > Jeff Rose wrote:
> >> > >
> >> > >> Hi,
> >> > >> We are using Solr to match query strings with a keyword database, where
> >> > >> some of the keywords are actually more than one word. For example, a
> >> > >> keyword might be "apple pie", and we only want it to match for a query
> >> > >> containing that word pair, but not one only containing "apple". Here is
> >> > >> the relevant piece of the schema.xml, defining the index and query
> >> > >> pipelines:
> >> > >>
> >> > >>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >> > >>     <analyzer type="index">
> >> > >>       <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
> >> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> >> > >>       <filter class="solr.TrimFilterFactory"/>
> >> > >>     </analyzer>
> >> > >>     <analyzer type="query">
> >> > >>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >> > >>       <filter class="solr.LowerCaseFilterFactory"/>
> >> > >>       <filter class="solr.TrimFilterFactory"/>
> >> > >>       <filter class="solr.ShingleFilterFactory"/>
> >> > >>     </analyzer>
> >> > >>   </fieldType>
> >> > >>
> >> > >> In the analysis tool this schema looks like it works correctly. Our
> >> > >> multi-word keywords are indexed as a single entry, and then when a
> >> > >> search phrase contains one of these multi-word keywords it is shingled
> >> > >> and matched.
> >> > >> Unfortunately, when we run the same queries against the actual index,
> >> > >> it responds with zero matches. I can see in the index histogram that
> >> > >> the terms are correctly indexed from our MySQL datasource containing
> >> > >> the keywords, but somehow the shingling doesn't appear to work on this
> >> > >> live data. Does anyone have experience with shingling who might have
> >> > >> some tips for us, or other advice for debugging the issue?
> >> > >>
> >> > >> Thanks,
> >> > >> Jeff
> >> > >>
> >> > >>
> >> > >>
> >> > >
> >> >
> >
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com
> 
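
For anyone reading the thread later, here is a rough sketch of why the unquoted query found nothing while the quoted one matched. It is a toy model in Python, not Solr code: the function names (`shingles`, `index_terms`, `analyze_query`, `parse_query`) are made up for illustration, the shingler assumes pairwise shingles with unigrams kept, and token positions (the part PositionFilterFactory deals with) are not modelled at all.

```python
import re


def shingles(tokens, max_size=2, output_unigrams=True):
    """Join adjacent tokens into word n-grams ("shingles") of 2..max_size words."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(2, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def index_terms(field_value):
    """Index side of the schema in the thread: split on ';', trim, lowercase."""
    return {t.strip().lower() for t in field_value.split(";") if t.strip()}


def analyze_query(text):
    """Query side: split on punctuation/space, trim, lowercase, then shingle."""
    tokens = [t.strip().lower() for t in re.split(r"[.,?;: !]", text) if t.strip()]
    return shingles(tokens)


def parse_query(query):
    """Hypothetical query parser: a quoted phrase reaches the analyzer whole,
    while unquoted text is split on whitespace BEFORE analysis."""
    terms = []
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        terms.extend(analyze_query(phrase if phrase else word))
    return terms


indexed = index_terms("Apple Pie; Cheese")

# Unquoted: the analyzer sees "apple" and "pie" separately, so no shingle forms.
print([t for t in parse_query('apple pie') if t in indexed])

# Quoted: the analyzer sees "apple pie" whole and emits the "apple pie" shingle.
print([t for t in parse_query('"apple pie"') if t in indexed])
```

In this model the analysis tool corresponds to calling `analyze_query` directly on the whole string, which is why everything looked fine there: the whitespace split in `parse_query` never happens in that path.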
