lucene-solr-user mailing list archives

From scott chu (朱炎詹) <scott....@udngroup.com>
Subject Re: shingles work in analyzer but not real data
Date Fri, 03 Sep 2010 09:55:14 GMT
See p. 288 of the book "Solr 1.4 Enterprise Search Server" by David Smiley and Eric Pugh.

Shingling suits phrase queries because it works at the token (word) level. It
is similar to n-grams, except that n-grams are built from the characters
within a term.
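
As a rough illustration of that difference, here is a minimal Python sketch. It is not Solr code: the function names `word_shingles` and `char_ngrams` and their parameters are made up for this example, and the output order only approximates what the Solr filters emit.

```python
def word_shingles(text, max_size=2, output_unigrams=True):
    """Token-level shingles: each word, followed by the multi-word
    shingles (up to max_size words) that start at that word."""
    tokens = text.split()
    min_size = 1 if output_unigrams else 2
    out = []
    for i in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


def char_ngrams(term, min_size=2, max_size=3):
    """Character-level n-grams taken from inside a single term."""
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(term) - size + 1):
            out.append(term[i:i + size])
    return out


print(word_shingles("the best apple pie"))
# ['the', 'the best', 'best', 'best apple', 'apple', 'apple pie', 'pie']
print(char_ngrams("apple"))
# ['ap', 'pp', 'pl', 'le', 'app', 'ppl', 'ple']
```

The shingles keep whole words together, which is why they help with phrases such as "apple pie", while the n-grams only ever see pieces of one term.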

We currently use shingling in our index with a maximum shingle size of 3. Be
careful: index build time and index size can grow dramatically as the maximum
shingle size increases.
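
To get a feel for that growth, the following Python sketch counts how many tokens a shingle filter would emit for a field of a given length. It is only a back-of-the-envelope model under stated assumptions: the helper name `shingle_count` is hypothetical, it assumes unigrams are also output, and it counts emitted tokens rather than measuring a real index.

```python
def shingle_count(num_tokens, max_size, output_unigrams=True):
    """Number of tokens emitted when shingling num_tokens input tokens
    with shingle sizes from 1 (or 2) up to max_size."""
    min_size = 1 if output_unigrams else 2
    return sum(
        max(num_tokens - size + 1, 0)
        for size in range(min_size, max_size + 1)
    )


# Emitted tokens for a 1000-token field at increasing max shingle sizes.
for max_size in (2, 3, 4, 5):
    print(max_size, shingle_count(1000, max_size))
# 2 1999
# 3 2997
# 4 3994
# 5 4990
```

The emitted token count grows roughly in proportion to the maximum shingle size. The number of distinct terms in the index is likely to grow faster than that, since longer shingles tend to repeat less often than single words, but the actual effect depends on the data.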

Scott

----- Original Message ----- 
From: "Jeff Rose" <jeff@globalorange.nl>
To: <solr-user@lucene.apache.org>
Sent: Friday, September 03, 2010 5:35 PM
Subject: Re: shingles work in analyzer but not real data


>I don't have any fancy links, but from the documentation shingles make
> pretty good sense.
>
> You typically tokenize an input string so that "the best apple pie" becomes
> "the" "best" "apple" "pie", so that each term can then be filtered to remove
> stop words, take off plurals and suffixes like "ing", etc.  The problem is
> if you want to search for multi-word phrases, like "apple pie".  This
> default splitting behavior won't let you do that, so to deal with this
> problem you can use shingles.  The shingle filter will take in successive
> tokens and then produce a series of output tokens composed of the last 1-n
> tokens, where n is a setting.  So with shingles of size 2, the default, you
> get "the" "the best" "best" "best apple" "apple" "apple pie" "pie" from the
> above string.  Now we can match "apple pie".
>
> Besides the shingling there is apparently also some concept of position,
> which I don't yet understand.
>
> -Jeff
>
> On Fri, Sep 3, 2010 at 11:05 AM, Dennis Gearon <gearond@sbcglobal.net> wrote:
>
>> Anyone got a definitive, authoritative link to the definition of a
>> 'shingle' in search engine results/technology?
>>
>>
>> Dennis Gearon
>>
>> Signature Warning
>> ----------------
>> EARTH has a Right To Life,
>>  otherwise we all die.
>>
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>>
>>
>> --- On Fri, 9/3/10, Jeff Rose <jeff@globalorange.nl> wrote:
>>
>> > From: Jeff Rose <jeff@globalorange.nl>
>> > Subject: Re: shingles work in analyzer but not real data
>> > To: solr-user@lucene.apache.org
>> > Date: Friday, September 3, 2010, 1:48 AM
>> > Thanks Steven and Jonathan, we got it working by using a combination of
>> > quoting and the PositionFilterFactory, as shown below.  The documentation
>> > for the position filter doesn't make much sense without understanding more
>> > about how positioning of tokens is taken into account, but it appears to do
>> > the trick.  Does anyone know why position would matter here?  It seems like
>> > tokens would be emitted by a tokenizer, filtered, joined into pairwise
>> > tokens by the shingler, and then matched against the index.  If position
>> > information is also important it seems odd that this is not discussed in
>> > the documentation.  (Same for the pre-tokenizing done by the query parser,
>> > before handing phrases to the tokenizer...)
>> >
>> > Anyway, here is our final schema that works as long as we put search
>> > phrases in double quotes.  Thanks for all the help!
>> >
>> > -Jeff
>> >
>> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer type="index">
>> >     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
>> >          outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
>> >   </analyzer>
>> >   <analyzer type="query">
>> >     <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;: !]"/>
>> >     <filter class="solr.LowerCaseFilterFactory"/>
>> >     <filter class="solr.TrimFilterFactory"/>
>> >     <filter class="solr.ShingleFilterFactory"/>
>> >     <filter class="solr.PositionFilterFactory"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> >
>> > On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <rochkind@jhu.edu> wrote:
>> >
>> > > I've run into this before too. Both the dismax and solr-lucene _query
>> > > parsers_ will tokenize a query on whitespace _before_ they pass the
>> > > query to any field analyzers.
>> > > There are some reasons for this, lots of things wouldn't work if they
>> > > didn't do this.
>> > >
>> > > But it makes your approach kind of hard. Try doing your search as a
>> > > phrase search with double quotes, "apple pie", I bet it'll work then --
>> > > because both dismax and solr-lucene will respect the phrase quotes and
>> > > NOT tokenize the stuff inside there before it gets to the field
>> > > analyzers.
>> > >
>> > > So if non-tokenized fields like this are all that are included in your
>> > > search, and if you can get your client application to just force phrase
>> > > quoting of everything before sending to Solr, that might work.
>> > > Otherwise.... I don't know of a good solution. If you figure one out,
>> > > let me know.
>> > >
>> > > Jonathan
>> > >
>> > >
>> > > Jeff Rose wrote:
>> > >
>> > >> Hi,
>> > >>  We are using SOLR to match query strings with a keyword database,
>> > >> where some of the keywords are actually more than one word.  For
>> > >> example a keyword might be "apple pie" and we only want it to match for
>> > >> a query containing that word pair, but not one only containing "apple".
>> > >> Here is the relevant piece of the schema.xml, defining the index and
>> > >> query pipelines:
>> > >>
>> > >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>> > >>   <analyzer type="index">
>> > >>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>> > >>     <filter class="solr.LowerCaseFilterFactory"/>
>> > >>     <filter class="solr.TrimFilterFactory"/>
>> > >>   </analyzer>
>> > >>   <analyzer type="query">
>> > >>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> > >>     <filter class="solr.LowerCaseFilterFactory"/>
>> > >>     <filter class="solr.TrimFilterFactory"/>
>> > >>     <filter class="solr.ShingleFilterFactory"/>
>> > >>   </analyzer>
>> > >> </fieldType>
>> > >>
>> > >> In the analysis tool this schema looks like it works correctly.  Our
>> > >> multi-word keywords are indexed as a single entry, and then when a
>> > >> search phrase contains one of these multi-word keywords it is shingled
>> > >> and matched.
>> > >>  Unfortunately, when we do the same queries on top of the actual index
>> > >> it responds with zero matches.  I can see in the index histogram that
>> > >> the terms are correctly indexed from our mysql datasource containing the
>> > >> keywords, but somehow the shingling doesn't appear to work on this live
>> > >> data.  Does anyone have experience with shingling that might have some
>> > >> tips for us, or otherwise advice for debugging the issue?
>> > >>
>> > >> Thanks,
>> > >> Jeff
>> > >>
>> > >>
>> > >>
>> > >
>> >
>>
>

