lucene-solr-user mailing list archives

From Jeff Rose <>
Subject Re: shingles work in analyzer but not real data
Date Fri, 03 Sep 2010 08:48:01 GMT
Thanks Steven and Jonathan, we got it working by using a combination of
quoting and the PositionFilterFactory, as shown below.  The documentation
for the position filter doesn't make much sense without understanding more
about how token positions are taken into account, but it appears to do the
trick.  Does anyone know why position would matter here?  It seems like
tokens would be emitted by a tokenizer, filtered, joined into pairwise
tokens by the shingler, and then matched against the index.  If position
information is also important, it seems odd that this is not discussed in
the documentation.  (The same goes for the pre-tokenizing done by the query
parser before it hands phrases to the tokenizer.)
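
To make the pre-tokenizing point concrete, here is a rough Python model (not
Solr code) of what Jonathan described: unquoted input is split on whitespace
before the analyzer runs, so the shingle filter only ever sees one word at a
time, while a quoted phrase reaches it whole.  The function names and the
tiny keyword set are made up for illustration, and the analyzer is assumed
to split on whitespace, lowercase, trim, and emit unigrams plus two-word
shingles.

```python
import re


def shingles(tokens, max_size=2):
    """Emit unigrams plus word n-grams up to max_size (assumed shingle behavior)."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


def query_analyzer(text):
    """Simplified query chain: whitespace tokenizer, lowercase, trim, shingle."""
    tokens = [t.strip().lower() for t in text.split()]
    return shingles(tokens)


def parse_query(q):
    """Hypothetical stand-in for the query parser.

    Quoted phrases are handed to the analyzer whole; everything else is
    split on whitespace first and analyzed one chunk at a time.
    """
    terms = []
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', q):
        chunk = phrase if phrase else word
        terms.extend(query_analyzer(chunk))
    return terms


# Made-up keyword index: each keyword is stored as a single term.
index = {"apple pie", "cheese"}

unquoted = parse_query('apple pie')    # analyzer sees "apple", then "pie"
quoted = parse_query('"apple pie"')    # analyzer sees "apple pie" in one go

print(unquoted, index.intersection(unquoted))
print(quoted, index.intersection(quoted))
```

In this model the unquoted query never produces the "apple pie" term at all,
which would explain zero matches on the live index even though the analysis
tool (which feeds the whole string to the analyzer) looks fine.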

Anyway, here is our final schema that works as long as we put search phrases
in double quotes.  Thanks for all the help!


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="2"/> -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,?;:
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
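
One plausible explanation for why position matters, sketched in Python
rather than Solr code: when the analyzer returns several tokens for a quoted
phrase, the query parser looks at their positions to decide what kind of
query to build.  Tokens spread over more than one position become a phrase
query, while tokens that all share one position are treated as alternatives.
The position filter keeps the first token's increment and sets the rest to
zero, which forces the second case.  Everything below is a simplified model
under those assumptions; the function names, the tuple return values, and
the position increments assigned to the shingles are illustrative, not taken
from the parser source.

```python
def with_positions(stream):
    """Turn (term, position_increment) pairs into (term, absolute_position)."""
    pos = -1
    out = []
    for term, inc in stream:
        pos += inc
        out.append((term, pos))
    return out


def position_filter(stream, increment=0):
    """Keep the first token's increment, force every later one to `increment`."""
    return [
        (term, inc if i == 0 else increment)
        for i, (term, inc) in enumerate(stream)
    ]


def build_query(stream):
    """Simplified model of how the parser picks a query type from positions."""
    terms = [term for term, _ in stream]
    positions = {pos for _, pos in with_positions(stream)}
    if len(terms) == 1:
        return ("term", terms)
    if len(positions) == 1:
        return ("any_of", terms)   # alternatives: any single term may match
    return ("phrase", terms)       # terms must line up at successive positions


# Assumed shingle output for "apple pie": the two-word shingle sits at the
# same position as its first word, and "pie" advances by one.
shingled = [("apple", 1), ("apple pie", 0), ("pie", 1)]

print(build_query(shingled))
print(build_query(position_filter(shingled)))
```

Since each keyword is indexed as a single term, a phrase query that also
needs "pie" at the following position cannot match, whereas the
"any of these terms" form can match on "apple pie" alone.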

On Thu, Sep 2, 2010 at 11:47 PM, Jonathan Rochkind <> wrote:

> I've run into this before too. Both the dismax and solr-lucene _query
> parsers_ will tokenize a query on whitespace _before_ they pass the query to
> any field analyzers.
> There are some reasons for this, lots of things wouldn't work if they
> didn't do this.
> But it makes your approach kind of hard. Try doing your search as a phrase
> search with double quotes, "apple pie", I bet it'll work then -- because
> both dismax and solr-lucene will respect the phrase quotes and NOT tokenize
> the stuff inside there before it gets to the field analyzers.
> So if non-tokenized fields like this are all that are included in your
> search, and if you can get your client application to just force phrase
> quoting of everything before sending to Solr, that might work. Otherwise....
> I don't know of a good solution. If you figure one out, let me know.
> Jonathan
> Jeff Rose wrote:
>> Hi,
>> We are using Solr to match query strings with a keyword database, where
>> some of the keywords are actually more than one word.  For example, a
>> keyword might be "apple pie", and we only want it to match a query
>> containing that word pair, not one containing only "apple".  Here is the
>> relevant piece of the schema.xml, defining the index and query pipelines:
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.PatternTokenizerFactory" pattern=";"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.TrimFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.TrimFilterFactory"/>
>>     <filter class="solr.ShingleFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>> In the analysis tool this schema looks like it works correctly.  Our
>> multi-word keywords are indexed as a single entry, and then when a search
>> phrase contains one of these multi-word keywords it is shingled and
>> matched.
>> Unfortunately, when we run the same queries against the actual index, it
>> responds with zero matches.  I can see in the index histogram that the
>> terms are correctly indexed from our MySQL datasource containing the
>> keywords, but somehow the shingling doesn't appear to work on this live
>> data.  Does anyone have experience with shingling that might have some
>> tips for us, or other advice for debugging the issue?
>> Thanks,
>> Jeff
