lucene-solr-dev mailing list archives

From "Jason Rutherglen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-908) Port of Nutch CommonGrams filter to Solr
Date Fri, 18 Sep 2009 03:13:57 GMT

    [ https://issues.apache.org/jira/browse/SOLR-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756933#action_12756933 ]

Jason Rutherglen commented on SOLR-908:
---------------------------------------

This schema repeatedly produces truncated queries, at seemingly
random times. Perhaps it is because we're mixing the new and old
tokenizing APIs? I can't figure out what state is being shared,
or how to debug this. Unfortunately we have upgraded to Solr 1.4
trunk and cannot revert to 1.3. I wrote a test case, but it has
not reproduced the bug locally. The bug happens in a distributed
environment with 20+ servers.

{code}
<fieldType name="vCommonGrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>
{code}

> Port of Nutch  CommonGrams filter to Solr
> -----------------------------------------
>
>                 Key: SOLR-908
>                 URL: https://issues.apache.org/jira/browse/SOLR-908
>             Project: Solr
>          Issue Type: Wish
>          Components: Analysis
>            Reporter: Tom Burton-West
>            Priority: Minor
>         Attachments: CommonGramsPort.zip, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch,
>                      SOLR-908.patch, SOLR-908.patch, SOLR-908.patch, SOLR-908.patch
>
>
> Phrase queries containing common words are extremely slow. We are reluctant to just
> use stop words due to various problems with false hits and some things becoming impossible
> to search with stop words turned on. (For example "to be or not to be", "the who", "man in
> the moon" vs "man on the moon" etc.)
> Several postings regarding slow phrase queries have suggested using the approach used
> by Nutch. Perhaps someone with more Java/Solr experience might take this on.
> It should be possible to port the Nutch CommonGrams code to Solr and create a suitable
> Solr FilterFactory so that it could be used in Solr by listing it in the Solr schema.xml.
> "Construct n-grams for frequently occurring terms and phrases while indexing. Optimize
> phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid."
> http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html
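
For readers unfamiliar with the idea quoted above, here is a minimal sketch of the index-side overlay, restricted to bigrams. It is not the Nutch code or the patch attached to this issue. The class name CommonGramsSketch, the method overlay, and the "_" separator are assumptions made for illustration, and query-side handling is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not the Nutch or Solr implementation.
final class CommonGramsSketch {

    // Hypothetical separator used to join the two halves of a gram.
    static final String SEPARATOR = "_";

    /**
     * Returns one entry per input position. Each entry holds the single term
     * and, when the term or its successor is a common word, the bigram that
     * starts at that position.
     */
    static List<List<String>> overlay(List<String> tokens, Set<String> commonWords) {
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> atPosition = new ArrayList<>();
            String current = tokens.get(i);
            atPosition.add(current); // single terms are still indexed
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    // Overlay the bigram at the same position as the single term.
                    atPosition.add(current + SEPARATOR + next);
                }
            }
            positions.add(atPosition);
        }
        return positions;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "in", "on"));
        List<String> tokens = Arrays.asList("man", "in", "the", "moon");
        // Prints [[man, man_in], [in, in_the], [the, the_moon], [moon]]
        System.out.println(overlay(tokens, common));
    }
}
```

With grams like these in the index, "man in the moon" and "man on the moon" differ in the terms man_in and man_on, so a phrase query could match on the much rarer grams instead of intersecting the long posting lists of "in", "on" and "the".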

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

