lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2450) Carrot2 clustering should use both its own and Solr's stop words
Date Wed, 30 Mar 2011 20:29:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013629#comment-13013629 ]

Robert Muir commented on SOLR-2450:
-----------------------------------

Just to extend Hossman's point, there are a variety of ways someone could be setting up
stopwords:

* With StopFilterFactory.
* By configuring their analyzer with <analyzer class=....>, where the Analyzer itself uses a
stopword list internally. In this case, if it's a supplied Lucene analyzer, you can check
whether it is an instance of StopwordAnalyzerBase and, if so, invoke getStopwordSet() on it.
It's true, though, that someone could write a custom analyzer that uses stopwords but extends
Analyzer directly, and this check would miss it.
* By using stopword-like components such as CommonGramsFilter, which still have the concept of
stopwords but work differently.
* By using a custom filter/analyzer of their own that acts like StopFilter.
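
A rough sketch of the instanceof check from the second bullet. To keep it compilable without Lucene on the classpath, the nested Analyzer and StopwordAnalyzerBase types below are stand-ins for the real Lucene classes, and SampleStopwordAnalyzer, OpaqueAnalyzer and stopwordsOf are hypothetical names made up for illustration. Only the instanceof test and the getStopwordSet() call come from the comment above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

class StopwordProbe {
    // Stand-in for org.apache.lucene.analysis.Analyzer (assumption: simplified).
    static abstract class Analyzer {}

    // Stand-in for Lucene's StopwordAnalyzerBase; only getStopwordSet() is modelled.
    static abstract class StopwordAnalyzerBase extends Analyzer {
        private final Set<String> stopwords;

        protected StopwordAnalyzerBase(Set<String> stopwords) {
            this.stopwords = Collections.unmodifiableSet(new LinkedHashSet<>(stopwords));
        }

        public Set<String> getStopwordSet() {
            return stopwords;
        }
    }

    // Hypothetical analyzer that exposes its stopwords through the base class.
    static class SampleStopwordAnalyzer extends StopwordAnalyzerBase {
        SampleStopwordAnalyzer(Set<String> stopwords) {
            super(stopwords);
        }
    }

    // Hypothetical custom analyzer that extends Analyzer directly; any stopwords
    // it uses internally are invisible to the check below.
    static class OpaqueAnalyzer extends Analyzer {}

    // Returns the stopword set if the analyzer exposes one, otherwise empty.
    static Optional<Set<String>> stopwordsOf(Analyzer analyzer) {
        if (analyzer instanceof StopwordAnalyzerBase) {
            return Optional.of(((StopwordAnalyzerBase) analyzer).getStopwordSet());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Analyzer known = new SampleStopwordAnalyzer(
                new LinkedHashSet<>(Arrays.asList("the", "a", "of")));
        Analyzer custom = new OpaqueAnalyzer();

        System.out.println("known analyzer exposes stopwords: " + stopwordsOf(known).isPresent());
        System.out.println("custom analyzer exposes stopwords: " + stopwordsOf(custom).isPresent());
    }
}
```

The empty result for OpaqueAnalyzer is the gap described above: an analyzer that extends Analyzer directly gives the caller nothing to inspect, so this check can only ever be a partial answer.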


> Carrot2 clustering should use both its own and Solr's stop words
> ----------------------------------------------------------------
>
>                 Key: SOLR-2450
>                 URL: https://issues.apache.org/jira/browse/SOLR-2450
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Clustering
>            Reporter: Stanislaw Osinski
>            Assignee: Stanislaw Osinski
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>
> While using only Solr's stop words for clustering isn't a good idea (compared to indexing,
> clustering needs more aggressive stop word removal to get reasonable cluster labels), it would
> be good if Carrot2 used both its own and Solr's stop words.
> I'm not sure what the best way to implement this would be though. My first thought was
> to simply load {{stopwords.txt}} from Solr config dir and merge them with Carrot2's. But then,
> maybe a better approach would be to get the stop words from the StopFilter being used? Ideally,
> we should also consider the per-field stop filters configured on the fields used for clustering.
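
The first idea in the quoted description (load stopwords.txt and merge it with Carrot2's list) could look roughly like the sketch below. It assumes the file holds one word per line, with blank lines and lines starting with # ignored. The class and method names are hypothetical, and the Carrot2 side is represented by a plain Set rather than any actual Carrot2 API.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class StopWordMerge {
    // Parses word-list lines: trims each line, skips blanks and '#' comments
    // (assumed file format; adjust if the list uses a different convention).
    static Set<String> parseWordList(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            words.add(word);
        }
        return words;
    }

    // Union of the clustering-specific stop words and the ones taken from Solr.
    static Set<String> merge(Set<String> clusteringStopWords, Set<String> solrStopWords) {
        Set<String> merged = new LinkedHashSet<>(clusteringStopWords);
        merged.addAll(solrStopWords);
        return merged;
    }

    public static void main(String[] args) {
        // In a real setup these lines would be read from stopwords.txt in the config dir.
        List<String> fileLines = Arrays.asList("# sample stop words", "the", "", "  an  ", "of");
        Set<String> clustering = new LinkedHashSet<>(Arrays.asList("the", "of", "new", "using"));

        Set<String> merged = merge(clustering, parseWordList(fileLines));
        System.out.println(merged);
    }
}
```

A plain union keeps the more aggressive clustering list intact and only ever adds to it. It does not address the per-field case or stop words that are only reachable through a configured filter or analyzer.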

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


