lucene-dev mailing list archives

From "Stanislaw Osinski (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0
Date Fri, 20 Aug 2010 16:26:17 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900755#action_12900755
] 

Stanislaw Osinski commented on SOLR-1804:
-----------------------------------------

Hi Robert,

Some initial work on tighter integration with Solr should be possible after applying the patch
from this issue. The patch contains a Solr-specific implementation of Carrot2's [ILanguageModel|http://download.carrot2.org/stable/javadoc/org/carrot2/text/linguistic/ILanguageModel.html]
interface. My rough guess is that the implementation of that interface could be further tweaked
to create IStemmer and ITokenizer implementations based on the schema.xml settings. It could
also implement the isCommonWord() method based on Solr's resources. A few notes though:

* Carrot2 is slightly different from typical IR in the sense that it doesn't completely discard
stop words -- the tokenizer does not remove them from the token stream. The reason for this
is that the cluster labels are taken literally from the input text, and if we discarded stop
words, the labels wouldn't be as readable.

* The ILanguageModel#isStopLabel() method is another Carrot2-specific thing. It's a more fine-grained
method of removing useless labels, especially useful for domain-specific content. Carrot2's
default implementation is based on regular expressions similar to [this|https://carrot2.svn.sourceforge.net/svnroot/carrot2/trunk/core/carrot2-util-text/src-resources/stoplabels.en].
I'm not sure if there's a corresponding resource in Solr though.
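
To make the two notes above concrete, here is a minimal sketch in plain Java. It does not use Carrot2's or Solr's actual classes: LanguageModelSketch, Token and tokenize() are hypothetical names invented for illustration, and the word list and regular expressions are made-up examples, not the contents of stoplabels.en. It only shows the idea: stop words are flagged but stay in the token stream, and whole labels are filtered by regular expressions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a language model; NOT Carrot2's ILanguageModel. */
final class LanguageModelSketch {

    /** A token with a flag: common words are marked, never dropped. */
    static final class Token {
        final String text;
        final boolean commonWord;

        Token(String text, boolean commonWord) {
            this.text = text;
            this.commonWord = commonWord;
        }
    }

    private final Set<String> commonWords = new HashSet<>();
    private final List<Pattern> stopLabelPatterns = new ArrayList<>();

    /** Both arguments are assumed inputs, e.g. loaded from Solr-side resources. */
    LanguageModelSketch(Set<String> words, List<String> stopLabelRegexes) {
        for (String w : words) {
            commonWords.add(w.toLowerCase(Locale.ROOT));
        }
        for (String regex : stopLabelRegexes) {
            stopLabelPatterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }
    }

    /** True if the word is on the stop word list (case-insensitive). */
    boolean isCommonWord(String word) {
        return commonWords.contains(word.toLowerCase(Locale.ROOT));
    }

    /** True if the whole candidate label matches any stop-label expression. */
    boolean isStopLabel(String label) {
        for (Pattern p : stopLabelPatterns) {
            if (p.matcher(label).matches()) {
                return true;
            }
        }
        return false;
    }

    /** Whitespace tokenizer that keeps every token and only flags stop words. */
    List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(new Token(part, isCommonWord(part)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        LanguageModelSketch model = new LanguageModelSketch(
            new HashSet<>(Arrays.asList("of", "the")),   // illustrative stop words
            Arrays.asList("click here.*", "read more")); // illustrative stop labels

        for (Token t : model.tokenize("State of the art")) {
            System.out.println(t.text + " common=" + t.commonWord);
        }
        System.out.println("stop label? " + model.isStopLabel("Click here"));
    }
}
```

Because "of" and "the" remain in the stream, a label such as "State of the art" can still be assembled verbatim from the input, while a label like "Click here" is rejected as a whole.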

We're thinking of restructuring Carrot2's language model a bit in one of the next releases,
so it's a good chance to include some Solr-inspired improvements as well.

S.

> Upgrade Carrot2 to 3.2.0
> ------------------------
>
>                 Key: SOLR-1804
>                 URL: https://issues.apache.org/jira/browse/SOLR-1804
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Clustering
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>         Attachments: SOLR-1804-carrot2-3.4.0-dev-trunk.patch, SOLR-1804-carrot2-3.4.0-dev.patch,
SOLR-1804-carrot2-3.4.0-libs.zip, SOLR-1804.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



