lucene-dev mailing list archives

From "Stanislaw Osinski (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-2917) Support for field-specific tokenizers, token- and character filters in search results clustering
Date Fri, 25 Nov 2011 08:36:39 GMT
Support for field-specific tokenizers, token- and character filters in search results clustering
------------------------------------------------------------------------------------------------

                 Key: SOLR-2917
                 URL: https://issues.apache.org/jira/browse/SOLR-2917
             Project: Solr
          Issue Type: Improvement
          Components: contrib - Clustering
            Reporter: Stanislaw Osinski
            Assignee: Stanislaw Osinski
             Fix For: 3.6


Currently, the Carrot2 search results clustering component creates clusters based on the raw text
of a field. The reason for this is that Carrot2 aims to create meaningful cluster labels by
using sequences of words taken directly from the documents' text (including stop words: _Development
of Lucene and Solr_ is more readable than _Development Lucene Solr_). The easiest way of providing
input for such a process was feeding Carrot2 with raw (stored) document content.

It is, however, possible to take into account +some+ of the field's filters during clustering.
Because Carrot2 does not currently expose an API for feeding pre-tokenized input, the clustering
component would need to: 

1. get raw text of the field, 
2. run it through the field's char filters, tokenizers and selected token filters (omitting
e.g. the stop word filter and stemmers, because Carrot2 needs the original words to produce
readable cluster labels),
3. glue the output back into a string and feed it to Carrot2 for clustering.
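
The three steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the Solr or Carrot2 API: the filter functions, the filter names ("possessive", "stop", "stem") and prepare_for_clustering are all hypothetical stand-ins for a field's real analysis chain, and the returned string stands for what would be handed to Carrot2.

```python
import re
from typing import Callable, List, Sequence, Set, Tuple

TokenFilter = Callable[[List[str]], List[str]]


def strip_markup(text: str) -> str:
    """Hypothetical char filter: replace markup tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer: split on runs of whitespace."""
    return text.split()


def drop_possessive(tokens: List[str]) -> List[str]:
    """Hypothetical token filter: strip a trailing 's from each token."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def drop_stop_words(tokens: List[str]) -> List[str]:
    """Hypothetical stop word filter; skipped when preparing clustering input."""
    stop_words = {"of", "and", "the"}
    return [t for t in tokens if t.lower() not in stop_words]


# Assumed token filter chain of the field, in order, keyed by an illustrative name.
FIELD_TOKEN_FILTERS: Sequence[Tuple[str, TokenFilter]] = (
    ("possessive", drop_possessive),
    ("stop", drop_stop_words),
)

# Filters omitted for clustering because they would hurt label readability.
SKIPPED_FOR_CLUSTERING: Set[str] = {"stop", "stem"}


def prepare_for_clustering(
    raw_text: str,
    token_filters: Sequence[Tuple[str, TokenFilter]] = FIELD_TOKEN_FILTERS,
    skipped: Set[str] = SKIPPED_FOR_CLUSTERING,
) -> str:
    # Step 1 is the caller's job: raw_text is the raw (stored) field value.
    # Step 2: char filter, tokenizer, then only the selected token filters.
    tokens = whitespace_tokenizer(strip_markup(raw_text))
    for name, token_filter in token_filters:
        if name not in skipped:
            tokens = token_filter(tokens)
    # Step 3: glue the tokens back into one string for the clustering engine.
    return " ".join(tokens)
```

Under these assumptions, prepare_for_clustering("<b>Development</b> of Lucene and Solr") keeps the stop words, while passing skipped=set() applies the full chain and drops them, which is the less readable form described earlier.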

In the future, to eliminate step 3, we could modify Carrot2 to accept pre-tokenized content.



