lucene-solr-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-769) Support Document and Search Result clustering
Date Sat, 27 Jun 2009 12:20:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724847#action_12724847 ]

Yonik Seeley commented on SOLR-769:
-----------------------------------

The response structure is a bit funny (it mirrors conventional element-per-item XML, which
we don't really use in Solr-land), and it is certainly not optimal for JSON responses:

{code}
 "clusters":[
  "cluster",[
	"labels",[
	 "label","DDR"],
	"docs",[
	 "doc","TWINX2048-3200PRO",
	 "doc","VS1GB400C3",
	 "doc","VDBDB1A16"]],
  "cluster",[
	"labels",[
	 "label","Car Power Adapter"],
	"docs",[
	 "doc","F8V7067-APL-KIT",
	 "doc","IW-02"]],
[...]
{code}
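
To illustrate where the repeated "label" and "doc" names come from, here is a minimal sketch. It assumes the response was built from nested name/value lists and written in the flat style, where each list becomes one JSON array of alternating names and values. The Pairs class and the toFlatJson helper are hypothetical stand-ins written for this example, not Solr classes.

```java
import java.util.ArrayList;
import java.util.List;

public class FlatNamedListSketch {

    // Hypothetical stand-in for an ordered name/value list that allows repeated names.
    static final class Pairs {
        final List<String> names = new ArrayList<>();
        final List<Object> values = new ArrayList<>();

        Pairs add(String name, Object value) {
            names.add(name);
            values.add(value);
            return this;
        }
    }

    // Renders a Pairs instance as a flat JSON array: name, value, name, value, ...
    // Anything else is rendered as a quoted string (no escaping; sketch only).
    static String toFlatJson(Object o) {
        if (o instanceof Pairs) {
            Pairs p = (Pairs) o;
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < p.names.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append('"').append(p.names.get(i)).append("\",");
                sb.append(toFlatJson(p.values.get(i)));
            }
            return sb.append("]").toString();
        }
        return "\"" + o + "\"";
    }

    // One cluster from the example above, built the name/value way.
    static Pairs exampleCluster() {
        return new Pairs()
            .add("labels", new Pairs().add("label", "DDR"))
            .add("docs", new Pairs()
                .add("doc", "TWINX2048-3200PRO")
                .add("doc", "VS1GB400C3")
                .add("doc", "VDBDB1A16"));
    }

    public static void main(String[] args) {
        System.out.println(toFlatJson(exampleCluster()));
    }
}
```

Every doc id is preceded by the same "doc" name, so a JSON client has to step over the odd positions to get at the values.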

Is "labels" needed because there could be multiple labels per cluster in the future?
(I assume yes)
Do we need more per-doc information than just the id? (I assume no)
Could we want other per-cluster information in the future? (I assume yes)
What other information might be added in the future?

Given the assumptions above, "clusters", "docs", and "labels" should all be arrays instead
of NamedLists (the names are just repeated, redundant info).
Each of the remaining NamedLists (just each "cluster") should be a SimpleOrderedMap, since
access by key is more important than order. That will give us something along the lines of:

{code}
"clusters" : [
    { "labels" : ["DDR"],
      "docs" : ["TWINX2048-3200PRO","VS1GB400C3","VDBDB1A16"]
    },
    { "labels" : ["Car Power Adapter"],
      "docs" : ["F8V7067-APL-KIT","IW-02"]
    }
]
{code}
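
A rough sketch of how the proposed shape could be assembled, using only plain Java collections: a List stands in for each array and a LinkedHashMap stands in for the keyed per-cluster map. These stand-ins, the class name, and the small toJson helper are assumptions made for illustration; the actual component would presumably use Solr's own response types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterResponseSketch {

    // One cluster: a keyed map holding a plain list of labels and a plain list of doc ids.
    static Map<String, Object> cluster(List<String> labels, List<String> docIds) {
        Map<String, Object> c = new LinkedHashMap<>();
        c.put("labels", labels);
        c.put("docs", docIds);
        return c;
    }

    // "clusters" itself is just a list; no per-item name is repeated.
    static List<Map<String, Object>> buildClusters() {
        List<Map<String, Object>> clusters = new ArrayList<>();
        clusters.add(cluster(Arrays.asList("DDR"),
                Arrays.asList("TWINX2048-3200PRO", "VS1GB400C3", "VDBDB1A16")));
        clusters.add(cluster(Arrays.asList("Car Power Adapter"),
                Arrays.asList("F8V7067-APL-KIT", "IW-02")));
        return clusters;
    }

    // Minimal JSON rendering for maps, lists, and strings (no escaping; sketch only).
    static String toJson(Object o) {
        if (o instanceof Map) {
            StringBuilder sb = new StringBuilder("{");
            boolean first = true;
            for (Map.Entry<?, ?> e : ((Map<?, ?>) o).entrySet()) {
                if (!first) sb.append(",");
                sb.append('"').append(e.getKey()).append("\":").append(toJson(e.getValue()));
                first = false;
            }
            return sb.append("}").toString();
        }
        if (o instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            boolean first = true;
            for (Object item : (List<?>) o) {
                if (!first) sb.append(",");
                sb.append(toJson(item));
                first = false;
            }
            return sb.append("]").toString();
        }
        return "\"" + o + "\"";
    }

    public static void main(String[] args) {
        System.out.println("{\"clusters\":" + toJson(buildClusters()) + "}");
    }
}
```

With this layout, further per-cluster information could later be added as another key in each map without changing how "labels" and "docs" are read.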

Make sense?

> Support Document and Search Result clustering
> ---------------------------------------------
>
>                 Key: SOLR-769
>                 URL: https://issues.apache.org/jira/browse/SOLR-769
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 1.4
>
>         Attachments: clustering-componet-shard.patch, clustering-libs.tar, clustering-libs.tar,
SOLR-769-analyzerClass.patch, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch,
SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip
>
>
> Clustering is a useful tool for working with documents and search results, similar to
the notion of dynamic faceting.  Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed,
library for doing search results clustering.  Mahout (http://lucene.apache.org/mahout) is
well suited for whole-corpus clustering.  
> The patch lays out a contrib module that starts off w/ an integration of a SearchComponent
for doing clustering and an implementation using Carrot.  In search results mode, it will
use the DocList as the input for the cluster.   While Carrot2 comes w/ a Solr input component,
it is not the same as the SearchComponent that I have in that the Carrot example actually
submits a query to Solr, whereas my SearchComponent is just chained into the Component list
and uses the ResponseBuilder to add in the cluster results.
> While not fully fleshed out yet, the collection based mode will take in a list of ids
or just use the whole collection and will produce clusters.  Since this is a longer, typically
offline task, there will need to be some type of storage mechanism (and replication??????)
for the clusters.  I _may_ push this off to a separate JIRA issue, but I at least want to
present the use case as part of the design of this component/contrib.  It may even make sense
that we split this out, such that the building piece is something like an UpdateProcessor
and then the SearchComponent just acts as a lookup mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

