lucene-solr-dev mailing list archives

From "Martijn van Groningen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-236) Field collapsing
Date Wed, 28 Oct 2009 22:18:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771155#action_12771155 ]

Martijn van Groningen commented on SOLR-236:
--------------------------------------------

It certainly has been going on for a long time :-)
Talking about the last miles, there are a few things on my mind about field collapsing:
* Change the response format. Currently, even I sometimes get confused by the information
returned in the response. The response should be more structured. Something like this:
{code:xml}
<lst name="collapse_counts">
    <str name="field">venue</str>
    <lst name="results">
        <lst name="233238"> <!-- id of most relevant document of the group -->
            <str name="fieldValue">melkweg</str>
            <int name="collapseCount">2</int>
            <!-- and other CollapseCollector specific collapse information -->
        </lst>
        ...
    </lst>
</lst>
{code}
Currently, when doing adjacent field collapsing, the _collapse_counts_ section gives results
that are unusable. It uses the field value as the key, which is not unique for adjacent
collapsing, as the two "hard" entries in this example show:
{code:xml}
<lst name="collapse_counts">
 <int name="hard">1</int>
 <int name="hard">1</int>
 <int name="electronics">1</int>
 <int name="memory">2</int>
 <int name="monitor">1</int>
</lst>
{code}
* Add the notion of a CollapseMatcher, that decides whether document field values are equal
or not and thus whether they are allowed to be collapsed. This opens the road for more exotic
features like fuzzy field collapsing and collapsing on more than one field. Also this allows
users of the patch to easily implement their own matching rules.
* Distributed field collapsing. Although I have some ideas on how to get started, from my
perspective it is not going to be performant, because somehow the field collapse state has to
be shared between shards in order to do proper field collapsing. This state can potentially
be a lot of data, depending on the specific search and corpus.
* And maybe add a collapse collector that collects statistics about most common field value
per collapsed group. 
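
To make the first two points concrete, here is a minimal Java sketch. It is not code from any
of the SOLR-236 patches: the CollapseMatcher interface, the Doc and Group classes, and the
collapseAdjacent method are all hypothetical names chosen for illustration, and the sketch
assumes that every adjacent match after the first document of a run is collapsed. It shows how
a pluggable matcher could decide group membership, and why identifying each group by the id of
its most relevant document avoids the duplicate "hard" keys seen above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: none of these types come from the SOLR-236 patches.
final class AdjacentCollapseSketch {

    /** Hypothetical matcher: decides whether two field values may be collapsed together. */
    interface CollapseMatcher {
        boolean matches(String groupValue, String candidateValue);
    }

    /** A ranked hit: document id plus its value for the collapse field. */
    static final class Doc {
        final String id;
        final String fieldValue;

        Doc(String id, String fieldValue) {
            this.id = id;
            this.fieldValue = fieldValue;
        }
    }

    /** One group, identified by the id of its first (most relevant) document. */
    static final class Group {
        final String headDocId;
        final String fieldValue;
        final int collapseCount; // documents hidden behind the head document

        Group(String headDocId, String fieldValue, int collapseCount) {
            this.headDocId = headDocId;
            this.fieldValue = fieldValue;
            this.collapseCount = collapseCount;
        }
    }

    /** Collapses runs of adjacent documents whose field values the matcher accepts. */
    static List<Group> collapseAdjacent(List<Doc> rankedDocs, CollapseMatcher matcher) {
        List<Group> groups = new ArrayList<>();
        Doc head = null;
        int collapsed = 0;
        for (Doc doc : rankedDocs) {
            if (head != null && matcher.matches(head.fieldValue, doc.fieldValue)) {
                collapsed++; // same run: hide this document behind the head
            } else {
                if (head != null) {
                    groups.add(new Group(head.id, head.fieldValue, collapsed));
                }
                head = doc; // a new run starts here
                collapsed = 0;
            }
        }
        if (head != null) {
            groups.add(new Group(head.id, head.fieldValue, collapsed));
        }
        return groups;
    }

    /** Made-up ranked result list in which the value "hard" occurs in two separate runs. */
    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc("1", "hard"), new Doc("2", "hard"),
                new Doc("3", "memory"),
                new Doc("4", "hard"), new Doc("5", "hard"));
    }

    public static void main(String[] args) {
        // Exact matching; String::equalsIgnoreCase would be a simple alternative rule.
        for (Group g : collapseAdjacent(sampleDocs(), String::equals)) {
            System.out.println(g.headDocId + " " + g.fieldValue + " " + g.collapseCount);
        }
    }
}
```

With the sample data, the two separate runs of "hard" end up as two distinct groups because
each is keyed by its head document id rather than by the field value. Swapping in a different
matcher changes the grouping rule without touching the collapsing loop.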

I think that this is more or less the roadmap from my side for field collapsing at the moment,
but feel free to elaborate on this.
Btw, I have recently written a [blog post|http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/]
about field collapsing in general that might be handy for anyone who is implementing field
collapsing.

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch,
field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch,
field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff,
SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
a single entry in the result set. Site collapsing is a special case of this, where all results
for a given web site are collapsed into one or two entries in the result set, typically with
an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

