lucene-solr-dev mailing list archives

From "Martijn van Groningen (JIRA)" <>
Subject [jira] Commented: (SOLR-236) Field collapsing
Date Sat, 12 Sep 2009 11:42:57 GMT


Martijn van Groningen commented on SOLR-236:

Hi Oleg, no, I have not made any progress. I'm still not clear on how to solve it efficiently,
as I wrote in my previous comment:

I was trying to come up with a solution to implement distributed field collapsing, but I ran
into a problem that I could not solve in an efficient manner.

Field collapsing keeps track of the number of documents collapsed per unique field value and
the total count of documents encountered per unique field value. If the total count is greater
than the specified collapse threshold, then the number of documents collapsed is the difference
between the total count and the threshold. Let's say we have two shards, and each shard has one
document with the same field value. The collapse threshold is one, meaning that if we run the
collapsing algorithm on each shard individually, neither document will ever be collapsed. But
when the algorithm is applied to both shards together, one of the documents must be collapsed;
however, neither shard knows that its document is the one to collapse.
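
The two-shard case above can be illustrated with a minimal sketch. It is not the code from the patch: it assumes each document is represented only by its collapse field value, and the function name collapse_counts and the value "example.com" are hypothetical.

```python
from collections import Counter

def collapse_counts(field_values, threshold):
    """Return {field_value: collapsed_count} for values seen more than `threshold` times.

    Hypothetical helper: the collapsed count is total count minus threshold,
    as described in the comment above.
    """
    totals = Counter(field_values)
    return {
        value: total - threshold
        for value, total in totals.items()
        if total > threshold
    }

# Each shard holds one document with the same field value (assumed data).
shard_a = ["example.com"]
shard_b = ["example.com"]
threshold = 1

# Collapsing on each shard in isolation: nothing exceeds the threshold.
per_shard = [collapse_counts(shard, threshold) for shard in (shard_a, shard_b)]

# Collapsing over the combined documents: one document must be collapsed.
combined = collapse_counts(shard_a + shard_b, threshold)

print(per_shard)
print(combined)
```

The per-shard results are empty, while the combined result reports one collapsed document, and nothing in either shard's local result indicates which document that should be.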

There are more situations like the one described above, but it all boils down to the fact that
each shard does not have meta information about the other shards in the cluster. Sharing the
intermediate collapse results between the shards is, in my opinion, not an option, because you
would then also need to share information about documents / fields that have a collapse count
of zero. This is totally impractical for large indexes.

Besides that, there is another problem with distributed field collapsing. Field collapsing
only keeps the most relevant document in the result set and collapses the less relevant ones.
If the results are sorted by score, field collapsing will fail to do this properly, because
there is no global scoring (idf) across the shards.
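
To make the scoring problem concrete, here is a rough sketch of how a per-shard idf can differ from an idf computed over all shards. It assumes the idf formula 1 + ln(numDocs / (docFreq + 1)) and uses made-up shard statistics; the function name and numbers are illustrative only.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed formula: 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics per shard: (documents containing the term, documents in shard)
shard_a = (1, 1000)    # term is rare on shard A
shard_b = (500, 1000)  # term is common on shard B

local_idf_a = idf(*shard_a)
local_idf_b = idf(*shard_b)

# idf computed as if both shards were a single index
global_idf = idf(shard_a[0] + shard_b[0], shard_a[1] + shard_b[1])

print(local_idf_a, local_idf_b, global_idf)
```

Because each shard weights the same term differently, scores from different shards are not directly comparable, so picking "the most relevant document" per field value across shards is unreliable without global statistics.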

Does anyone have an idea on how to solve this? The first problem seems related to the same kind
of problem that implementing global scoring has.

I recently read something about Katta and . Katta facilitates distributed search and has
support for global scoring. I'm not completely sure how it is implemented in Katta, but maybe
with Katta it is relatively efficient to share the intermediate collapse results between shards.

> Field collapsing
> ----------------
>                 Key: SOLR-236
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch,
field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch,
SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch,
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
a single entry in the result set. Site collapsing is a special case of this, where all results
for a given web site is collapsed into one or two entries in the result set, typically with
an associated "more documents from this site" link. See also Duplicate detection."
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
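
As an illustration of the three query parameters named in the quoted issue description, a request URL could be assembled roughly like this. The host, port, handler path, query term, and the field name "site" are assumptions for the example, not values taken from the issue.

```python
from urllib.parse import urlencode

# Parameter names come from the issue description; the values are illustrative.
params = {
    "q": "solr",               # assumed query term
    "collapse.field": "site",  # assumed field used to group results
    "collapse.type": "normal", # "normal" (default) or "adjacent"
    "collapse.max": 1,         # results allowed before collapsing
}

# Assumed base URL of a local Solr instance with the patch applied.
base_url = "http://localhost:8983/solr/select"
request_url = base_url + "?" + urlencode(params)

print(request_url)
```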

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
