lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-236) Field collapsing
Date Sat, 19 Dec 2009 22:28:20 GMT

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792917#action_12792917
] 

Yonik Seeley commented on SOLR-236:
-----------------------------------

First, thanks to everyone who has spent so much time working on this - lack of committer attention
doesn't equate to lack of interest... this is a very much needed feature!

I'd agree with Erik that the most important thing is the interface to the client, and making
it well thought out and semantically "tight".  Martijn's recent improvements to the response
structure is an example of improvements in this area.  It's also important to think about
the interface in terms of how easy it will be to add further features, optimizations, and
support distributed search.  If the code isn't sufficiently standalone, we also need to see
how easily it fits into the rest of Solr (what APIs it adds or modifies, etc).  Actually implementing
performance improvements and more distributed search can come later - as long as we've thought
about it now so we haven't boxed ourselves in.

It seems like field collapsing should just be additional functionality of the query component
rather than a separate component since it changes the results?

The most basic question about the interface would be how to present groups.  Do we stick with
a linear document list and supplement that with extra info in a different part of the response
(as the current approach takes)?  Or stick that extra info in with some of the documents somehow?
 Or if collapse=true, replace the list of documents with a list of groups, each which can
contain many documents?  Which will be easiest for clients to deal with?  If you were starting
from scratch and didn't have to deal with any of Solr's current shortcomings, what would it
look like?

>From the wiki:
collapse.maxdocs - what does this actually mean?  I assume it collects arbitrary documents
up to the max (normally by index order)?  Does this really make sense?  Does it affect faceting,
etc?  If it does make sense, it seems like it would also make sense for normal non-collapsed
query results too, in which case it should be implemented at that level.

collapse.info.doc - what does that do?  I understand counts per group, but what's count per
doc?

collapse.includeCollapsedDocs.fl - I don't understand this one, and can't find an example
on the wiki or blogs.  It says "Parameter indicating to return the collapsed documents in
the response"... but I thought documents were included up until collapse.threshold.

collapse.debug - should perhaps just be rolled into debugQuery, or another general debug param
(someone recently suggested using a comma separated list... debug=timings,query, etc.

Should I be able to specify a completely different sort *within* a group?  collapse.sort=...
 seems nice... what are the implications?  One bit of strangeness: it would seem to allow
a highly ranked document responsible for the group being at the top of the list being dropped
from the group due to a different sort criteria within the group.  It's not necessarily an
implementation problem though (sort values for the group should be maintained separately).

Is there a way to specify the number of groups that I want back instead of the number of documents?
 Or am I supposed to just over-request (rows=num_groups_I_want*threshold) and ignore if I
get too many documents back?

Random thought: We need a test to make sure this works with multi-select faceting (SimpleFacets
asks for the docset of be base query...)

Distributed Search: should be able to use the same type of algorithm that faceting does to
ensure accurate counts.

Performance: yes, it looks like the current code uses a *lot* of memory.
Here's an algorithm that I thought of on my last plane ride that can do much better (assuming
max() is the aggregation function):

{code}
=================== two pass collapsing algorithm for collapse.aggregate=max ====================
First pass: pretend that collapseCount=1
  - Use a TreeSet as  a priority queue since one can remove and insert entries.
  - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry
in the TreeSet
  - compare new doc with smallest element in treeset.  If smaller discard and go to the next
doc.
  - If new doc is bigger, look up it's group.  Use the Map to find if the group has been added
to the TreeSet and add it if not.
  - If the new bigger doc is already in the TreeSet, compare with the document in that group.
 If bigger, update the node,
    remove and re-add to the TreeSet to re-sort.

efficiency: the treeset and hashmap are both only the size of the top number of docs we are
looking at (10 for instance)
We will now have the top 10 documents collapsed by the right field with a collapseCount of
1.  Put another way, we have the top 10 groups.

Second pass (if collapseCount>1):
 - create a priority queue for each group (10) of size collapseCount
 - re-execute the query (or if the sort within the collapse groups does not involve score,
we could just use the docids gathered during phase 1)
 - for each document, find it's appropriate priority queue and insert
 - optimization: we can use the previous info from phase1 to even avoid creating a priority
queue if no other items matched.

So instead of creating collapse groups for every group in the set (as is done now?), we create
it for only 10 groups.
Instead of collecting the score for every document in the set (40MB per request for a 10M
doc index is *big*) we re-execute the query if needed.
We could optionally store the score as is done now... but I bet aggregate throughput on large
indexes would be better by just re-executing.

Other thought: we could also cache the first phase in the query cache which would allow one
to quickly move to the 2nd phase for any collapseCount.
{code}


> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch,
collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch,
field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch,
field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch,
field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff,
field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, quasidistributed.additional.patch,
SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch,
SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch,
SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to
a single entry in the result set. Site collapsing is a special case of this, where all results
for a given web site is collapsed into one or two entries in the result set, typically with
an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message