lucene-solr-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <>
Subject [jira] Created: (SOLR-1773) Field Collapsing (lightweight version)
Date Sun, 14 Feb 2010 02:02:27 GMT
Field Collapsing (lightweight version)

                 Key: SOLR-1773
             Project: Solr
          Issue Type: New Feature
          Components: search
    Affects Versions: 1.4
            Reporter: Koji Sekiguchi
            Priority: Minor

I'd like to start on another approach to field collapsing, suggested by Yonik on 19/Dec/09 in
SOLR-236. Re-posting the idea:

=================== two pass collapsing algorithm for collapse.aggregate=max ====================
First pass: pretend that collapseCount=1
  - Use a TreeSet as a priority queue, since it supports both removing and inserting entries.
  - A HashMap<Key,TreeSetEntry> maps each collapse group to its top entry in the TreeSet.
  - Compare each new doc with the smallest element in the TreeSet. If the new doc is smaller,
discard it and go to the next.
  - If the new doc is bigger, look up its group. Use the Map to check whether the group has
already been added to the TreeSet, and add it if not.
  - If the group is already in the TreeSet, compare the new doc with the document currently in
that group. If the new doc is bigger, update the node, then remove and re-add it to the
TreeSet to re-sort.

Efficiency: the TreeSet and HashMap are both only the size of the top number of docs we are
looking at (10, for instance).
We will now have the top 10 documents collapsed by the right field with a collapseCount of
1. Put another way, we have the top 10 groups.
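The first pass above can be sketched in Java roughly as follows. This is only an illustration of the TreeSet/HashMap bookkeeping, not Solr code; the class and field names (FirstPassCollapse, Entry, etc.) are made up for the example, and group keys are simplified to Strings.

```java
import java.util.*;

// Sketch of the first pass: pretend collapseCount=1 and keep only the
// top-N group "leaders". The TreeSet orders leaders by score (smallest
// first, so it can be evicted cheaply); the HashMap finds a group's
// leader so it can be removed and re-added to re-sort.
public class FirstPassCollapse {
    static final class Entry {
        final String group;
        int doc;
        float score;
        Entry(String group, int doc, float score) {
            this.group = group; this.doc = doc; this.score = score;
        }
    }

    private final int topN;
    private final TreeSet<Entry> queue;
    private final Map<String, Entry> groupToEntry = new HashMap<>();

    FirstPassCollapse(int topN) {
        this.topN = topN;
        // Ascending by score so first() is the smallest leader;
        // tie-break by doc id so distinct entries never compare equal.
        this.queue = new TreeSet<>(Comparator.comparingDouble((Entry e) -> e.score)
                                             .thenComparingInt(e -> e.doc));
    }

    void collect(int doc, float score, String group) {
        Entry existing = groupToEntry.get(group);
        if (existing != null) {
            // Group already has a leader: keep whichever doc is bigger.
            if (score > existing.score) {
                queue.remove(existing);         // remove, update, re-add to re-sort
                existing.doc = doc;
                existing.score = score;
                queue.add(existing);
            }
            return;
        }
        if (queue.size() < topN) {
            Entry e = new Entry(group, doc, score);
            queue.add(e);
            groupToEntry.put(group, e);
        } else if (score > queue.first().score) {
            Entry smallest = queue.pollFirst(); // evict the smallest leader's group
            groupToEntry.remove(smallest.group);
            Entry e = new Entry(group, doc, score);
            queue.add(e);
            groupToEntry.put(group, e);
        }
        // else: smaller than every leader in the queue -> discard
    }

    // Top groups, best first.
    List<Entry> topGroups() {
        return new ArrayList<>(queue.descendingSet());
    }
}
```

Note that both structures stay bounded at topN entries regardless of how many docs or groups the query matches, which is the point of the approach.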

Second pass (if collapseCount>1):
 - Create a priority queue of size collapseCount for each of the (e.g. 10) top groups.
 - Re-execute the query (or, if the sort within the collapse groups does not involve score,
we could just use the docids gathered during phase 1).
 - For each document, find its appropriate priority queue and insert it.
 - Optimization: we can use the info from phase 1 to avoid even creating a priority
queue for a group if no other items matched it.
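A minimal sketch of the second pass, again with invented names (SecondPassCollapse, ScoredDoc) and String group keys rather than anything from Solr. It keeps one bounded priority queue per surviving group, smallest score at the head, and ignores docs outside the top groups:

```java
import java.util.*;

// Sketch of the second pass (collapseCount > 1): one PriorityQueue of
// size collapseCount per top group, fed by re-executing the query.
public class SecondPassCollapse {
    static final class ScoredDoc {
        final int doc;
        final float score;
        ScoredDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int collapseCount;
    private final Map<String, PriorityQueue<ScoredDoc>> groupQueues = new HashMap<>();

    SecondPassCollapse(Set<String> topGroups, int collapseCount) {
        this.collapseCount = collapseCount;
        // Per the optimization above, these could instead be created lazily,
        // only once a second doc for the group shows up.
        for (String g : topGroups) {
            groupQueues.put(g,
                new PriorityQueue<>(Comparator.comparingDouble((ScoredDoc d) -> d.score)));
        }
    }

    // Called for each doc while re-executing the query.
    void collect(int doc, float score, String group) {
        PriorityQueue<ScoredDoc> pq = groupQueues.get(group);
        if (pq == null) return;                 // not one of the top groups: skip
        if (pq.size() < collapseCount) {
            pq.add(new ScoredDoc(doc, score));
        } else if (score > pq.peek().score) {
            pq.poll();                          // evict the group's smallest doc
            pq.add(new ScoredDoc(doc, score));
        }
    }

    // The collapseCount best docs for a group, best first.
    List<Integer> groupDocs(String group) {
        List<ScoredDoc> entries = new ArrayList<>(groupQueues.get(group));
        entries.sort((a, b) -> Float.compare(b.score, a.score));
        List<Integer> docs = new ArrayList<>();
        for (ScoredDoc d : entries) docs.add(d.doc);
        return docs;
    }
}
```

Memory here is bounded by (number of top groups) × collapseCount, independent of the total hit count.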

So instead of creating collapse groups for every group in the set (as is done now?), we create
them for only 10 groups.
Instead of collecting the score for every document in the set (40MB per request for a 10M-doc
index is *big*), we re-execute the query if needed.
We could optionally store the scores as is done now... but I bet aggregate throughput on large
indexes would be better by just re-executing.

Other thought: we could also cache the first phase in the query cache, which would allow one
to quickly move to the 2nd phase for any collapseCount.

The restriction is:

one would not be able to tell the total number of collapsed docs, or the total number of hits
(or the DocSet) after collapsing. So only collapse.facet=before would be supported.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
