lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <>
Subject [jira] Resolved: (SOLR-1773) Field Collapsing (lightweight version)
Date Sat, 25 Sep 2010 07:49:33 GMT


Koji Sekiguchi resolved SOLR-1773.

    Resolution: Duplicate

Closed as a duplicate of SOLR-236 and SOLR-1682.

> Field Collapsing (lightweight version)
> --------------------------------------
>                 Key: SOLR-1773
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LOADTEST.patch, SOLR-1773.patch
> I'd like to start another approach for field collapsing, suggested by Yonik on 19/Dec/09 at SOLR-236. Re-posting the idea:
> {code}
> =================== two pass collapsing algorithm for collapse.aggregate=max ====================
> First pass: pretend that collapseCount=1
>   - Use a TreeSet as a priority queue, since one can remove and insert entries.
>   - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
>   - compare the new doc with the smallest element in the TreeSet. If smaller, discard it and go to the next doc.
>   - If the new doc is bigger, look up its group. Use the Map to find whether the group has been added to the TreeSet, and add it if not.
>   - If the new doc's group is already in the TreeSet, compare with the document in that group. If bigger, update the node, then remove and re-add it to the TreeSet to re-sort.
> Efficiency: the TreeSet and HashMap are both only the size of the top number of docs we are looking at (10, for instance)
> We will now have the top 10 documents collapsed by the right field with a collapseCount of 1. Put another way, we have the top 10 groups.
> Second pass (if collapseCount>1):
>  - create a priority queue for each group (10) of size collapseCount
>  - re-execute the query (or, if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
>  - for each document, find its appropriate priority queue and insert
>  - optimization: we can use the previous info from phase 1 to avoid even creating a priority queue if no other items matched
> So instead of creating collapse groups for every group in the set (as is done now?), we create them for only 10 groups.
> Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*), we re-execute the query if needed.
> We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
> Other thought: we could also cache the first phase in the query cache, which would allow one to quickly move to the 2nd phase for any collapseCount.
> {code}
> The restriction is:
> {quote}
> one would not be able to tell the total number of collapsed docs, or the total number of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported.
> {quote}
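The two-pass idea quoted above can be sketched in Java (Solr's implementation language). This is a minimal illustrative sketch, not code from the SOLR-1773 or SOLR-236 patches; the `Doc` record and the class and method names are assumptions for the example, and real collectors would walk Lucene doc ids rather than an in-memory list.

```java
import java.util.*;

/** Sketch of the two-pass collapsing idea (collapse.aggregate=max).
 *  All names here are illustrative, not Solr APIs. */
public class TwoPassCollapse {

    /** A stand-in for a scored hit: doc id, collapse-field value, score. */
    record Doc(int id, String group, float score) {}

    /** Mutable node held in the TreeSet: the best doc seen for one group. */
    static final class Entry {
        Doc best;
        Entry(Doc d) { best = d; }
    }

    /** First pass: find the top-N groups, pretending collapseCount == 1. */
    static List<Entry> firstPass(List<Doc> docs, int topN) {
        // Ascending by score so first() is the smallest; break ties by doc id.
        Comparator<Entry> byScore = Comparator
                .comparingDouble((Entry e) -> e.best.score())
                .thenComparingInt(e -> e.best.id());
        TreeSet<Entry> queue = new TreeSet<>(byScore);      // priority queue that supports removal
        Map<String, Entry> groupToEntry = new HashMap<>();  // collapse group -> entry in the TreeSet

        for (Doc doc : docs) {
            // Compare with the smallest retained entry; if not bigger, discard.
            if (queue.size() == topN && doc.score() <= queue.first().best.score()) {
                continue;
            }
            Entry existing = groupToEntry.get(doc.group());
            if (existing == null) {
                Entry e = new Entry(doc);
                queue.add(e);
                groupToEntry.put(doc.group(), e);
                if (queue.size() > topN) {                  // evict the smallest group
                    Entry evicted = queue.pollFirst();
                    groupToEntry.remove(evicted.best.group());
                }
            } else if (doc.score() > existing.best.score()) {
                queue.remove(existing);                     // remove, update, re-add to re-sort
                existing.best = doc;
                queue.add(existing);
            }
        }
        List<Entry> top = new ArrayList<>(queue);
        Collections.reverse(top);                           // best group first
        return top;
    }

    /** Second pass: one bounded queue of size collapseCount per top group,
     *  filled by re-scanning the hits; non-top groups are skipped. */
    static Map<String, List<Doc>> secondPass(List<Doc> docs, List<Entry> topGroups,
                                             int collapseCount) {
        Map<String, PriorityQueue<Doc>> queues = new HashMap<>();
        for (Entry e : topGroups) {
            queues.put(e.best.group(),
                       new PriorityQueue<>(Comparator.comparingDouble(Doc::score)));
        }
        for (Doc doc : docs) {
            PriorityQueue<Doc> pq = queues.get(doc.group());
            if (pq == null) continue;                       // not one of the top groups
            pq.add(doc);
            if (pq.size() > collapseCount) pq.poll();       // keep only the collapseCount best
        }
        Map<String, List<Doc>> result = new LinkedHashMap<>();
        for (Entry e : topGroups) {
            List<Doc> members = new ArrayList<>(queues.get(e.best.group()));
            members.sort(Comparator.comparingDouble(Doc::score).reversed());
            result.put(e.best.group(), members);
        }
        return result;
    }
}
```

Note how the memory use matches the quoted claim: both passes hold only on the order of `topN * collapseCount` entries, rather than per-group state for every group in the result set.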

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
