lucene-solr-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <>
Subject [jira] Issue Comment Edited: (SOLR-1773) Field Collapsing (lightweight version)
Date Sun, 14 Feb 2010 08:20:27 GMT


Koji Sekiguchi edited comment on SOLR-1773 at 2/14/10 8:19 AM:

Oops, I had glanced at the SOLR-236 related issues, but from the Description I thought it was about finalizing the response format. I'll look into SOLR-1682. Thanks! :)

      was (Author: koji):
    Oops, I've glanced at SOLR-236 related issues, but I wasn't awake to the existence. I'll
look into SOLR-1682. Thanks! :)
> Field Collapsing (lightweight version)
> --------------------------------------
>                 Key: SOLR-1773
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: LOADTEST.patch, SOLR-1773.patch
> I'd like to start another approach for field collapsing suggested by Yonik on 19/Dec/09
at SOLR-236. Re-posting the idea:
> {code}
> =================== two pass collapsing algorithm for collapse.aggregate=max ====================
> First pass: pretend that collapseCount=1
>   - Use a TreeSet as a priority queue, since one can remove and insert entries.
>   - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top
entry in the TreeSet
>   - compare new doc with smallest element in treeset.  If smaller discard and go to the
next doc.
>   - If the new doc is bigger, look up its group.  Use the Map to find whether the group has
already been added to the TreeSet, and add it if not.
>   - If the new doc's group is already in the TreeSet, compare with the document in that
group.  If bigger, update the node,
>     then remove and re-add it to the TreeSet to re-sort.
> efficiency: the treeset and hashmap are both only the size of the top number of docs
we are looking at (10 for instance)
> We will now have the top 10 documents collapsed by the right field with a collapseCount
of 1.  Put another way, we have the top 10 groups.
> Second pass (if collapseCount>1):
>  - create a priority queue for each group (10) of size collapseCount
>  - re-execute the query (or if the sort within the collapse groups does not involve score,
we could just use the docids gathered during phase 1)
>  - for each document, find its appropriate priority queue and insert
>  - optimization: we can use the previous info from phase1 to even avoid creating a priority
queue if no other items matched.
> So instead of creating collapse groups for every group in the set (as is done now?),
we create it for only 10 groups.
> Instead of collecting the score for every document in the set (40MB per request for a
10M doc index is *big*) we re-execute the query if needed.
> We could optionally store the score as is done now... but I bet aggregate throughput
on large indexes would be better by just re-executing.
> Other thought: we could also cache the first phase in the query cache which would allow
one to quickly move to the 2nd phase for any collapseCount.
> {code}
> The restriction is:
> {quote}
> one would not be able to tell the total number of collapsed docs, or the total number
of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported.
> {quote}
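
The first pass described above can be sketched in Java. This is a hypothetical illustration, not code from the SOLR-1773 patch: the class and field names (FirstPassCollapser, Entry, groupToEntry) are invented for clarity, and scores stand in for whatever sort criterion the collapse actually uses. It shows the TreeSet-as-priority-queue plus HashMap bookkeeping, bounded to the top N groups:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the first pass (collapseCount=1): keep the single
// best document per collapse group, bounded to the top N groups overall.
public class FirstPassCollapser {

    // One entry per collapse group currently held in the top set.
    static final class Entry {
        final Object groupKey;
        int docId;
        float score;
        Entry(Object k, int d, float s) { groupKey = k; docId = d; score = s; }
    }

    private final int topN;
    // TreeSet works as a priority queue that also supports removing an
    // arbitrary entry, which a plain heap does not.
    private final TreeSet<Entry> queue = new TreeSet<>(
        Comparator.comparingDouble((Entry e) -> e.score)
                  .thenComparingInt(e -> e.docId));
    // Maps collapse group -> its single entry in the TreeSet.
    private final Map<Object, Entry> groupToEntry = new HashMap<>();

    public FirstPassCollapser(int topN) { this.topN = topN; }

    public void collect(int docId, float score, Object groupKey) {
        Entry existing = groupToEntry.get(groupKey);
        if (existing != null) {
            // Group already present: replace its doc only if the new one is better.
            if (score > existing.score) {
                queue.remove(existing);   // remove, update, re-add to re-sort
                existing.docId = docId;
                existing.score = score;
                queue.add(existing);
            }
            return;
        }
        if (queue.size() < topN) {
            // Room left: admit the new group unconditionally.
            Entry e = new Entry(groupKey, docId, score);
            queue.add(e);
            groupToEntry.put(groupKey, e);
        } else if (score > queue.first().score) {
            // Queue full but new doc beats the smallest group head: evict it.
            Entry smallest = queue.pollFirst();
            groupToEntry.remove(smallest.groupKey);
            Entry e = new Entry(groupKey, docId, score);
            queue.add(e);
            groupToEntry.put(groupKey, e);
        }
        // else: new doc is smaller than everything retained; discard it.
    }

    public TreeSet<Entry> topGroups() { return queue; }
}
```

As noted in the algorithm, both structures stay at size topN regardless of index size; the second pass would then build one small priority queue per surviving group and re-execute the query to fill them.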

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
