lucene-java-user mailing list archives

From Marcus Herou <marcus.he...@tailsweep.com>
Subject Re: Group by in Lucene ?
Date Sun, 01 Feb 2009 15:00:42 GMT
Yep. An external sort should probably be used when flushing to disk. I have
written such code, so that part is probably a no-brainer; the problem is
making it fast :)
<http://dev.tailsweep.com/projects/utils/apidocs/org/tailsweep/utils/sort/TupleSorter.html>
http://dev.tailsweep.com/projects/utils/apidocs/com/tailsweep/utils/sort/TupleSorter.html
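For readers who have not seen one, a minimal sketch of an external merge sort follows. It is not the TupleSorter code linked above; the class name ExternalLineSort, the one-value-per-line file format and the maxLinesInMemory parameter are assumptions made for illustration. It sorts fixed-size chunks in memory, writes each as a run file, then does a k-way merge of the runs with a priority queue.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only: names and file format are assumptions, not the TupleSorter API.
final class ExternalLineSort {

    // One open run file plus the line currently at its head.
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String head;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
        if (maxLinesInMemory < 1) {
            throw new IllegalArgumentException("maxLinesInMemory must be >= 1");
        }
        // Phase 1: cut the input into sorted runs that each fit in memory.
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesInMemory) {
                    runs.add(writeRun(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                runs.add(writeRun(buffer));
            }
        }
        // Phase 2: k-way merge; the heap always yields the smallest head line.
        PriorityQueue<Run> heap = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path runFile : runs) {
                Run run = new Run(Files.newBufferedReader(runFile, StandardCharsets.UTF_8));
                if (run.head != null) heap.add(run); else run.reader.close();
            }
            while (!heap.isEmpty()) {
                Run run = heap.poll();
                out.write(run.head);
                out.newLine();
                if (run.advance()) heap.add(run); else run.reader.close();
            }
        } finally {
            for (Run run : heap) run.reader.close();
            for (Path runFile : runs) Files.deleteIfExists(runFile);
        }
    }

    private static Path writeRun(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("sort-run", ".txt");
        Files.write(run, buffer, StandardCharsets.UTF_8);
        return run;
    }
}
```

The speed problem mentioned above is not addressed here: this version opens every run at once and compares whole lines as strings, which is simple but probably far from the fastest layout.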

Another way could be to use HDFS and MapFiles/SequenceFiles. Not speedy at
all, but scalable.

I am thinking of writing my own inverted index, specialized for this kind of
operation. Any pointers on where to start looking for material on that?

/Marcus

On Wed, Jan 28, 2009 at 5:02 PM, Mark Miller <markrmiller@gmail.com> wrote:

> Group-by in Lucene/Solr has not been solved in a great general way yet to
> my knowledge.
>
> Ideally, we would want a solution that does not need to fit into memory.
> However, you need the value of the field for each document to do the
> grouping. As you are finding, this is not cheap to get. Currently, the
> efficient way to get it is to use a FieldCache. This, however, requires that
> every distinct value can fit into memory.
>
> Once you have efficient access to the values, you need to be able to
> efficiently group the results, again not bounded by memory (which we already
> are with the FieldCache).
>
> There are quite a few ways to do this. The simplest is to group until you
> have used all the memory you want; then, for everything left, if a value
> matches an existing group, increment the group count, otherwise write it
> to a file. Use the overflow file as the input for the next run, and repeat
> until there is no overflow. You can improve on that by partitioning the
> overflow file.
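
The multi-pass overflow scheme described above might look roughly like the sketch below. It is only an illustration under assumptions: the class name OverflowGrouping, the one-value-per-line input file, the maxGroupsInMemory budget (a group count standing in for a real memory limit) and the emit callback are all hypothetical, and the partitioning refinement is left out.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative sketch only: names and the input format are assumptions.
final class OverflowGrouping {

    // Counts occurrences of each line in `input`, holding at most
    // `maxGroupsInMemory` distinct groups in memory per pass.
    static void countGroups(Path input, int maxGroupsInMemory,
                            BiConsumer<String, Long> emit) throws IOException {
        if (maxGroupsInMemory < 1) {
            throw new IllegalArgumentException("maxGroupsInMemory must be >= 1");
        }
        Path current = input;
        while (true) {
            Map<String, Long> groups = new HashMap<>();
            Path overflow = Files.createTempFile("groupby-overflow", ".txt");
            long spilled = 0;
            try (BufferedReader in = Files.newBufferedReader(current, StandardCharsets.UTF_8);
                 BufferedWriter out = Files.newBufferedWriter(overflow, StandardCharsets.UTF_8)) {
                String value;
                while ((value = in.readLine()) != null) {
                    Long count = groups.get(value);
                    if (count != null) {
                        groups.put(value, count + 1);   // known group: just count it
                    } else if (groups.size() < maxGroupsInMemory) {
                        groups.put(value, 1L);          // still room for a new group
                    } else {
                        out.write(value);               // no room: defer to the next pass
                        out.newLine();
                        spilled++;
                    }
                }
            }
            // Groups held in this pass are complete: a value is spilled only if it
            // never made it into the map, so none of its occurrences were counted here.
            groups.forEach(emit);
            if (!current.equals(input)) {
                Files.delete(current);                  // drop the previous overflow file
            }
            if (spilled == 0) {
                Files.delete(overflow);
                return;
            }
            current = overflow;                         // overflow becomes the next input
        }
    }
}
```

Each pass finishes at least maxGroupsInMemory distinct groups, so the number of passes is roughly the number of distinct values divided by that budget, which is why partitioning the overflow file helps when there are many distinct values.
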
>
> And then there are a dozen other methods.
>
> Solr has a patch in JIRA that uses a sorting method. First the results are
> sorted on the group-by field, then scanned through for grouping - all field
> values that are the same will be next to each other. Finally, if you really
> wanted to sort on a different field, another sort is applied. That's not
> ideal IMO, but it's a start.
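
The sort-then-scan idea can be sketched in a few lines. This is not the code from the Solr patch; the class name SortScanGrouping is hypothetical, and the input is assumed to be a plain list holding the group-by field value of each hit. After sorting, equal values are adjacent, so one linear scan is enough to close each group.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: not the Solr patch, names are assumptions.
final class SortScanGrouping {

    // Returns group value -> number of hits, in sorted group order.
    static Map<String, Integer> countGroups(List<String> groupValues) {
        List<String> sorted = new ArrayList<>(groupValues);
        Collections.sort(sorted);                // equal values become neighbours

        Map<String, Integer> counts = new LinkedHashMap<>();
        String current = null;
        int run = 0;
        for (String value : sorted) {
            if (current != null && !value.equals(current)) {
                counts.put(current, run);        // value changed: close the group
                run = 0;
            }
            current = value;
            run++;
        }
        if (current != null) {
            counts.put(current, run);            // close the last group
        }
        return counts;
    }
}
```

The in-memory sort here could be swapped for an external sort when the hits do not fit in memory; the scan itself only ever needs the current group.
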
>
> - Mark


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/
