lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Wang <john.w...@gmail.com>
Subject Re: Faceted search with OpenBitSet/SortedVIntList
Date Sat, 07 Feb 2009 23:35:10 GMT
Our implementation of facet search can handle this. Using bitsets for
intersection is not scalable performance wise when index is large.

We are using a compact forwarded index representation in memory for the
counting. Similar to FieldCache idea but more compact.

Check it out at: http://sourceforge.net/projects/bobo-browse/
Use the branch: BR_DEV_1_5_0
It is rather stable now, we are releasing it soon.

We are using this on production for LinkedIn with live traffic and it is
working well.

Test it out to see if your performance requirement is met and let us know
how to improve.

Thanks

-John

On Sat, Feb 7, 2009 at 2:06 PM, Paul Elschot <paul.elschot@xs4all.nl> wrote:

> On Saturday 07 February 2009 19:57:19 Raffaella Ventaglio wrote:
> > Hi,
> >
> > I am trying to implement a kind of faceted search using Lucene 2.4.0.
> >
> > I have a list of configuration rules that tell me how to generate this
> > facets and the corresponding queries (that can range from simple term
> > queries to complex boolean queries).
> >
> > When my application starts, it creates the whole set of facets objects
> and
> > initializes them.
> > For each facet:
> > - I create the query according to the configured rule;
> > - I ask the reader for the bitset corresponding to that query and I store
> it
> > in the Facet object;
> > - I get the cardinality of the bitset and I save it in the Facet object
> as
> > its "initial count".
> >
> > When the user does a search I have to update the "counts" associated to
> each
> > Facet:
> > - I get the bitset corresponding to the "query + filter" generated by the
> > user search;
> > - I get the cardinality of the ("search bitset" AND "facet bitset") and I
> > save it as the updated count.
> >
> >
> > In my first solution, I used only "OpenBitSetDISI" objects, both for
> Facet
> > bitset and for search bitset.
> > So I could use "intersectionCount" method to get updated counts after
> user
> > search.
> >
> > This works very well and it is very fast, but when the number of
> documents
> > in the index and the number of facets grows it is too memory consuming.
> >
> >
> > So I tried a different solution: when I create facet bitsets I use the
> same
> > rule applied in ChainedFilter/BooleanFilter to decide if I have to store
> an
> > OpenBitSet or a SortedVIntList.
> > When I have to calculate updated counts:
> > - if the facet has an OpenBitSet, I use the "intersectionCount" method
> > directly;
> > - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
> > using the SortedVIntList.iterator and then I use the "intersectionCount"
> > method.
> >
> > In this way, I use a smaller amount of memory at initialization time, but
> > for each user search I create a large number of objects (that I suddenly
> > throw away) and this affects application performance because it wastes a
> lot
> > of time doing GC.
> >
> > So my question is: is there a better way to accomplish this task?
> >
> > I think, it would be fine if I could calculate "intersectionCount"
> directly
> > on SortedVIntList objects, but I have not found nothing like that in
> Lucene
> > 2.4 JavaDoc.
> > Am I missing something?
>
> You are not missing anything.
>
> OpenBitSet has an optimized implementation for intersection count, and
> there is no counterpart of that in SortedVIntList because until now there
> has been no need for it.
>
> One way to implement that would be to use one of the boolean combination
> filters in contrib, BooleanFilter or ChainedFilter,  and simply count the
> the number of times next() returns true on the result.
>
> In case the performance of that is not good enough, another way would be
> to directly add an intersection count method to SortedVIntList.
> However, SortedVIntList does not allow for an efficient iterator
> implementation of skipTo(), and skipTo() is used intensively by
> intersections.
>
> > As a reference, now my index contains more than 500.000 documents and I
> have
> > to create/manage up to 50.000 facets.
> > Using "second solution", at initialization time my facets structure
> requires
> > more or less 120MB (and this is good enough), while updating counts it
> uses
> > even 2GB of memory (and this is very bad).
>
> 50.000 facets? Well, in case the performance of the last suggestion is
> not good enough, one could try and implement a better data structure
> than OpenBitSet and SortedVIntList to provide a DocIdSetIterator,
> preferably with a fast skipTo() and possibly with a fast intersection
> count.
> In that case, you may want to ask further on the java-dev list.
>
> Regards,
> Paul Elschot
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message