lucene-java-user mailing list archives

From Chris Lu <chris...@gmail.com>
Subject Re: Faceted search with OpenBitSet/SortedVIntList
Date Sat, 07 Feb 2009 20:29:28 GMT
The first approach becomes rather limiting as the number of facets grows.

The "SortedVIntList" approach is similar to the field cache. For faceted
search it is better to use the FieldCache directly; that is the "normal"
approach, used in tools like Solr, DBSight, Bobo Browse Engine, etc.
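
As a rough illustration of the field-cache style of counting, here is a
minimal sketch in plain Java. It is not Lucene code: java.util.BitSet stands
in for the search bitset, and the docToFacets array is a hypothetical
per-document structure (built once at startup) that lists the facet ordinals
each document belongs to. The class and method names are made up for this
example.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Sketch: count facets by walking the search hits once, field-cache style. */
final class FieldCacheStyleFacetCounter {
    // docToFacets[doc] holds the facet ordinals of document doc.
    // Hypothetical structure, filled once when the application starts.
    private final int[][] docToFacets;
    private final int numFacets;

    FieldCacheStyleFacetCounter(int[][] docToFacets, int numFacets) {
        this.docToFacets = docToFacets;
        this.numFacets = numFacets;
    }

    /** Returns counts[f] = number of hit documents that belong to facet f. */
    int[] count(BitSet searchHits) {
        int[] counts = new int[numFacets];
        // Visit only the documents matched by the user search.
        for (int doc = searchHits.nextSetBit(0);
             doc >= 0 && doc < docToFacets.length;
             doc = searchHits.nextSetBit(doc + 1)) {
            for (int facet : docToFacets[doc]) {
                counts[facet]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four documents, two facets; document 3 belongs to no facet.
        int[][] docToFacets = { {0}, {0, 1}, {1}, {} };
        BitSet hits = new BitSet();
        hits.set(1);
        hits.set(2);
        hits.set(3);
        int[] counts = new FieldCacheStyleFacetCounter(docToFacets, 2).count(hits);
        System.out.println(Arrays.toString(counts));
    }
}
```

With this layout the work per search is proportional to the number of hits
times the facets per document, and the only allocation is one int array of
counts.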

To avoid creating a lot of objects and quickly throwing them away, you can
increase the JVM's Eden (young generation) size, or you can allocate a pool
of objects up front and re-use them.
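
A minimal sketch of the re-use idea, again in plain Java rather than Lucene:
an int[] of sorted document ids stands in for a SortedVIntList and
java.util.BitSet stands in for the search bitset. The first method probes
the search bitset for each id and allocates nothing; the second re-uses one
caller-supplied scratch bitset across all facets instead of creating a new
one per facet. The names are hypothetical, and mapping this onto
SortedVIntList.iterator() and OpenBitSet is an assumption that is not tested
here.

```java
import java.util.BitSet;

/** Sketch: intersection counts for sparse facets without per-facet garbage. */
final class SparseFacetCounter {

    /** Counts ids that are also set in searchHits; allocates no objects. */
    static int intersectionCount(int[] sortedDocIds, BitSet searchHits) {
        int count = 0;
        for (int doc : sortedDocIds) {
            if (searchHits.get(doc)) {
                count++;
            }
        }
        return count;
    }

    /** Same result, but fills a re-usable scratch bitset supplied by the caller. */
    static int intersectionCountReusing(BitSet scratch, int[] sortedDocIds,
                                        BitSet searchHits) {
        scratch.clear();                 // clears the bits, keeps the object
        for (int doc : sortedDocIds) {
            scratch.set(doc);
        }
        scratch.and(searchHits);         // in-place AND, no new bitset
        return scratch.cardinality();
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(5);
        hits.set(9);
        hits.set(10);
        BitSet scratch = new BitSet();   // created once, shared by all facets
        int[] facetA = {1, 5, 9};
        int[] facetB = {10};
        System.out.println(intersectionCount(facetA, hits));
        System.out.println(intersectionCountReusing(scratch, facetA, hits));
        System.out.println(intersectionCountReusing(scratch, facetB, hits));
    }
}
```

The probing variant costs one bit lookup per stored id, so it suits facets
that match few documents; dense facets can keep using a bitset and its
intersection count directly.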

-- 
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Sat, Feb 7, 2009 at 10:57 AM, Raffaella Ventaglio
<r.ventaglio@gmail.com> wrote:

> Hi,
>
> I am trying to implement a kind of faceted search using Lucene 2.4.0.
>
> I have a list of configuration rules that tell me how to generate these
> facets and the corresponding queries (which can range from simple term
> queries to complex boolean queries).
>
> When my application starts, it creates the whole set of facet objects and
> initializes them.
> For each facet:
> - I create the query according to the configured rule;
> - I ask the reader for the bitset corresponding to that query and I store
> it in the Facet object;
> - I get the cardinality of the bitset and I save it in the Facet object as
> its "initial count".
>
> When the user does a search I have to update the "counts" associated with
> each Facet:
> - I get the bitset corresponding to the "query + filter" generated by the
> user search;
> - I get the cardinality of the ("search bitset" AND "facet bitset") and I
> save it as the updated count.
>
>
> In my first solution, I used only "OpenBitSetDISI" objects, both for Facet
> bitset and for search bitset.
> So I could use "intersectionCount" method to get updated counts after user
> search.
>
> This works very well and it is very fast, but when the number of documents
> in the index and the number of facets grow, it consumes too much memory.
>
>
> So I tried a different solution: when I create facet bitsets I use the same
> rule applied in ChainedFilter/BooleanFilter to decide if I have to store an
> OpenBitSet or a SortedVIntList.
> When I have to calculate updated counts:
> - if the facet has an OpenBitSet, I use the "intersectionCount" method
> directly;
> - if the facet has a SortedVIntList, I first create a new OpenBitSetDISI
> using the SortedVIntList.iterator and then I use the "intersectionCount"
> method.
>
> In this way, I use a smaller amount of memory at initialization time, but
> for each user search I create a large number of objects (that I immediately
> throw away) and this affects application performance because it wastes a
> lot of time doing GC.
>
> So my question is: is there a better way to accomplish this task?
>
> I think it would be fine if I could calculate "intersectionCount" directly
> on SortedVIntList objects, but I have not found anything like that in the
> Lucene 2.4 JavaDoc.
> Am I missing something?
>
>
> As a reference, my index currently contains more than 500,000 documents and
> I have to create/manage up to 50,000 facets.
> Using the "second solution", at initialization time my facets structure
> requires more or less 120MB (and this is good enough), while updating counts
> uses as much as 2GB of memory (and this is very bad).
>
> Thanks in advance,
> Raf
>
