lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Lucene Queries Over User-Editable Dynamic Categories of Documents
Date Wed, 24 Oct 2007 10:00:08 GMT
Given the volatility in the set membership I'd be tempted to keep that grouping info in a database
rather than doing the reader/writer-open/close dance in Lucene before you can see any updates.
(I suspect this is the reason you've opted not to keep the info in Lucene).
You can pull a user's list of a hundred or so terms out of the database (typically primary
keys) and add them as a TermsFilter to your Lucene queries.
I've found that using this approach can be pretty fast even with a large list of filter terms
- it was a while ago so I can't quote stats, you'll need to try it for yourself.

Caching these filters may prove useful but if it's a big dataset Bitsets don't sound like
a memory-efficient form of storing these lists as it sounds like they'll be sparsely populated.
You may be interested in the more memory-efficient options such as SortedVIntList here: http://issues.apache.org/jira/browse/LUCENE-584.

Without taking the whole of that patch on board you could have a caching strategy based on
this pseudocode:

getFilter(Set primaryKeys, IndexReader reader)
{
   TermsFilter tf= new TermsFilter()
   for all primaryKeys:
       tf.addTerm(primaryKey)
  BitSet bits;
  SortedVIntList cached=lruCachedMap.get(tf);
  if(cached==null)
        bits=tf.bits(reader)
        lruCachedMap.put(tf, convertBitsToSortedVIntList(bits))
  else
        bits=convertSortedVIntListToBits(bits)
  return new Filter()
       {
                 BitSet bits(IndexReader reader)
                 {
                     return bits;
                 }
       };
}


On a bit of a lucene-dev tangent, I think the above code has the makings of an optimisation
to CachingWrapperFilter - it could choose to cache SortedVIntLists or BitSets depending on
the sparseness of the set and transparently handles any required conversions.



----- Original Message ----
From: lucene user <luz290@gmail.com>
To: java-user@lucene.apache.org
Sent: Wednesday, 24 October, 2007 7:18:10 AM
Subject: Lucene Queries Over User-Editable Dynamic Categories of Documents

Folks!

We are building a web-based multi-user system. Users of our system are
 able
to categorize items that they have found into groups of related
 documents.
We would like users to be able to search these document groups and
 rapidly
find matches. Each user might have ten of these categories and might
 have
perhaps a few hundred documents in each. These categories might be
 highly
dynamic, with users adding and deleting documents from these categories
 many
times a day. How might we use Lucene to perform searches limited to
 these
very dynamic and end-user editable categories? Any ideas for how we
 might do
this efficiently?

If all the data were in a SQL database, we could run a subquery that
returned the IDs of the items in categories and use that to limit the
results of the super query.

Currently we do not plan to maintain the information about the
 end-user's
categories in the Lucene index at all, or not in a big, main Lucene
 index
anyway.

What our the reasonable options for handling this? What are the
 performance
implications of various choices?

Thanks!





      ___________________________________________________________
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message