lucene-general mailing list archives

From Lionel Duboeuf
Subject index per-user basis and document frequency
Date Mon, 15 Jun 2009 21:06:20 GMT

I use Lucene to index users' documents. I potentially have 2 million or
more users, so I don't think a per-user index would be a scalable
solution. All my searches are filtered on a user UID field.
As far as I know, the default similarity calculates Inverse Document
Frequency as follows:
 (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0)
where numDocs is the number of documents in the whole collection and
docFreq is the number of documents in the whole collection that contain
term t.
My problem is that this formula does not seem reliable for my system:
numDocs should correspond to the number of documents in the user's
collection, and docFreq to the number of documents in the user's
collection that contain term t.
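
To make the concern concrete, here is a minimal sketch in plain Java (no
Lucene classes) that evaluates the formula above twice: once with
collection-wide counts and once with counts restricted to one user. The
class name IdfSketch and all the counts are hypothetical and chosen only
for illustration.

```java
// Plain-Java sketch of the IDF expression quoted above.
// Nothing here calls Lucene; the class and method names are illustrative.
final class IdfSketch {

    // Same expression as the formula above, kept as a double for readability.
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: the term is rare across the whole index...
        long globalNumDocs = 1_000_000;
        long globalDocFreq = 50_000;
        // ...but appears in most of this one user's documents.
        long userNumDocs = 200;
        long userDocFreq = 150;

        System.out.println("global idf   = " + idf(globalDocFreq, globalNumDocs));
        System.out.println("per-user idf = " + idf(userDocFreq, userNumDocs));
    }
}
```

With these made-up numbers the collection-wide IDF comes out higher than
the per-user one, so a term that is common for this user would be scored
as if it were rare.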
Because each term is stored as a single token, I was thinking of
concatenating each term with the user's UID to keep them separate, since
the term "car" for user1 is different from the term "car" for user2. My
solution would index "carUSERUID1" and "carUSERUID2".
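
As a rough illustration of that idea, the sketch below builds the
concatenated tokens and counts document frequency per token in memory.
It is plain Java and does not use Lucene; the class UserScopedTerms, its
methods, and the three sample documents are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory sketch of per-user tokens; names and data are illustrative only.
final class UserScopedTerms {

    // Builds the per-user token, e.g. ("car", "USERUID1") -> "carUSERUID1".
    static String scopedTerm(String term, String userUid) {
        return term + userUid;
    }

    // For each scoped token, counts how many documents contain it.
    // Each document is a two-element array: { userUid, whitespace-separated text }.
    static Map<String, Integer> docFreqs(String[][] docs) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String[] doc : docs) {
            String uid = doc[0];
            // Collect distinct tokens so a repeated word counts once per document.
            Set<String> seen = new HashSet<>();
            for (String term : doc[1].toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    seen.add(scopedTerm(term, uid));
                }
            }
            for (String token : seen) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"USERUID1", "red car"},
            {"USERUID1", "blue car car"},
            {"USERUID2", "car wash"},
        };
        Map<String, Integer> freqs = docFreqs(docs);
        System.out.println("carUSERUID1 -> " + freqs.get("carUSERUID1")); // prints 2
        System.out.println("carUSERUID2 -> " + freqs.get("carUSERUID2")); // prints 1
    }
}
```

In this sketch the concatenation only makes docFreq per-user. The numDocs
value in the formula would still be the size of the whole index unless
it is handled separately, and plain concatenation could collide if a
term happens to end with text that looks like a UID, so a separator
character may be worth considering.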

What would you suggest?


