lucene-java-user mailing list archives

From Winton Davies <>
Subject Efficient doc information retrieval.
Date Wed, 14 Nov 2001 23:05:24 GMT
Hi all,
  Thanks for all your continuing help! I have got the go-ahead to 
build a production-level prototype of my project. I have to be able 
to serve several hundred queries a second (on big boxes), and I'm 
currently getting 2 or 3 seconds/query with a sloppy phrase match. I 
was profiling my user code, and I saw that it is the 
uniqification loop that is killing me.

  In my application, I have to be able to return a list of documents 
that has been uniqified according to an accountID: the most relevant 
document for an accountID is returned, and subsequent hits that 
have the same accountID are dropped.

  So, in a recent search of an 8 million document index, I got around 
200 sloppy phrase hits, and I needed to weed out the duplicates.

  So the code is roughly:

   Hashtable seen = new Hashtable();
   Vector resultSet = new Vector();
   for (int i = 0; i < hits.length() && resultSet.size() < 40; i++) {
     String accountID = hits.doc(i).get("accountID");
     if (seen.get(accountID) == null) {  // first (most relevant) hit wins
       seen.put(accountID, accountID);
       resultSet.addElement(hits.doc(i));
     }
   }
  I timed it, and I was getting about 60 msecs each time round that 
loop, which makes me suspect the doc(i).get().

  This seems to be really inefficient (the query is a sloppy phrase 
match). Any ideas how I can speed this up? I'm obviously going to 
try a RAMDirectory version, but a 60 msec delay per document still 
seems over the top?

I guess the short version of this is

  (a) Is there a way to do this uniqification somehow in the index itself?
  (b) Or have a special kind of field which is ultra-fast to access given "i"?
  (c) Or any way to speed up the existing behaviour!
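On (b), one sketch, outside of Lucene proper: read the accountID field once 
for every document at startup into a plain String array indexed by document 
number, so the per-query loop does an array lookup plus a hash probe instead 
of a stored-field read. The names here (accountIDByDoc, firstHitPerAccount) 
are illustrative, not Lucene API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UniqueByAccount {

    // accountIDByDoc[docId] is a hypothetical cache of the accountID field,
    // filled once at startup (e.g. by one pass over the index), so this
    // per-query loop never touches the stored fields.
    static List firstHitPerAccount(int[] hitDocIds, String[] accountIDByDoc, int max) {
        Set seen = new HashSet();
        List results = new ArrayList();
        for (int i = 0; i < hitDocIds.length && results.size() < max; i++) {
            String account = accountIDByDoc[hitDocIds[i]];
            if (seen.add(account)) {            // true only on first sighting
                results.add(new Integer(hitDocIds[i]));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Docs 0 and 2 share an account; hits are in relevance order.
        String[] cache = {"acct1", "acct2", "acct1", "acct3"};
        int[] hits = {0, 2, 1, 3};
        System.out.println(firstHitPerAccount(hits, cache, 40));  // [0, 1, 3]
    }
}
```

Filling the cache still costs one pass over the index (and memory for ~8M 
strings), but that cost is paid once at startup rather than 60 msec per hit 
per query.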


Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598

