lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Best way to find if a document exists, using Reader ...
Date Tue, 18 Jan 2005 02:22:04 GMT

: 1) Adding 250K documents took half an hour for lucene.
: 2) Deleting and adding same 250K documents took more than 50 minutes. In my
: test all 250K objects are new so there is nothing to delete.
:
: Looks like there is no other way to make it fast.

I bet you can find an improvement in the specific case -- put probably not
in the general case.

Let's summarize your current process in psuedo code:

   open an existing index (which may be empty)
   foreach item in very large set of items:
      id = item.getId()
      index.delete(id)
      index.add(id)

...except that 99% of the time, that delete isn't neccesary right?

so what if you traded space for time, and kept an in memory Cache of all
IDs in your index?

   open an existing index (which may be empty)
   cache = all ids in TermDoc iterator of id field;
   foreach item in very large set of items:
      id = item.getId()
      if cache.contains(id):
         index.delete(id)
      cache.add(id)
      index.add(id)

... assuming you have enough ram to keep a HashMap of every id in your
index arround, i'm fairly confident that would be faster then doing the
delete every time.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message