lucene-dev mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject RE: Polymorphic Index
Date Thu, 21 Oct 2010 22:32:04 GMT
From: Mark Harwood [markharw00d@yahoo.co.uk]
> Good point, Toke. Forgot about that. Of course doubling the number
> of hash algos used to 4 increases the space massively.

Maybe your hashing idea could work even with collisions?

Using your original two-hash suggestion, we are all but certain to get collisions. However,
we can still uniquely identify the right document, as the UID is also stored (search
for the hashes, iterate over the results and get the UID for each). When an update is requested
for an existing document, the indexer extracts the UIDs from all the documents that match
the hashes. Then it deletes by the hash terms and re-indexes all the documents that
were "false" collisions. As both the number of unique hash values and the number of hash
functions can be adjusted, this could be a nicely tweakable performance-vs-space trade-off.
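
A rough sketch of that update flow, written as a plain in-memory model in Python rather than real Lucene code. Everything named here (ToyIndex, hash_terms, NUM_BUCKETS, NUM_HASHES, the MD5-based hashing) is an assumption made up for illustration, and the `source` dict stands in for whatever outside store can return a document by UID.

```python
import hashlib

NUM_BUCKETS = 16  # assumed: number of unique values per hash function
NUM_HASHES = 2    # assumed: the two-hash variant


def hash_terms(uid, num_hashes=NUM_HASHES, num_buckets=NUM_BUCKETS):
    """Return one small hash term per hash function for the given UID."""
    terms = []
    for i in range(num_hashes):
        # Salt the UID with the function number to get independent hashes.
        digest = hashlib.md5(f"{i}:{uid}".encode("utf-8")).digest()
        terms.append((i, int.from_bytes(digest[:4], "big") % num_buckets))
    return tuple(terms)


class ToyIndex:
    """In-memory stand-in for an index that is searchable only by hash terms."""

    def __init__(self, source):
        self.source = source  # outside store: UID -> document content
        self.docs = []        # each entry keeps the UID as a stored value
        self.reindexed = 0    # counts re-indexed "false" collisions

    def add(self, uid, content):
        self.docs.append({"uid": uid, "terms": hash_terms(uid), "content": content})

    def search(self, terms):
        # Matches every document sharing all hash terms, collisions included.
        return [d for d in self.docs if d["terms"] == terms]

    def delete_by_terms(self, terms):
        # Deleting by hash terms also removes the colliding documents.
        self.docs = [d for d in self.docs if d["terms"] != terms]

    def update(self, uid, content):
        terms = hash_terms(uid)
        # 1. Collect the UIDs of the other documents matching the hashes.
        collided = [d["uid"] for d in self.search(terms) if d["uid"] != uid]
        # 2. Delete everything that matches the hash terms.
        self.delete_by_terms(terms)
        # 3. Re-create the false collisions from the outside store.
        for other in collided:
            self.add(other, self.source[other])
            self.reindexed += 1
        # 4. Index the new version of the requested document.
        self.source[uid] = content
        self.add(uid, content)
```

Lowering NUM_BUCKETS or NUM_HASHES in this model shrinks the term space but raises the number of documents re-indexed per update, which is the trade-off described above.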

This will only work if it is possible to re-create the documents from stored terms or by requesting
the data from outside of Lucene by UID. Is this possible with your setup, eks dev?