Jerry Jalenak wrote:
>Nice idea John - one I hadn't considered. Once you have the checksum, do
>you 'check' in the index first before storing the second document? Or do
>you filter on the query side?
>
>
I do a quick search for the MD5 checksum before indexing.
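
A minimal sketch of that check-before-indexing step. The class and method names (`ChecksumDeduper`, `md5Hex`, `indexIfNew`) are hypothetical, and an in-memory set stands in for the search against the index; in a real setup the lookup would presumably be a query on a stored checksum field.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: skips documents whose MD5 checksum has been seen before.
class ChecksumDeduper {
    // Assumption: this set stands in for the index; a real setup would
    // search a checksum field in the index instead.
    private final Set<String> indexedChecksums = new HashSet<>();

    // Computes the MD5 digest of the content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns true if the content was new and has now been recorded,
    // false if a document with the same checksum was already recorded.
    boolean indexIfNew(byte[] content) {
        // Set.add returns false when the checksum is already present.
        return indexedChecksums.add(md5Hex(content));
    }
}
```

Calling `indexIfNew` twice with the same bytes returns true the first time and false the second, so the duplicate never reaches the indexer.
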
Although I suspect it isn't applicable in your case, I also maintained a
"last time something was indexed" timestamp alongside the index. I used
this to drastically prune the number of documents that needed to be
considered for indexing if I restarted; anything modified before then
wasn't a candidate. Since the MD5 checksum provides the definitive (for
a sufficiently loose definition of definitive) indication of whether a
document is indexed, I didn't need ultra-fine granularity in the
timestamp, and I didn't need to worry about exactly when it was
committed to disk; it generally got committed to the magnetic stuff
every few seconds or so.
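
The restart pruning described above can be sketched roughly as follows. `RestartPruner` and `candidates` are hypothetical names, and the map of document names to last-modified times is an assumption standing in for whatever the document source provides. A slightly stale saved timestamp should only let a few extra documents through, which the checksum lookup then filters out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks the documents worth considering after a restart.
class RestartPruner {
    // Returns the names of documents modified at or after the saved
    // "last time something was indexed" timestamp. Anything modified
    // before that time is not a candidate.
    static List<String> candidates(Map<String, Long> lastModifiedByName,
                                   long lastIndexedMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastModifiedByName.entrySet()) {
            if (e.getValue() >= lastIndexedMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```
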
It does help a lot, though, if documents have nice unique identifiers
that you can use instead; then you can use the identifier and the last
modified time to decide whether or not to re-index.
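
A minimal sketch of the identifier-plus-timestamp decision, assuming the last modified time recorded at indexing is kept per identifier. `IdTimestampTracker`, `needsIndexing` and `markIndexed` are hypothetical names, and the in-memory map stands in for a lookup against the index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decides on re-indexing from identifier and modified time.
class IdTimestampTracker {
    // Assumption: maps each document identifier to the last modified
    // time it had when it was indexed.
    private final Map<String, Long> indexedModifiedTime = new HashMap<>();

    // True if the identifier is unknown, or the document has been
    // modified since the version that was indexed.
    boolean needsIndexing(String id, long lastModifiedMillis) {
        Long seen = indexedModifiedTime.get(id);
        return seen == null || lastModifiedMillis > seen;
    }

    // Records that the given version of the document has been indexed.
    void markIndexed(String id, long lastModifiedMillis) {
        indexedModifiedTime.put(id, lastModifiedMillis);
    }
}
```
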
jch