lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: document field updates
Date Thu, 01 Mar 2007 16:40:52 GMT
Erik Hatcher wrote:
>>
>> I'm pretty sure this has been done, I'm just not 100% sure where. Does
>> Nutch index link text?
>
> Nutch does do this sort of thing, but I'm not quite sure how.  It 
> isn't doing any operations to the Lucene index beyond what plain ol' 
> Lucene does.
>

Nutch maintains a set of separate DBs (using Hadoop 
MapFile/SequenceFile), where inlinks are stored (together with their 
anchor text). During indexing this data is pulled in from the DBs piece 
by piece using the URLs as "primary keys".
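
As a rough illustration of that URL-keyed lookup, here is a minimal sketch in plain Java. It does not use Nutch or Hadoop classes: `LinkDb`, `Inlink`, `buildDocument` and the field names are hypothetical stand-ins, and a sorted in-memory map takes the place of the on-disk MapFile.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model of the URL-keyed join; not Nutch's actual classes. */
final class InlinkJoinSketch {

    /** One incoming link: the page it comes from and its anchor text. */
    static final class Inlink {
        final String fromUrl;
        final String anchorText;

        Inlink(String fromUrl, String anchorText) {
            this.fromUrl = fromUrl;
            this.anchorText = anchorText;
        }
    }

    /** Stand-in for the separate link DB: a sorted map keyed by target URL. */
    static final class LinkDb {
        private final TreeMap<String, List<Inlink>> byTargetUrl = new TreeMap<>();

        void add(String targetUrl, Inlink inlink) {
            byTargetUrl.computeIfAbsent(targetUrl, k -> new ArrayList<>()).add(inlink);
        }

        List<Inlink> lookup(String targetUrl) {
            return byTargetUrl.getOrDefault(targetUrl, Collections.emptyList());
        }
    }

    /** Builds the fields of one document, pulling anchors from the DB by URL. */
    static Map<String, String> buildDocument(String url, String content, LinkDb linkDb) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);
        fields.put("content", content);
        StringBuilder anchors = new StringBuilder();
        for (Inlink inlink : linkDb.lookup(url)) {
            if (anchors.length() > 0) {
                anchors.append(' ');
            }
            anchors.append(inlink.anchorText);
        }
        // The anchor text becomes an ordinary field of the document being indexed.
        fields.put("anchor", anchors.toString());
        return fields;
    }

    public static void main(String[] args) {
        LinkDb db = new LinkDb();
        db.add("http://example.org/", new Inlink("http://a.example/", "Example home"));
        db.add("http://example.org/", new Inlink("http://b.example/", "click here"));
        System.out.println(buildDocument("http://example.org/", "body text", db));
    }
}
```

The point of the sketch is that the anchor text never has to be patched into an existing index entry: it is looked up by URL and written together with the page content when the document is indexed.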

Nutch doesn't update _any_ data structures in-place - all "update" 
operations involve creating new data files and optionally deleting old 
data files. This also applies to indexes - new indexes are created 
from newly updated pages, and the only change made to older indexes is 
that individual Lucene documents are deleted from them to get rid of 
duplicates. After a while, really old indexes are removed completely, 
because their content is likely to be present in one of the newer 
indexes.
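
The update scheme can be sketched in plain Java as follows. This is a toy model, not Nutch or Lucene code: `Segment`, `addSegment`, `expireOld` and the count-based expiry policy are assumptions made for illustration (the description above only says old indexes are removed "after a while").

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of "never update in place": new indexes plus deletes in old ones. */
final class SegmentedIndexSketch {

    /** An index that is written once; afterwards documents can only be deleted. */
    static final class Segment {
        private final Map<String, String> docsByUrl;
        private final Set<String> deleted = new HashSet<>();

        Segment(Map<String, String> docsByUrl) {
            this.docsByUrl = new LinkedHashMap<>(docsByUrl);
        }

        void markDeleted(String url) {
            if (docsByUrl.containsKey(url)) {
                deleted.add(url);
            }
        }

        String get(String url) {
            return deleted.contains(url) ? null : docsByUrl.get(url);
        }

        int liveCount() {
            return docsByUrl.size() - deleted.size();
        }
    }

    private final List<Segment> segments = new ArrayList<>(); // oldest first

    /** An "update": delete duplicates from older segments, then add a new one. */
    void addSegment(Map<String, String> freshDocs) {
        for (Segment older : segments) {
            for (String url : freshDocs.keySet()) {
                older.markDeleted(url);
            }
        }
        segments.add(new Segment(freshDocs));
    }

    /** Assumed policy: drop the oldest segments once more than maxSegments exist. */
    void expireOld(int maxSegments) {
        while (segments.size() > maxSegments) {
            segments.remove(0);
        }
    }

    /** Looks a URL up across all segments, newest first. */
    String lookup(String url) {
        for (int i = segments.size() - 1; i >= 0; i--) {
            String doc = segments.get(i).get(url);
            if (doc != null) {
                return doc;
            }
        }
        return null;
    }

    int liveDocCount() {
        int total = 0;
        for (Segment segment : segments) {
            total += segment.liveCount();
        }
        return total;
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Note that in this model, expiring an old segment also drops any page that was never re-fetched into a newer one, which is why removal only makes sense once the content is likely to be present elsewhere.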

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




