nutch-dev mailing list archives

From Doug Cutting
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Tue, 06 Feb 2007 17:59:31 GMT
Doğacan Güney wrote:
> OK, then should I go forward with this and implement something? This
> should be pretty easy, though I am not sure what to give as keys to a
> Parse[]. I mean, when getParse returned a single Parse, ParseSegment
> output it as <url, Parse>. But if getParse returns an array, what
> will be the key for each element?

Perhaps Parser#getParse could return a Map<String,Parse>, where the keys
are URLs?
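
A minimal sketch of what a map-valued return might look like. SimpleParse,
FeedItem and FeedParseSketch are hypothetical stand-ins invented for
illustration, not Nutch's actual Parse or Parser types, and the key URL
for each item is assumed to be chosen by the caller:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed document; NOT Nutch's Parse interface.
final class SimpleParse {
    final String title;
    final String text;

    SimpleParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

// Hypothetical feed item. keyUrl is the URL the caller wants this item
// stored under; how to pick it is a separate question.
final class FeedItem {
    final String keyUrl;
    final String title;
    final String description;

    FeedItem(String keyUrl, String title, String description) {
        this.keyUrl = keyUrl;
        this.title = title;
        this.description = description;
    }
}

final class FeedParseSketch {
    // Returns one entry for the feed itself plus one entry per item,
    // each keyed by a URL, instead of a single Parse or a Parse[].
    static Map<String, SimpleParse> parseFeed(String feedUrl,
                                              String feedTitle,
                                              List<FeedItem> items) {
        // LinkedHashMap keeps the feed first and items in feed order.
        Map<String, SimpleParse> parses = new LinkedHashMap<>();
        parses.put(feedUrl, new SimpleParse(feedTitle, ""));
        for (FeedItem item : items) {
            parses.put(item.keyUrl,
                       new SimpleParse(item.title, item.description));
        }
        return parses;
    }
}
```

With a return value of this shape, a caller such as ParseSegment could
presumably iterate over the entries and emit each one as <url, Parse>.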

> Something like <url#i, Parse[i]> may work, but this may cause problems
> in dedup (for example, assume we fetched the same RSS feed twice and
> indexed the results in different indexes. The two versions' url#0 may
> be different items, but since they have the same key, dedup will
> delete the older one).

If the feed contains unique ids for items, then that can be used to 
qualify the URL.  Otherwise one could use the hash of the link of the item.
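
As a rough sketch of that keying scheme: FeedItemKeys, keyFor and md5Hex
are hypothetical names, joining with '#' mirrors the url#i notation from
the quoted message, and MD5 is only one possible choice of hash. The id
is appended without escaping, which a real implementation would probably
want to handle:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of Nutch.
final class FeedItemKeys {

    // Qualifies the feed URL with the item's unique id when the feed
    // provides one, otherwise with a hash of the item's link.
    static String keyFor(String feedUrl, String itemId, String itemLink) {
        if (itemId != null && !itemId.isEmpty()) {
            // Assumption: the id is appended as-is, with no escaping.
            return feedUrl + "#" + itemId;
        }
        return feedUrl + "#" + md5Hex(itemLink);
    }

    // Lower-case hex MD5 of the UTF-8 bytes of s.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the key depends on the item rather than on its position in the
feed, fetching the same feed twice should give the same item the same
key, while different items get different keys.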

Since the target of the link must still be indexed separately from the
item itself, how much use is all this?  If the RSS document is
considered a single page that changes frequently, and the items' links
are considered ordinary outlinks, isn't much the same effect achieved?

