nutch-dev mailing list archives

From Doğacan Güney <dogacan.gu...@agmlab.com>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Mon, 05 Feb 2007 13:28:01 GMT
Doug Cutting wrote:
> Gal Nitzan wrote:
>> IMHO the data that is needed i.e. the data that will be fetched in 
>> the next fetch process is already available in the <item> element. 
>> Each <item> element represents one web resource. And there is no 
>> reason to go to the server and re-fetch that resource.
>
> Perhaps ProtocolOutput should change.  The method:
>
>   Content getContent();
>
> could be deprecated and replaced with:
>
>   Content[] getContents();
>
> This would require changes to the indexing pipeline.  I can't think of 
> any severe complications, but I haven't looked closely.

Since getProtocolOutput is called by Fetcher, the fetcher (actually, the 
underlying protocol plugin) would need to be aware that it is fetching 
an RSS feed, and would have to partially parse it in order to return an 
array of Contents.

I think it would make much more sense to change the parse plugins so 
that they take a Content and return Parse[] instead of a single Parse.
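
As a rough illustration of that idea, here is a minimal sketch in plain 
Java. FetchedContent, ParsedItem and parseFeed are hypothetical stand-ins 
invented for this example; they are not Nutch's Content, Parse or parser 
plugin API. The only point is the shape: one fetched feed goes in, one 
result per <item> comes out, and the protocol layer never has to know 
that it fetched a feed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MultiParseSketch {

    /** Hypothetical stand-in for fetched content (NOT Nutch's Content class). */
    static final class FetchedContent {
        final String url;
        final byte[] bytes;

        FetchedContent(String url, byte[] bytes) {
            this.url = url;
            this.bytes = bytes;
        }
    }

    /** Hypothetical stand-in for a single parse result (NOT Nutch's Parse). */
    static final class ParsedItem {
        final String url;
        final String title;
        final String text;

        ParsedItem(String url, String title, String text) {
            this.url = url;
            this.title = title;
            this.text = text;
        }
    }

    /** Parses one fetched feed and returns one ParsedItem per <item> element. */
    static List<ParsedItem> parseFeed(FetchedContent content) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(content.bytes));
            NodeList items = doc.getElementsByTagName("item");
            List<ParsedItem> parses = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                parses.add(new ParsedItem(
                        childText(item, "link"),
                        childText(item, "title"),
                        childText(item, "description")));
            }
            return parses;
        } catch (Exception e) {
            // Assumption for this sketch: any parse failure is reported unchecked.
            throw new IllegalArgumentException("Could not parse " + content.url, e);
        }
    }

    /** Returns the trimmed text of the first matching descendant, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<item><title>First</title><link>http://example.com/a</link>"
                + "<description>Text A</description></item>"
                + "<item><title>Second</title><link>http://example.com/b</link>"
                + "<description>Text B</description></item>"
                + "</channel></rss>";
        FetchedContent content = new FetchedContent(
                "http://example.com/feed.xml", feed.getBytes(StandardCharsets.UTF_8));
        for (ParsedItem p : parseFeed(content)) {
            System.out.println(p.url + " | " + p.title);
        }
    }
}
```

In this sketch each result carries the <link> of its item as its own URL, 
so a later indexing step could treat every item as a separate document 
without fetching it again.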

--
Doğacan Güney
>
> Could something like that work?
>
> Doug