nutch-dev mailing list archives

From "HUYLEBROECK Jeremy RD-ILAB-SSF" <jeremy.huylebro...@orange-ftgroup.com>
Subject FW: RSS-fecter and index individul-how can i realize this function
Date Thu, 08 Feb 2007 21:41:33 GMT

I am resending this message, as it apparently didn't go through.
(I keep mixing up my email addresses on the mailing list...)

-----Original Message-----
Sent: Friday, February 02, 2007 10:29 AM

Using Nutch 0.8, we modified the code from the fetching/parsing steps onward.
We have a different implementation of the Parse object and OutputFormat, which includes an
additional list of ParseData objects saved in an additional subfolder in the DFS.
We changed the indexing step a lot too, so we don't use the Nutch code there.


-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Friday, February 02, 2007 10:19 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function


Gal Nitzan wrote:
> IMHO the data that is needed i.e. the data that will be fetched in the next fetch process
> is already available in the <item> element. Each <item> element represents one
> web resource. And there is no reason to go to the server and re-fetch that resource.

Perhaps ProtocolOutput should change.  The method:

   Content getContent();

could be deprecated and replaced with:

   Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of
any severe complications, but I haven't looked closely.
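
A minimal sketch of how that might look, using simplified stand-in classes rather than the real Nutch ones. The names ProtocolOutputSketch, the two-field Content, and the example URLs are all hypothetical, and having the deprecated getContent() return the first entry is only an assumption about how backward compatibility could be kept:

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustration only: hypothetical stand-ins, NOT the real
 * org.apache.nutch.protocol classes.
 */
public class ProtocolOutputSketch {

    /** Simplified stand-in for Content: just a URL and the raw bytes. */
    static final class Content {
        private final String url;
        private final byte[] data;

        Content(String url, byte[] data) {
            this.url = url;
            this.data = data.clone();
        }

        String getUrl() { return url; }

        byte[] getData() { return data.clone(); }
    }

    /** Stand-in for a ProtocolOutput that can carry several Content objects. */
    static final class ProtocolOutput {
        private final Content[] contents;

        ProtocolOutput(Content... contents) {
            if (contents.length == 0) {
                throw new IllegalArgumentException("at least one Content is required");
            }
            this.contents = contents.clone();
        }

        /**
         * Old single-content accessor, kept so existing callers still compile.
         * Assumption: it returns the first entry.
         */
        @Deprecated
        Content getContent() { return contents[0]; }

        /** New accessor: one Content per resource, e.g. one per RSS item. */
        Content[] getContents() { return contents.clone(); }
    }

    public static void main(String[] args) {
        // One fetch of a feed yields one Content per <item>, with no re-fetch.
        ProtocolOutput out = new ProtocolOutput(
            new Content("http://example.com/item1",
                        "first item".getBytes(StandardCharsets.UTF_8)),
            new Content("http://example.com/item2",
                        "second item".getBytes(StandardCharsets.UTF_8)));

        for (Content c : out.getContents()) {
            System.out.println(c.getUrl() + " (" + c.getData().length + " bytes)");
        }
    }
}
```

Downstream steps that call getContent() today would keep working against the first entry, while the parsing and indexing steps could be moved over to loop on getContents().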

Could something like that work?

Doug

