nutch-dev mailing list archives

From Renaud Richardet <ren...@apache.org>
Subject Re: FW: RSS-fecter and index individul-how can i realize this function
Date Thu, 08 Feb 2007 22:15:47 GMT
HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> I am sending this message again, as it apparently didn't go through.
> (I keep mixing up my email addresses on the mailing list...)
>
> -----Original Message-----
> Sent: Friday, February 02, 2007 10:29 AM
>
> Using Nutch 0.8, we modified the code from the fetching/parsing steps onward.
> We have a different implementation of the Parse object and OutputFormat, including an
> additional list of ParseData objects saved in an additional subfolder in the DFS.
> We changed the indexing step a lot too, so we don't use the Nutch code there.
>   
Is your implementation similar to what we started at 
https://issues.apache.org/jira/browse/NUTCH-443? If you think some of 
your changes could be integrated, please post a patch there.

Thanks for sharing,
Renaud
>
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Friday, February 02, 2007 10:19 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this function
>
>
> Gal Nitzan wrote:
>   
>> IMHO the data that is needed, i.e. the data that would be fetched in the next fetch
>> process, is already available in the <item> element. Each <item> element represents
>> one web resource, and there is no reason to go to the server and re-fetch that resource.
>>     
>
> Perhaps ProtocolOutput should change.  The method:
>
>    Content getContent();
>
> could be deprecated and replaced with:
>
>    Content[] getContents();
>
> This would require changes to the indexing pipeline.  I can't think of
> any severe complications, but I haven't looked closely.
>
> Could something like that work?
>
> Doug
>
>
>   
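
For reference, here is a minimal sketch of the ProtocolOutput change Doug suggests above, combined with Gal's point that each <item> is its own web resource. It is an illustration only, not the Nutch code: the Content class is a hypothetical two-field stand-in, the constructor and the FeedExample.fromItems helper are invented for the example, and feed items are assumed to arrive as {link, description} pairs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Nutch's Content; only the fields needed here. */
final class Content {
    final String url;
    final String text;

    Content(String url, String text) {
        this.url = url;
        this.text = text;
    }
}

/** Sketch of a ProtocolOutput that can carry several Content objects. */
final class ProtocolOutput {
    private final Content[] contents;

    // Assumed constructor for this sketch: takes zero or more Content objects.
    ProtocolOutput(Content... contents) {
        this.contents = contents.clone();
    }

    /** Proposed accessor: one entry per resource found in the response. */
    Content[] getContents() {
        return contents.clone();
    }

    /** Old accessor kept for existing callers; returns the first entry, or null if empty. */
    @Deprecated
    Content getContent() {
        return contents.length == 0 ? null : contents[0];
    }
}

/** Hypothetical helper: turns already-parsed feed items into one ProtocolOutput. */
final class FeedExample {
    /** Each item is assumed to be a {link, description} pair taken from an <item> element. */
    static ProtocolOutput fromItems(List<String[]> items) {
        List<Content> out = new ArrayList<>();
        for (String[] item : items) {
            out.add(new Content(item[0], item[1]));
        }
        return new ProtocolOutput(out.toArray(new Content[0]));
    }
}
```

Keeping the deprecated getContent() as a thin wrapper over the array would let existing callers compile unchanged while the indexing pipeline is moved over to getContents().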


-- 
Renaud Richardet                                      +1 617 230 9112
my email is my first name at apache.org      http://www.oslutions.com

