nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: svn commit: r384219 - /lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
Date Wed, 08 Mar 2006 21:26:19 GMT
Stefan Groschupf wrote:
> I notice filtering urls is done in the output format until parsing. 
> Wouldn't it be better to filter it until updating crawlDb?

"Until" == "during" ?

As you observed, doing it at this stage saves space in segment data and 
consequently saves processing time: no CPU or I/O is spent on useless 
data, because junk is thrown away as early as possible.
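
To illustrate the point about discarding junk early, here is a minimal 
sketch in plain Java. It is not Nutch code: the class name 
EarlyUrlFilterSketch, the ACCEPT rule, and the filterOutlinks helper are 
all hypothetical, and the filter rule is an arbitrary example. It only 
shows the idea of applying the filter before anything is written, so 
rejected URLs never take up space or processing time later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not Nutch code: reject URLs before they are stored.
class EarlyUrlFilterSketch {

    // Assumed example rule: keep http(s) URLs that do not end in an image suffix.
    static final Predicate<String> ACCEPT = url ->
        (url.startsWith("http://") || url.startsWith("https://"))
            && !url.matches(".*\\.(gif|jpg|png)$");

    // Returns only the outlinks that pass the filter. Rejected URLs are
    // dropped here, so no later stage has to read or process them.
    static List<String> filterOutlinks(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (ACCEPT.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.org/page.html",
            "ftp://example.org/file.txt",
            "http://example.org/logo.gif");
        // Only the URLs that pass the filter would be written out.
        System.out.println(filterOutlinks(outlinks));
    }
}
```

Filtering later, for example while updating the crawlDb, would give the 
same final set of URLs but would carry the rejected ones through every 
intermediate step first.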

> Sure it would require to have some more disk space but since parsing 
> is done until fetching it may be improve fetching speed.

Parsing is not always done at the fetching stage (Fetcher.parsing == false).

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


