nutch-dev mailing list archives

From Andrzej Bialecki
Subject Re: CrawlDbReducer and the lone STATUS_SIGNATURE record
Date Sat, 29 Apr 2006 07:49:50 GMT
(redirected to nutch-dev) wrote:
> CrawlDbReducer#reduce doesn't have a switch case for 
> CrawlDatum.STATUS_SIGNATURE so we fall into the default (line #121) 
> block which throws a RuntimeException. This causes my update db job 
> to never succeed.
> This has just recently started happening.
> Enabling logging I see that what usually happens is that a CrawlDatum 
> with a STATUS_SIGNATURE status comes through first and is set to be 
> 'highest' (line #49) but then the next record through takes over the 
> 'highest' role because its status is higher, usually 'fetch_success' 
> or 'linked' in my case.
> But for reasons not clear to me, I'll sometimes have a lone CrawlDatum 
> with a status of STATUS_SIGNATURE (did a map output lose a record?) 
> with no following 'fetch_success' or 'linked' CrawlDatum. 
> This probably shouldn't fail the job.
> Attached is a patch that logs a warning and keeps going, but it is 
> probably not the right solution.

How weird, This Should Never Happen(tm) ... ;) A lost map output should 
show up in the logs, or perhaps should even have killed the job, 
shouldn't it? I'll apply your patch for now, but we need to keep an eye on this.
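
The situation in the quoted report can be sketched roughly as follows. This is a simplified, hypothetical stand-in in Java, not the actual CrawlDbReducer code: the class name, the numeric status values and the -1 "skipped" return convention are assumptions made for illustration. It only shows the idea of picking the highest status in a group and logging a warning, rather than throwing, when a STATUS_SIGNATURE record arrives alone.

```java
import java.util.List;
import java.util.logging.Logger;

// Simplified, hypothetical stand-in for the reducer logic discussed above.
// It is NOT the Nutch source; names and status values are illustrative only.
final class LoneSignatureSketch {
    private static final Logger LOG = Logger.getLogger("LoneSignatureSketch");

    // Illustrative status codes (assumed); only their relative order matters here.
    static final int STATUS_SIGNATURE = 0;
    static final int STATUS_FETCH_SUCCESS = 5;
    static final int STATUS_LINKED = 7;

    /** Returns the status chosen for a url, or -1 if the group is skipped. */
    static int reduce(String url, List<Integer> statuses) {
        if (statuses.isEmpty()) {
            throw new IllegalArgumentException("no records for " + url);
        }
        // The first record starts as 'highest'; any later record with a
        // higher status takes over that role.
        int highest = statuses.get(0);
        for (int s : statuses) {
            if (s > highest) {
                highest = s;
            }
        }
        switch (highest) {
            case STATUS_FETCH_SUCCESS:
            case STATUS_LINKED:
                return highest;
            case STATUS_SIGNATURE:
                // A signature record with no companion record: warn and
                // keep going instead of failing the whole job.
                LOG.warning("Lone STATUS_SIGNATURE record for " + url + ", skipping");
                return -1;
            default:
                throw new RuntimeException("Unknown status: " + highest);
        }
    }
}
```

With these assumed values, a group holding a signature record followed by a 'fetch_success' record resolves to 'fetch_success', while a group holding only the signature record is skipped with a warning.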

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
