nutch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Tue, 27 Feb 2007 18:30:06 GMT


Andrzej Bialecki  commented on NUTCH-443:

Overall the idea of this improvement looks very useful, but I'm -1 on this patch as it looks
right now (only now I had a chance to review it in detail). I'd like to see the following
issues addressed in this patch before it's committed:

* in my opinion it's easier to add missing CrawlDatum's (with correctly set fetch time) for
the new urls to the output rather than work-around this by passing around the fetch time in
metadata, and then again compensating in Indexer and CrawlDbReducer for the lack of these
fetchDatum-s ..

* in Fetcher / Fetcher2 you don't pass the signature in case when there is no valid Parse
output, but in the current versions of Fetchers the signature is still calculated and passed
in datum.setSignature() (which ends up in crawl_fetch).

* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated earlier,
especially since this Map requires special post-processing in ParseUtil.processParseMap -
and what would happen if I didn't use ParseUtil? I think this calls for a special-purpose
class (ParseResult?), which would encapsulate this behavior without exposing it to its users
(or even worse - allowing users to bypass it). This class would also help us to avoid somewhat
ugly "convenience" methods in ParseStatus and ParseImpl - these details would be hidden in
one of the constructors of ParseResult.

* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, since
in all further processing we need to create Text objects ...

* the new section in HtmlParseFilters breaks the loop on encountering the first error, and
leaves the parse results incompletely filtered. It should simply continue - the result is
an aggregation of more or less independent documents that are parsed on their own.

* the comment about redirects in is misplaced - I think this contract should be
both defined and enforced in the Fetcher.

And finally, I think this is a significant change in the way how content parsers work with
the rest of the framework, so we should wait with this patch after the 0.9 release - and we
should push 0.9 out of the door really soon ...

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>                 Key: NUTCH-443
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch, parse-map-core-untested.patch,
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
return multiple parse objects, that will all be indexed separately. Advantage: no need to
fetch all feed-items separately.
> see the discussion at

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message