nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Wed, 28 Feb 2007 15:01:01 GMT


Andrzej Bialecki  commented on NUTCH-443:

Almost there ... ParseResult seemed to tidy up this patch quite a bit. Remaining issues:

* you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set fetchTime to the current
time. This is incorrect - parsing may have been performed long after the content was fetched.
The correct place to create and store these "fake" CrawlDatum-s is in the FetcherThread.output(),
where you loop through Entry<Text, Parse>, i.e.:

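          long curTime = System.currentTimeMillis();
          for (Entry<Text, Parse> entry : parseResult) {
            Text k = entry.getKey();
            // store the parse for this url
            output.collect(k,
                new ObjectWritable(new ParseImpl(entry.getValue())));
            if (!k.equals(key)) {
              // sub-url: store a "fake" datum; presumably this is where
              // curTime is meant to be applied as its fetch time
              CrawlDatum fake = datum.clone();
              fake.setFetchTime(curTime);
              output.collect(k, new ObjectWritable(fake));
            } else {
              // save the real datum
              output.collect(k, new ObjectWritable(datum));
            }
          }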

* I'm pretty sure that ParseResult.filter() must NOT be called under normal circumstances
... We need to store the information that parsing was unsuccessful - if we remove this information
from the ParseResult we will never know that parsing failed for this content (or a part thereof).
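To illustrate the concern, here is a minimal sketch with hypothetical stand-in types (Outcome and KeepFailedParses are invented for this example and are not Nutch classes). It contrasts a filter that drops failed entries with simply keeping every entry and letting consumers inspect the status:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeepFailedParses {
  /** Hypothetical stand-in for a Parse together with its status. */
  static final class Outcome {
    final boolean success;
    final String text;

    Outcome(boolean success, String text) {
      this.success = success;
      this.text = text;
    }
  }

  /** Lossy variant: failed entries are removed, so the failure is never stored. */
  static Map<String, Outcome> filtered(Map<String, Outcome> all) {
    Map<String, Outcome> out = new LinkedHashMap<String, Outcome>();
    for (Map.Entry<String, Outcome> e : all.entrySet()) {
      if (e.getValue().success) {
        out.put(e.getKey(), e.getValue());
      }
    }
    return out;
  }

  /** Lossless variant: everything stays stored; failures can still be counted later. */
  static int countFailures(Map<String, Outcome> all) {
    int failures = 0;
    for (Outcome o : all.values()) {
      if (!o.success) {
        failures++;
      }
    }
    return failures;
  }
}
```

With the filtered map, a later job has no way to tell a failed sub-url from one that was never seen; with the unfiltered map the failure is still on record.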

* we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data created with
earlier versions of Nutch won't be compatible with the new format, and there is no versioning
information in the already existing data. We need to do one of the following:
  - bite the bullet, and don't care about backward compatibility - not so nice ... all existing
segments will have to be re-parsed. Ouch.
  - add look-ahead code that tests whether the data coming from DataInput contains this boolean
flag or a likely Text length - somewhat unreliable...
  - store this flag in ParseData.contentMeta - ugly hack.

Out of these three the last option seems the safest for now. From the long-term point of view
we should later on add versioning information and handling of different versions in Parse.
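For the long-term option, a leading version byte is a common way to let one reader handle several on-disk layouts. The following is only a sketch under assumed names: VersionedParseRecord, CUR_VERSION and the field layout are hypothetical and do not reflect the actual ParseImpl format. Note that it only helps for data written after a version byte is introduced; it cannot identify the unversioned segments that already exist.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record, not the real ParseImpl. */
class VersionedParseRecord {
  // Assumed numbering: version 1 has only text, version 2 adds the flag.
  static final byte CUR_VERSION = 2;

  String text = "";
  boolean canonical = true; // field introduced in version 2

  void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION); // version always comes first
    out.writeUTF(text);
    out.writeBoolean(canonical);
  }

  void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version < 1 || version > CUR_VERSION) {
      throw new IOException("Unsupported version: " + version);
    }
    text = in.readUTF();
    // Older records carry no flag; default to the original-url case.
    canonical = version >= 2 ? in.readBoolean() : true;
  }

  byte[] toBytes() {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      write(new DataOutputStream(buf));
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static VersionedParseRecord fromBytes(byte[] bytes) {
    try {
      VersionedParseRecord r = new VersionedParseRecord();
      r.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return r;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Demo helper: bytes as a hypothetical version-1 writer would have produced them. */
  static byte[] legacyV1Bytes(String text) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeByte(1);
      out.writeUTF(text);
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Reading the output of legacyV1Bytes through fromBytes yields a record with canonical set to true, while a record written by the current code round-trips its flag unchanged.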

* the name of this method Parse.isFetched is somewhat misleading - it's not about fetching
or not, it's whether this Parse corresponds to the original url or to a sub-url. Perhaps isCanonical,
isRoot, or some other name ...?

* in ParseSegment - what's the reason for creating a new copy of ParseImpl in this line below?
I think we should store the one we already have in "parse":

      output.collect(url, new ParseImpl(new ParseText(parse.getText()), 
                                        parse.getData(), parse.isFetched()));

Thank you for your perseverance!

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>                 Key: NUTCH-443
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
> NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch,
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple parse objects, that will all be indexed separately. Advantage: no need to
> fetch all feed-items separately.
> see the discussion at

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
