nutch-dev mailing list archives

From Doğacan Güney (JIRA) <>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Wed, 28 Feb 2007 15:28:57 GMT


Doğacan Güney commented on NUTCH-443:

> * you create the "fake" CrawlDatum-s in ParseOutputFormat, and then set fetchTime to
> the current time. This is incorrect - parsing may have been performed long after the
> content was fetched. The correct place to create and store these "fake" CrawlDatum-s
> is in the FetcherThread.output(), where you loop through Entry<Text, Parse>, i.e.:

What if I run my fetcher in non-parsing mode (which, coincidentally, is always the case for
me)? I can add the code to the fetcher, but it will still be wrong in parse. I guess I will
have to put FETCH_TIME_KEY back in. What do you think? Is there a better way to handle this?
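
To make the FETCH_TIME_KEY idea concrete, here is a minimal sketch, assuming the fetch time travels with the content metadata so that whichever stage does the parsing can read it back. A plain Map stands in for the metadata; the key value and both method names are hypothetical and are not the Nutch API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a plain Map plays the role of the content metadata.
class FetchTimeSketch {
    // Assumed key value, for illustration only; the real FETCH_TIME_KEY may differ.
    static final String FETCH_TIME_KEY = "_ftk_";

    // Called by the fetcher at the moment the content is stored.
    static void recordFetchTime(Map<String, String> meta, long fetchTimeMillis) {
        meta.put(FETCH_TIME_KEY, Long.toString(fetchTimeMillis));
    }

    // Called wherever the "fake" datum for a sub-url is created.
    // Uses the recorded fetch time if present, otherwise the supplied fallback.
    static long resolveFetchTime(Map<String, String> meta, long fallbackMillis) {
        String stored = meta.get(FETCH_TIME_KEY);
        if (stored == null) {
            return fallbackMillis;
        }
        try {
            return Long.parseLong(stored);
        } catch (NumberFormatException e) {
            return fallbackMillis;
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        recordFetchTime(meta, 1000L);
        // Parsing may happen much later; the stored value wins over the current clock.
        System.out.println(resolveFetchTime(meta, System.currentTimeMillis()));
    }
}
```

With something like this, the same lookup would work whether parsing runs inside the fetcher or as a separate step.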

> * I'm pretty sure that ParseResult.filter() must NOT be called under normal circumstances
> ... We need to store the information that parsing was unsuccessful - if we remove this
> information from the ParseResult we will never know that parsing failed for this content
> (or a part thereof).

The current code does not store unsuccessful parses. Take ParseSegment: it only produces
output if the parse status is success. So Nutch removes this information anyway; I just
changed the place where it is removed. My approach is cleaner (IMO), but I don't feel that
strongly about it, so I can change it.
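
For illustration, the kind of filtering under discussion can be sketched as follows. This is only a sketch under assumptions: ParseEntry and keepSuccessful are hypothetical names, not the ParseResult API from the patch, and a plain map of url to entry stands in for the real structure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ParseFilterSketch {
    // Hypothetical minimal parse record: extracted text plus a success flag.
    static final class ParseEntry {
        final String text;
        final boolean success;

        ParseEntry(String text, boolean success) {
            this.text = text;
            this.success = success;
        }
    }

    // Returns a new map holding only the successful entries; the input is left untouched.
    static Map<String, ParseEntry> keepSuccessful(Map<String, ParseEntry> parses) {
        Map<String, ParseEntry> kept = new LinkedHashMap<>();
        for (Map.Entry<String, ParseEntry> e : parses.entrySet()) {
            if (e.getValue().success) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, ParseEntry> parses = new LinkedHashMap<>();
        parses.put("http://example.com/feed#1", new ParseEntry("item one", true));
        parses.put("http://example.com/feed#2", new ParseEntry("", false));
        System.out.println(keepSuccessful(parses).keySet());
    }
}
```

Because the input map is not modified, a caller could still inspect the failed entries before discarding them, which is the concern raised in the review.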

> * we have a backward-compatibility issue with ParseImpl.isFetched - i.e. data created
> with earlier versions of Nutch won't be compatible with the new format, and there is no
> versioning information in the already existing data. We need to do one of the following:
>  - bite the bullet, and don't care about backward compatibility - not so nice ... all
>    existing segments will have to be re-parsed. Ouch.
>  - add look-ahead code to test the data coming from DataInput if it contains this boolean
>    flag or a likely Text length - somewhat unreliable...
>  - store this flag in ParseData.contentMeta - ugly hack.
>
> Out of these three the last option seems the safest for now. From the long-term point
> of view we should later on add versioning information and handling of different versions
> in Parse.

Parse (actually ParseImpl) is used as a temporary data structure to pass data from
to ParseSegment.reduce (or Fetcher.something but you get the point). So, unless someone stores
the temporary outputs of and wants to reduce them with this patch, I don't
see what can go wrong. ParseOutputFormat writes parse text and parse data, and doesn't care
about what else is in there.

> * the name of this method Parse.isFetched is somewhat misleading - it's not about fetching
> or not, it's whether this Parse corresponds to the original url or to a sub-url. Perhaps
> isCanonical, isRoot, or some other name ...?

Giving names to things is hard. Usually harder than creating them :). Will think of something.

> * in ParseSegment - what's the reason for creating a new copy of ParseImpl in this line
> below? I think we should store the one we already have in "parse":

That's because the return value of Parser.getParse is a Parse, not a ParseImpl, and Parse is
not Writable. So I take the non-Writable Parse and create a Writable ParseImpl from it.

This is almost certainly not necessary, though. I will check this and update the patch.
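
A minimal sketch of what avoiding the extra copy could look like, assuming the parser usually hands back an object that is already serializable. ParseView and WritableParse are hypothetical stand-ins for Parse and ParseImpl, and the write/readFields pair only mimics the Writable contract with java.io; none of this is the Nutch or Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ParseWrapSketch {
    // Hypothetical read-only view, standing in for Parse.
    interface ParseView {
        String getText();
    }

    // Hypothetical serializable implementation, standing in for ParseImpl.
    static final class WritableParse implements ParseView {
        private String text = "";

        WritableParse() { }

        // Copy the data out of any view.
        WritableParse(ParseView source) {
            this.text = source.getText();
        }

        @Override
        public String getText() {
            return text;
        }

        void write(DataOutput out) throws IOException {
            out.writeUTF(text);
        }

        void readFields(DataInput in) throws IOException {
            text = in.readUTF();
        }
    }

    // Reuse the instance when it is already serializable; copy only otherwise.
    static WritableParse wrap(ParseView parse) {
        return (parse instanceof WritableParse) ? (WritableParse) parse : new WritableParse(parse);
    }

    // Serialize to bytes and read back, to show the wrapper survives a round trip.
    static WritableParse roundTrip(WritableParse source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            WritableParse copy = new WritableParse();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ParseView plain = () -> "hello";
        WritableParse wrapped = wrap(plain);
        System.out.println(wrap(wrapped) == wrapped);
        System.out.println(roundTrip(wrapped).getText());
    }
}
```

The instanceof check is the design point: a copy is made only when the parser returns something that cannot be written directly.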

> Thank you for your perseverance!

Sure, I just want to get this patch out of the way, so I can bug you all with my other patches :).

I will not send another patch, since I need some guidance on 1. I don't think that 2 and 3
are issues (but feel free to prove me wrong), and 4-5 are easy to solve.

Thanks again for your review.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>                 Key: NUTCH-443
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch,
NUTCH-443.022507.patch.txt, NUTCH-443.02282007.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch,
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple parse objects, that will all be indexed separately. Advantage: no need to
> fetch all feed-items separately.
> see the discussion at

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
