nutch-dev mailing list archives

From Doğacan Güney (JIRA) <>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Wed, 14 Feb 2007 15:53:06 GMT


Doğacan Güney commented on NUTCH-443:


Thanks for taking the time to review this.

> The contract for ParseUtil.getFirstParseEntry() seems unclear - since in most cases this
> is a HashMap, there is no predictable way to get the first entry added to the map ...
> I propose also that we should use a specialized class instead of
> general-purpose Map; and then we can record in that class which entry was the first.

ParseUtil.getFirstParseEntry is only a convenience method. Plugins use it to get the first (and
only) entry in a map when they know they will create a one-entry parse map (with the original
url as the key). It is mostly used in a plugin's main method to get the parse and print
it. It is not used in any core part of Nutch.

Anyway, the name is quite misleading. What we meant was ParseUtil.getOnlyParseEntry. Hmm,
that doesn't make much sense either :D

Instead of creating a specialized class, how about removing the method and just using parseMap.get(key)?
Most plugins will use it like parseMap.get(content.getUrl()). 
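
As a rough illustration of the keyed lookup, here is a minimal sketch. It uses a plain Map<String, String> as a stand-in for the parse map. The class name ParseMapLookupSketch, the helper parseToMap, and the example url are hypothetical and not part of Nutch; real code would store Parse objects and take the key from content.getUrl().

```java
import java.util.HashMap;
import java.util.Map;

public class ParseMapLookupSketch {

    // Hypothetical stand-in for a parser that produces a one-entry map
    // keyed by the original url. Real Nutch code would store Parse objects.
    static Map<String, String> parseToMap(String url, String text) {
        Map<String, String> parseMap = new HashMap<>();
        parseMap.put(url, text);
        return parseMap;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.html"; // assumed example url
        Map<String, String> parseMap = parseToMap(url, "parsed text");

        // Look the entry up by its key instead of asking for the "first"
        // entry, which a HashMap does not define in any predictable way.
        String parse = parseMap.get(url);
        System.out.println(parse);
    }
}
```

With a keyed lookup the caller never depends on iteration order, so no special "first entry" bookkeeping is needed.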

> Also, the naming of some methods
> seems a bit awkward - why should we insist that we createSingleEntryMap while we create
> an ordinary Map, and we don't use this special-case knowledge later? I suggest to simply
> name it createParseMap.

You are right, I will change this in the next patch.

> In recent versions of Hadoop there is a GenericWritable class - it replaces ObjectWritable
> when classes are known in advance, and provides a more compact representation.

Didn't know this, will change this too. (Why is Nutch not using this class in Indexer?)
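
To make the "more compact representation" point concrete, here is a small self-contained sketch of the general idea: when the set of classes is known in advance, a record can be tagged with a one-byte index into that list rather than with a full class name. This does not use the Hadoop API; the class TypeTagSketch, its methods, and the class names in KNOWN_TYPES are all hypothetical stand-ins.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TypeTagSketch {

    // Hypothetical fixed list of classes known in advance.
    static final String[] KNOWN_TYPES = {
        "org.example.ParseTextStandIn",
        "org.example.ParseDataStandIn"
    };

    // Size in bytes of a type tag written as the full class name.
    static int sizeWithClassName(String className) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(className);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size in bytes of a type tag written as an index into KNOWN_TYPES.
    static int sizeWithIndex(String className) {
        for (int i = 0; i < KNOWN_TYPES.length; i++) {
            if (KNOWN_TYPES[i].equals(className)) {
                try {
                    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                    DataOutputStream out = new DataOutputStream(bytes);
                    out.writeByte(i);
                    out.flush();
                    return bytes.size();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        throw new IllegalArgumentException("Unknown type: " + className);
    }

    public static void main(String[] args) {
        String type = KNOWN_TYPES[0];
        System.out.println("class-name tag bytes: " + sizeWithClassName(type));
        System.out.println("index tag bytes: " + sizeWithIndex(type));
    }
}
```

The saving applies per record, so it adds up when many small values are written.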

> Changes to MapWritable must preserve old code values, at most adding some new ones -
> otherwise the new code will get
> confused when working with older data.

I see your point but I am not sure how to fix this. Since this patch removes the FetcherOutput
class, what should go there in its place? I guess we can just keep FetcherOutput as it is, and
update its javadoc to reflect the fact that it is not used anymore.
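
The compatibility concern can be sketched with a toy code table. This is only an illustration under assumed values: the class CodeTableSketch, the byte codes, and the name SomeNewClass are hypothetical and do not reflect the actual MapWritable table. The point is that a retired class keeps its old code, and new classes only get codes appended after the existing ones.

```java
import java.util.HashMap;
import java.util.Map;

public class CodeTableSketch {

    // Hypothetical code table: byte code -> class name.
    static final Map<Byte, String> CODE_TO_CLASS = new HashMap<>();

    static {
        CODE_TO_CLASS.put((byte) 1, "Text");          // existing entry, unchanged
        CODE_TO_CLASS.put((byte) 2, "FetcherOutput"); // no longer used, but keeps its code
        CODE_TO_CLASS.put((byte) 3, "SomeNewClass");  // new entry, appended at the end
    }

    // Resolve a code read from stored data back to a class name.
    static String classFor(byte code) {
        String name = CODE_TO_CLASS.get(code);
        if (name == null) {
            throw new IllegalArgumentException("Unknown code: " + code);
        }
        return name;
    }

    public static void main(String[] args) {
        // Data written before the change still resolves to the same class.
        System.out.println(classFor((byte) 2));
    }
}
```

If code 2 were reassigned to a different class, older data containing that code would be decoded as the wrong type, which is the confusion the review comment warns about.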

> CrawlDbReducer, TODO item: this should be the time stored under Nutch.FETCH_TIME_KEY,
> If I'm not mistaken, ParseUtil doesn't need the import of HashMap, only Map.

I will remove the TODO item and fix the imports in the next patch.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>                 Key: NUTCH-443
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch,
>                      NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch,
>                      parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple parse objects, that will all be indexed separately. Advantage: no need to
> fetch all feed-items separately.
> see the discussion at

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
