nutch-dev mailing list archives

From "nutch.newbie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
Date Fri, 09 Feb 2007 14:19:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471703 ]

nutch.newbie commented on NUTCH-443:
------------------------------------

I tried the patch with about 100 RSS feeds and ran into some problems:

1. The atom+xml content type gives trouble. I am not sure whether commons feedparser supports Atom 1.0.
2. In my case the RSS URL sometimes doesn't end with .xml or .rss, so some of the feeds got indexed the way current Nutch trunk does it, i.e. as HTML.
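
To illustrate the second problem, here is a minimal sketch of routing by content type first and by URL suffix only as a fallback. FeedSniffer and looksLikeFeed are hypothetical names, not part of Nutch or of the patch, and the set of feed types is an assumption limited to the two feed-specific media types.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of Nutch or the NUTCH-443 patch.
// Decides whether a fetched document should be handed to a feed parser.
final class FeedSniffer {
    // Assumption: only the two feed-specific media types are listed here.
    private static final Set<String> FEED_TYPES = Set.of(
        "application/rss+xml",
        "application/atom+xml");

    static boolean looksLikeFeed(String contentType, String url) {
        if (contentType != null) {
            // Drop parameters such as "; charset=utf-8" before comparing.
            String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (FEED_TYPES.contains(base)) {
                return true;
            }
        }
        if (url == null) {
            return false;
        }
        // Fallback: the URL-suffix check described in the comment above.
        String path = url.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        return path.endsWith(".rss") || path.endsWith(".xml");
    }
}
```

With this ordering, a feed served as application/atom+xml from a URL such as http://example.com/feed would still be recognised. Feeds served with a generic type such as text/xml and no telling suffix would still slip through, so a real fix would probably have to look at the document body as well.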

Just some early feedback. I will do some more testing this weekend. One question I do have:
the patch still doesn't solve the problem of indexing just the RSS feeds. Even if I take
away all my other parsers, I still need the HTML parser and index-basic. Maybe it's time for
index-rss?

Cheers

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can
> return multiple Parse objects that will all be indexed separately. Advantage: no need to
> fetch all feed items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
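
The idea in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions: ItemParse and FeedItem are hypothetical stand-ins, not Nutch's Parse class or the types used in the attached patches, and keying the map by item link is an assumption about what the String key would hold.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parse result; NOT Nutch's Parse class.
final class ItemParse {
    final String title;
    final String text;

    ItemParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

// Hypothetical stand-in for one entry of an already-parsed feed.
final class FeedItem {
    final String link;
    final String title;
    final String description;

    FeedItem(String link, String title, String description) {
        this.link = link;
        this.title = title;
        this.description = description;
    }
}

final class MultiParseSketch {
    // One fetch of the feed yields one parse result per item, keyed by the
    // item's link (an assumption), so each item can be indexed on its own.
    static Map<String, ItemParse> parseFeed(List<FeedItem> items) {
        Map<String, ItemParse> parses = new LinkedHashMap<>();
        for (FeedItem item : items) {
            if (item.link == null || item.link.isEmpty()) {
                continue; // nothing to key this item by
            }
            // Keep the first occurrence if a feed repeats a link.
            parses.putIfAbsent(item.link, new ItemParse(item.title, item.description));
        }
        return parses;
    }
}
```

A caller would then iterate over the map entries and hand each one to the indexer as if it were a separately fetched page.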

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

