nutch-dev mailing list archives

From "nutch.newbie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Mon, 12 Feb 2007 00:17:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Hi: 

I have now done my initial test run with 10 000+ feeds in 3 batches.

Batch 1
======
A total of 8000 feeds with URLs ending in ".rss", RSS feeds only. These work out of the box.

Batch 2
======
A total of 3000 Atom feeds with URLs ending in ".xml". Most of the time these throw an error
during the dedup process. Sometimes they get parsed by parse-html.

Batch 3
======
A total of 2000 feeds ending with all kinds of extensions, for example .aspx, .php, .jsp, .ece
and what not. These also throw errors, just like batch 2.
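
Since batches 2 and 3 differ from batch 1 mainly in the URL extension, one way to check whether a document is a feed regardless of extension is to look at its root element. The following is only a minimal sketch using the StAX API bundled with the JDK. The class FeedSniffer and its method sniff are hypothetical names for illustration and are not part of Nutch or the attached parse-feed plugin, and it is an assumption that the extension plays any role in the errors above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Nutch or parse-feed.
public class FeedSniffer {

    /** Returns "rss", "atom", "rdf" or "unknown" based on the root element name. */
    public static String sniff(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the document root.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String root = reader.getLocalName();
                        if ("rss".equals(root)) return "rss";   // RSS 0.9x / 2.0
                        if ("feed".equals(root)) return "atom"; // Atom
                        if ("RDF".equals(root)) return "rdf";   // RSS 1.0 (rdf:RDF)
                        return "unknown";
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML, so it is treated as "not a feed" here.
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(sniff("<feed xmlns=\"http://www.w3.org/2005/Atom\"/>"));
        System.out.println(sniff("<html><body/></html>"));
    }
}
```

A check like this only reads up to the first element, so it stays cheap even for large feeds.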

Batch 2 and Batch 3 produce the same bug as before. Note that I have run only 1 round of
fetch. One thing I am a bit confused about is the following. Let's say you have a feed with
5 items, i.e. 5 titles and 5 descriptions. Shouldn't a search such as url:feed.com return
6 results: 1 for the main feed page, which is the actual feed URL, and the other 5 for the
5 items? Currently I get only 1 search result, which is the feed URL.
Do I need to do 2 rounds of fetch? Things are getting parsed correctly, so maybe it's because
I don't have the indexing plugin, i.e. index-feed? I know we will work on it after NUTCH-443
is done, but I want to get a clarification, that's all :-) Cheers!


Some log trace from Batch 1
===================
2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing [http://rss.cnn.com/rss/cnn_marquee.rss] with [org.apache.nutch.parse.feed.FeedParser@f47af3]
2007-02-12 00:55:23,648 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-12 00:55:24,690 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,

2007-02-12 00:55:25,020 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html - http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html: falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,

2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...



> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??
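
To illustrate the StAX option from the list above, here is a rough sketch of streaming through a feed and collecting one title per RSS item or Atom entry without building a full document tree in memory. It uses only the javax.xml.stream API from the JDK. The class StaxFeedTitles and the method itemTitles are hypothetical names and not the attached parse-feed code. It assumes plain-text titles; a title containing child elements (for example an Atom title with type="xhtml") would make getElementText throw.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class for illustration; not the attached parse-feed plugin.
public class StaxFeedTitles {

    /** Collects the first <title> of each RSS <item> or Atom <entry>. */
    public static List<String> itemTitles(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted feeds.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        boolean inItem = false;     // currently inside an <item> or <entry>
        boolean titleTaken = false; // a title was already recorded for this item
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name) || "entry".equals(name)) {
                        inItem = true;
                        titleTaken = false;
                    } else if (inItem && !titleTaken && "title".equals(name)) {
                        // Reads the text and leaves the reader on </title>.
                        titles.add(reader.getElementText().trim());
                        titleTaken = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name) || "entry".equals(name)) {
                        inItem = false;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String rss = "<rss version=\"2.0\"><channel><title>Feed</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(rss));
    }
}
```

The channel-level title is skipped because it sits outside any item. For a feed with 5 items this would give 5 titles, i.e. 5 candidate entries in addition to the feed URL itself.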

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

