nutch-dev mailing list archives

From "nutch.newbie (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
Date Sun, 11 Feb 2007 12:50:05 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472099 ]

nutch.newbie commented on NUTCH-444:
------------------------------------

Otis:

Thanks for the info. For my part, I am going with parse-feed. I would also like to give a
StAX-based solution a try.

Dogacan: 

It's working rather well with parse-feed. However, I would be glad if you could do a quick
check of my parse-plugins.xml modifications, because this setup also throws an error during
dedup (when magic is false in nutch-site.xml). I would like to know whether it is something
I am doing wrong or some other bug.

I am thinking of doing a test run later tonight with 10,000 feeds, so I would be glad if you
could clarify the following case. (It only happens when there is just 1 URL.)

- urls.txt file contains 1 url, which is http://blog.foofactory.fi/atom.xml
- bin/nutch crawl with depth 1 gives me the following error during dedup

2007-02-11 13:32:26,846 WARN  mapred.LocalJobRunner - job_k9e9c2
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

During the parse phase, the same blog feed gives me the following:

2007-02-11 13:32:09,673 DEBUG http.Http - fetched 208 bytes from http://blog.foofactory.fi/robots.txt
2007-02-11 13:32:09,674 DEBUG http.Http - fetching http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,560 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,965 DEBUG parse.ParseUtil - Parsing [http://blog.foofactory.fi/atom.xml] with [org.apache.nutch.parse.feed.FeedParser@360771]
2007-02-11 13:32:11,292 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,

2007-02-11 13:32:11,627 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-11 13:32:11,654 WARN  fetcher.Fetcher - Error parsing: http://blog.foofactory.fi/atom.xml: failed(2,200): java.lang.NullPointerException
2007-02-11 13:32:12,293 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,

2007-02-11 13:32:12,306 DEBUG mapred.MapTask - opened spill0.out
2007-02-11 13:32:12,381 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,

Below are my parse-plugins.xml changes:

        <mimeType name="application/rss+xml">
                <plugin id="parse-feed" />
        </mimeType>

        <mimeType name="text/xml">
                <plugin id="parse-feed" />
        </mimeType>

        <alias name="parse-feed"
                extension-id="org.apache.nutch.parse.feed.FeedParser" />

I have also mapped text/xml in parse-feed/plugin.xml, because most of the time I get text/xml
rather than application/rss+xml as the content type. Also, you mentioned that you are using
this for testing: what does your test configuration look like? Can you reproduce my problem?

Thanks again for the plugin and many thanks for your help. I look forward to contributing
index-feed and query-feed.

> Possibly use a different library to parse RSS feed for improved performance and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??
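
To illustrate the StAX alternative listed above, here is a minimal sketch of how a feed
could be read as a stream of events instead of being converted to a full in-memory tree
first. It is only an illustration under stated assumptions, not the parse-feed plugin's
code: the class name StaxFeedTitles and the method entryTitles are made up for this
example, it handles only Atom 1.0, and it collects only plain-text entry titles.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of Nutch or parse-feed.
final class StaxFeedTitles {

    // Namespace URI of Atom 1.0 elements.
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Streams through the feed and returns the text of each <entry>/<title>.
    // Assumes titles are plain text; getElementText() throws if a title has
    // child elements (for example an XHTML title).
    static List<String> entryTitles(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not load DTDs or external entities from untrusted feeds.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        boolean inEntry = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && ATOM_NS.equals(reader.getNamespaceURI())) {
                    String name = reader.getLocalName();
                    if ("entry".equals(name)) {
                        inEntry = true;
                    } else if (inEntry && "title".equals(name)) {
                        // Reads the text and leaves the reader on </title>.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && ATOM_NS.equals(reader.getNamespaceURI())
                        && "entry".equals(reader.getLocalName())) {
                    inEntry = false;
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String feed = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<title>Example blog</title>"
                + "<entry><title>First post</title></entry>"
                + "<entry><title>Second post</title></entry>"
                + "</feed>";
        System.out.println(entryTitles(new StringReader(feed)));
    }
}
```

Because only the current event is held at any time, memory use should stay roughly
constant regardless of feed size. A real parser would also need RSS support, links,
dates, and content handling.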

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

