nutch-dev mailing list archives

From kauu <bab...@gmail.com>
Subject Re: RSS-fetcher and index individual - how can I realize this function
Date Mon, 05 Feb 2007 03:30:32 GMT
I've changed the code as you said, but I get the exception below.
Why? Is this an exception from the MD5Signature class?


2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
java.lang.NullPointerException
    at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
    at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)
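A likely cause, judging from the stack trace and the snippet quoted below: `getEntryContents()` can return null for an ordinary outlink that did not come from a feed `<item>`, so calling `length()` on it in ParseOutputFormat throws the NullPointerException. A minimal plain-Java sketch of the null guard (HashMap stands in for Nutch's MapWritable; the names here are illustrative, not Nutch API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: for an outlink that did not come from a feed <item>,
// entryContents is null, so it must be checked before calling length().
public class NullGuardSketch {
    public static Map<String, String> buildMeta(String entryContents) {
        Map<String, String> meta = new HashMap<String, String>();
        if (entryContents != null && entryContents.length() > 0) {
            meta.put("entryContents", entryContents);
        }
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(buildMeta(null).size());       // 0, no NPE
        System.out.println(buildMeta("an item").size());  // 1
    }
}
```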


On 2/3/07, Renaud Richardet <ren@oslutions.com> wrote:
>
> Gal, Chris, Kauu,
>
> So, if I understand correctly, you need a way to pass information along
> with the fetches, so that when Nutch fetches a feed entry, the <item>
> value previously fetched is available.
>
> This is how I tackled the issue:
> - extend Outlink.java to allow creating outlinks with more metadata,
> and use that in your feed parser when creating outlinks
> - pass the metadata on through ParseOutputFormat.java and Fetcher.java
> - retrieve the metadata in HtmlParser.java and use it
>
> This is very tedious, it will blow up the size of your outlinks db, it
> requires changes to Nutch's core code, etc. But this is the only way I
> came up with...
> If someone sees a better way, please let me know :-)
>
> Sample code, for Nutch 0.8.x :
>
> Outlink.java
> +  public Outlink(String toUrl, String anchor, String entryContents,
> +                 Configuration conf) throws MalformedURLException {
> +      this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
> +      this.anchor = anchor;
> +      this.entryContents = entryContents;
> +  }
> and update the other methods
>
> ParseOutputFormat.java, around line 140
> +            // set outlink info in metadata
> +            String entryContents = links[i].getEntryContents();
> +            if (entryContents.length() > 0) { // it's a feed entry
> +                MapWritable meta = new MapWritable();
> +                meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
> +                target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
> +                target.setMetaData(meta);
> +            } else {
> +                target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
> +            }
>
> Fetcher.java, around line 266
> +      // add feed info to metadata
> +      try {
> +          String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
> +          metadata.set("entryContents", entryContents);
> +      } catch (Exception e) { } // not found
>
> HtmlParser.java
>     // get entry metadata
>     String entryContents = content.getMetadata().get("entryContents");
>
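For what it's worth, the pass-through sketched above can be simulated end to end with plain collections (a sketch only: HashMap stands in for Nutch's MapWritable on the CrawlDatum and Metadata on the Content, and the method names are illustrative, not Nutch API):

```java
import java.util.HashMap;
import java.util.Map;

// Simulation of the metadata round trip: the parser attaches the <item>
// contents to the outlink's datum, the fetcher copies it into the
// fetched content's metadata, and the HTML parser reads it back.
public class MetaRoundTrip {
    // ParseOutputFormat side: stash the feed entry on the outlink's datum.
    public static Map<String, String> datumMeta(String entryContents) {
        Map<String, String> meta = new HashMap<String, String>();
        meta.put("entryContents", entryContents);
        return meta;
    }

    // Fetcher side: copy the value into the content metadata.
    public static Map<String, String> contentMeta(Map<String, String> datumMeta) {
        Map<String, String> meta = new HashMap<String, String>();
        String entry = datumMeta.get("entryContents");
        if (entry != null) { // absent for ordinary, non-feed outlinks
            meta.put("entryContents", entry);
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> datum = datumMeta("nutch nutch nutch ... nutch");
        // HtmlParser side: the <item> contents are available at parse time.
        System.out.println(contentMeta(datum).get("entryContents"));
    }
}
```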
> HTH,
> Renaud
>
>
>
> Gal Nitzan wrote:
> > Hi Chris,
> >
> > I'm sorry I wasn't clear enough. What I mean is that in the current
> > implementation:
> >
> > 1. The RSS (channels, items) page ends up as one Lucene document in the
> > index.
> > 2. Indeed, the links are extracted, and each <item> link will be fetched
> > in the next fetch as a separate page and will end up as its own Lucene
> > document.
> >
> > IMHO the data that is needed, i.e. the data that will be fetched in the
> > next fetch process, is already available in the <item> element. Each
> > <item> element represents one web resource, and there is no reason to go
> > to the server and re-fetch that resource.
> >
> > Another issue that arises from RSS feeds is that once the feed page is
> > fetched, you cannot re-fetch it until its "time to fetch" has expired.
> > Feeds' TTLs are usually very short. Since, for now, all pages in Nutch
> > are created equal :) it is one more thing to think about.
> >
> > HTH,
> >
> > Gal.
> >
> > -----Original Message-----
> > From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> > Sent: Thursday, February 01, 2007 7:01 PM
> > To: nutch-dev@lucene.apache.org
> > Subject: Re: RSS-fetcher and index individual - how can I realize this function
> >
> > Hi Gal, et al.,
> >
> >   I'd like to be explicit when we talk about what the issue with the RSS
> > parsing plugin is here; I think we have had conversations similar to this
> > before, and it seems that we keep talking around each other. I'd like to
> > get to the heart of this matter so that the issue (if there is an actual
> > one) gets addressed ;)
> >
> >   Okay, so you mention below that the thing you see missing from the
> > current RSS parsing plugin is the ability to store data in the
> > CrawlDatum and parse "it" in the next fetch phase. Well, there are 2
> > options here for what you refer to as "it":
> >
> >  1. If you're talking about the RSS file, then in fact it is parsed, and
> > its data is stored in the CrawlDatum, akin to any other form of content
> > that is fetched, parsed, and indexed.
> >
> >  2. If you're talking about the item links within the RSS file, in fact
> > they are parsed (eventually), and their data stored in the CrawlDatum,
> > akin to any other form of content that is fetched, parsed, and indexed.
> > This is accomplished by adding the RSS items as Outlinks when the RSS
> > file is parsed: in this fashion, we go after all of the links in the RSS
> > file and make sure that we index their content as well.
> >
> > Thus, if you had an RSS file R that contained links to a PDF file A and
> > another HTML page P, then not only would R get fetched, parsed, and
> > indexed, but so would A and P, because they are item links within R. Then
> > queries that would match R (the physical RSS file) would additionally
> > match things such as P and A, and all 3 would be capable of being
> > returned in a Nutch query. Does this make sense? Is this the issue that
> > you're talking about? Am I nuts? ;)
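The item-links-become-outlinks behavior described above can be sketched in a few lines of plain Java (regex extraction is for illustration only; parse-rss uses a real feed parser, and the class and URLs below are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Every <link> inside the feed becomes an outlink, so the fetcher will
// go after A (a PDF) and P (an HTML page) as well as the feed R itself.
public class ItemLinkSketch {
    public static List<String> itemLinks(String rss) {
        List<String> links = new ArrayList<String>();
        Matcher m = Pattern.compile("<link>(.*?)</link>").matcher(rss);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String r = "<rss><item><link>http://example.com/a.pdf</link></item>"
                 + "<item><link>http://example.com/p.html</link></item></rss>";
        System.out.println(itemLinks(r));
        // [http://example.com/a.pdf, http://example.com/p.html]
    }
}
```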
> >
> > Cheers,
> >   Chris
> >
> >
> >
> >
> > On 1/31/07 10:40 PM, "Gal Nitzan" <gnitzan@usa.net> wrote:
> >
> >
> >> Hi,
> >>
> >> Many sites provide RSS feeds for several reasons, usually to save
> >> bandwidth, to give users concentrated data, and so forth.
> >>
> >> Some of the RSS files supplied by sites are created specially for
> >> search engines, where each RSS "item" represents a web page on the
> >> site.
> >>
> >> IMHO the only thing "missing" in the parse-rss plugin is storing the
> >> data in the CrawlDatum and "parsing" it in the next fetch phase. Maybe
> >> adding a new flag to CrawlDatum that would mark the URL as "parsable",
> >> not "fetchable"?
> >>
> >> Just my two cents...
> >>
> >> Gal.
> >>
> >> -----Original Message-----
> >> From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> >> Sent: Wednesday, January 31, 2007 8:44 AM
> >> To: nutch-dev@lucene.apache.org
> >> Subject: Re: RSS-fetcher and index individual - how can I realize this function
> >>
> >> Hi there,
> >>
> >>   With the explanation that you give below, it seems like parse-rss as
> >> it exists would address what you are trying to do. parse-rss parses an
> >> RSS channel as a set of items and indexes overall metadata about the
> >> RSS file, including parse text and index data, but it also adds each
> >> item's URL (in the channel) as an Outlink, so that Nutch will process
> >> those pieces of content as well. The only thing that you suggest below
> >> that parse-rss currently doesn't do is allow you to associate the
> >> metadata fields category: and author: with the item Outlink...
> >>
> >> Cheers,
> >>   Chris
> >>
> >>
> >>
> >> On 1/30/07 7:30 PM, "kauu" <babatu@gmail.com> wrote:
> >>
> >>
> >>> Thanks for your reply.
> >>> Maybe I didn't explain clearly. I want to index each item as an
> >>> individual page. Then when I search for something, for example
> >>> "nutch-open source", Nutch returns a hit which contains:
> >>>
> >>>    title : nutch-open source
> >>>    description : nutch nutch nutch ....nutch nutch
> >>>    url : http://lucene.apache.org/nutch
> >>>    category : news
> >>>    author : kauu
> >>>
> >>> So, can the plugin parse-rss satisfy what I need?
> >>>
> >>> <item>
> >>>     <title>nutch--open source</title>
> >>>     <description>
> >>>        nutch nutch nutch ....nutch nutch
> >>>     </description>
> >>>     <link>http://lucene.apache.org/nutch</link>
> >>>     <category>news</category>
> >>>     <author>kauu</author>
> >>> </item>
> >>
> >> On 1/31/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
> >>>
> >>> Hi there,
> >>>
> >>> I could most likely be of assistance if you gave me some more
> >>> information. For instance, I'm wondering if the use case you describe
> >>> below is already supported by the current RSS parse plugin?
> >>>
> >>> The current RSS parser, parse-rss, does in fact index individual items
> >>> that are pointed to by an RSS document. The items are added as Nutch
> >>> Outlinks and added to the overall queue of URLs to fetch. Doesn't this
> >>> satisfy what you mention below? Or am I missing something?
> >>>
> >>> Cheers,
> >>>   Chris
> >>>
> >>>
> >>> On 1/30/07 6:01 PM, "kauu" <babatu@gmail.com> wrote:
> >>>
> >>>
> >>>> Hi folks:
> >>>>
> >>>>    What I want to do is to separate an RSS file into several pages.
> >>>>    Just as has been discussed before, I want to fetch an RSS page and
> >>>> index it as different documents in the index, so the searcher can
> >>>> search an item's info as an individual hit.
> >>>>    My idea is to create a protocol for fetching the RSS page and
> >>>> storing it as several pages, each containing just one ITEM tag. But
> >>>> the unique key is the URL, so how can I store them with the ITEM's
> >>>> link tag as the unique key for a document?
> >>>>    So my question is how to realize this function in Nutch 0.8.x.
> >>>>    I've checked the code of the protocol-http plug-in, but I can't
> >>>> find the code where a page is stored to a document. I want to separate
> >>>> the RSS page into several documents before storing it, not just one.
> >>>>    So can anyone give me some hints?
> >>>> Any reply will be appreciated!
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>   ITEM's structure
> >>>>
> >>>>  <item>
> >>>>     <title>Snowstorm strikes Europe late, causing flight delays and
> >>>> traffic chaos (photos)</title>
> >>>>     <description>A snowstorm swept across Europe, causing repeated
> >>>> flight delays. On January 24, several airliners sat at Stuttgart
> >>>> airport in Germany waiting to have ice and snow removed from their
> >>>> fuselages, and workers cleared snow from a runway at Munich airport in
> >>>> southern Germany. Reportedly, the late-arriving snowstorm swept across
> >>>> for two consecutive days...</description>
> >>>>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> >>>>     <category>Sohu Focus Photo News</category>
> >>>>     <author>cms@sohu.com</author>
> >>>>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >>>>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> >>>> </item>
> >>>
> >
> > ______________________________________________
> > Chris A. Mattmann
> > Chris.Mattmann@jpl.nasa.gov
> > Staff Member
> > Modeling and Data Management Systems Section (387)
> > Data Management Systems and Technologies Group
> >
> > _________________________________________________
> > Jet Propulsion Laboratory            Pasadena, CA
> > Office: 171-266B                        Mailstop:  171-246
> > _______________________________________________________
> >
> > Disclaimer:  The opinions presented within are my own and do not reflect
> > those of either NASA, JPL, or the California Institute of Technology.
> >
> >
> >
> >
> >
> >
>
>
> --
> renaud richardet                           +1 617 230 9112
> renaud <at> oslutions.com         http://www.oslutions.com
>
>


-- 
www.babatu.com