nutch-dev mailing list archives

From kauu <bab...@gmail.com>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Fri, 02 Feb 2007 01:42:10 GMT
Hi all,
  What Gal said is exactly what I need from the RSS parsing.
  I just want to fetch the RSS seeds once,
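
The requirement discussed in this thread (one index document per RSS <item>, keyed by the item's <link>, without re-fetching each link) can be sketched outside of Nutch. The following is only a minimal Python illustration under stated assumptions: it does not use any Nutch API, and the names SAMPLE_FEED, FIELDS and split_feed, as well as the example.org URLs, are hypothetical. It uses only the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, modelled on the <item> quoted later in this thread.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example channel</title>
    <link>http://example.org/feed.xml</link>
    <item>
      <title>nutch--open source</title>
      <description>nutch nutch nutch</description>
      <link>http://lucene.apache.org/nutch</link>
      <category>news</category>
      <author>kauu</author>
    </item>
    <item>
      <title>second item</title>
      <description>another entry</description>
      <link>http://example.org/second</link>
      <category>news</category>
      <author>kauu</author>
    </item>
  </channel>
</rss>
"""

# Item fields that should travel with each per-item document (an assumption
# based on the fields requested in this thread).
FIELDS = ("title", "description", "category", "author")


def split_feed(xml_text):
    """Return {item link: {field: text}} with one entry per <item>."""
    root = ET.fromstring(xml_text)
    documents = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # no unique key, so the item cannot become its own document
        documents[link] = {
            name: (item.findtext(name) or "").strip() for name in FIELDS
        }
    return documents


if __name__ == "__main__":
    for url, fields in split_feed(SAMPLE_FEED).items():
        print(url, fields)
```

Each returned entry corresponds to what would be a separate hit: the item's link is the unique key, and title, description, category and author are taken from the feed itself rather than from a second fetch of the linked page.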



On 2/2/07, Gal Nitzan <gnitzan@usa.net> wrote:
>
>
> Hi Chris,
>
> I'm sorry I wasn't clear enough. What I mean is that in the current
> implementation:
>
> 1. The RSS (channels, items) page ends up as one Lucene document in the
> index.
> 2. Indeed the links are extracted and each <item> link will be fetched in
> the next fetch as a separate page and will end up as one Lucene document.
>
> IMHO the data that is needed i.e. the data that will be fetched in the
> next fetch process is already available in the <item> element. Each <item>
> element represents one web resource. And there is no reason to go to the
> server and re-fetch that resource.
>
> Another issue that arises from RSS feeds is that once the feed page is
> fetched you cannot re-fetch it until its "time to fetch" has expired. A
> feed's TTL is usually very short. Since, for now, all pages in Nutch are
> created equal :) it is one more thing to think about.
>
> HTH,
>
> Gal.
>
> -----Original Message-----
> From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> Sent: Thursday, February 01, 2007 7:01 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this
> function
>
> Hi Gal, et al.,
>
>   I'd like to be explicit when we talk about what the issue with the RSS
> parsing plugin is here; I think we have had conversations similar to this
> before and it seems that we keep talking around each other. I'd like to
> get to the heart of this matter so that the issue (if there is an actual
> one) gets addressed ;)
>
>   Okay, so you mention below that the thing that you see missing from the
> current RSS parsing plugin is the ability to store data in the CrawlDatum,
> and parse "it" in the next fetch phase. Well, there are 2 options here for
> what you refer to as "it":
>
> 1. If you're talking about the RSS file, then in fact, it is parsed, and
> its data is stored in the CrawlDatum, akin to any other form of content
> that is fetched, parsed and indexed.
>
> 2. If you're talking about the item links within the RSS file, in fact,
> they are parsed (eventually), and their data stored in the CrawlDatum,
> akin to any other form of content that is fetched, parsed, and indexed.
> This is accomplished by adding the RSS items as Outlinks when the RSS
> file is parsed: in this fashion, we go after all of the links in the RSS
> file, and make sure that we index their content as well.
>
> Thus, if you had an RSS file R that contained links in it to a PDF file A,
> and another HTML page P, then not only would R get fetched, parsed, and
> indexed, but so would A and P, because they are item links within R. Then
> queries that would match R (the physical RSS file), would additionally
> match things such as P and A, and all 3 would be capable of being returned
> in a Nutch query. Does this make sense? Is this the issue that you're
> talking about? Am I nuts? ;)
>
> Cheers,
>   Chris
>
>
>
>
> On 1/31/07 10:40 PM, "Gal Nitzan" <gnitzan@usa.net> wrote:
>
> > Hi,
> >
> > Many sites provide RSS feeds for several reasons, usually to save
> > bandwidth, to give the users concentrated data and so forth.
> >
> > Some of the RSS files supplied by sites are created specially for search
> > engines where each RSS "item" represents a web page in the site.
> >
> > IMHO the only thing "missing" in the parse-rss plugin is storing the
> > data in the CrawlDatum and "parsing" it in the next fetch phase. Maybe
> > adding a new flag to CrawlDatum, that would flag the URL as "parsable"
> > not "fetchable"?
> >
> > Just my two cents...
> >
> > Gal.
> >
> > -----Original Message-----
> > From: Chris Mattmann [mailto:chris.mattmann@jpl.nasa.gov]
> > Sent: Wednesday, January 31, 2007 8:44 AM
> > To: nutch-dev@lucene.apache.org
> > Subject: Re: RSS-fecter and index individul-how can i realize this
> > function
> >
> > Hi there,
> >
> >   With the explanation that you give below, it seems like parse-rss as
> > it exists would address what you are trying to do. parse-rss parses an
> > RSS channel as a set of items, and indexes overall metadata about the RSS
> > file, including parse text, and index data, but it also adds each item
> > (in the channel)'s URL as an Outlink, so that Nutch will process those
> > pieces of content as well. The only thing that you suggest below that
> > parse-rss currently doesn't do, is to allow you to associate the metadata
> > fields category:, and author: with the item Outlink...
> >
> > Cheers,
> >   Chris
> >
> >
> >
> > On 1/30/07 7:30 PM, "kauu" <babatu@gmail.com> wrote:
> >
> >> Thanks for your reply.
> >> Maybe I didn't explain clearly.
> >> I want to index each item as an individual page. Then, when I search
> >> for something, for example "nutch-open source", Nutch returns a hit
> >> which contains:
> >>
> >>    title       : nutch-open source
> >>    description : nutch nutch nutch ....nutch  nutch
> >>    url         : http://lucene.apache.org/nutch
> >>    category    : news
> >>    author      : kauu
> >>
> >> So, can the parse-rss plugin satisfy what I need?
> >>
> >> <item>
> >>     <title>nutch--open source</title>
> >>     <description>
> >>         nutch nutch nutch ....nutch nutch
> >>     </description>
> >>     <link>http://lucene.apache.org/nutch</link>
> >>     <category>news</category>
> >>     <author>kauu</author>
> >
> >
> >
> >> On 1/31/07, Chris Mattmann <chris.mattmann@jpl.nasa.gov> wrote:
> >>
> >> Hi there,
> >>
> >> I could most likely be of assistance, if you gave me some more
> >> information. For instance: I'm wondering if the use case you describe
> >> below is already supported by the current RSS parse plugin?
> >>
> >> The current RSS parser, parse-rss, does in fact index individual items
> >> that are pointed to by an RSS document. The items are added as Nutch
> >> Outlinks, and added to the overall queue of URLs to fetch. Doesn't this
> >> satisfy what you mention below? Or am I missing something?
> >>
> >> Cheers,
> >>   Chris
> >>
> >>
> >>
> >> On 1/30/07 6:01 PM, "kauu" <babatu@gmail.com> wrote:
> >>
> >>> Hi folks,
> >>>
> >>>   What I want to do is separate an RSS file into several pages.
> >>>
> >>>   Just as has been discussed before, I want to fetch an RSS page and
> >>> index it as different documents in the index, so that the searcher can
> >>> search each item's info as an individual hit.
> >>>
> >>>   My idea is to create a protocol that fetches the RSS page and stores
> >>> it as several pages, each containing just one ITEM tag. But the unique
> >>> key is the URL, so how can I store them with the ITEM's link tag as the
> >>> unique key for a document?
> >>>
> >>>   So my question is how to realize this function in nutch-0.8.x.
> >>>
> >>>   I've checked the code of the protocol-http plug-in, but I can't find
> >>> the code that stores a page to a document. I want to separate the RSS
> >>> page into several pages before storing it, so that it is stored not as
> >>> one document but as several.
> >>>
> >>>   So can anyone give me some hints?
> >>>
> >>> Any reply will be appreciated!
> >>>
> >>>
> >>>
> >>>
> >>
> >>>
> >>>   ITEM's structure
> >>>
> >>>  <item>
> >>>     <title>Late-striking blizzard in Europe causes flight delays and traffic chaos (photos)</title>
> >>>     <description>A blizzard sweeps across Europe, causing multiple
> >>> flight delays. On January 24, several airliners waited at Stuttgart
> >>> Airport in Germany to have ice and snow removed from their fuselages.
> >>> On January 24, workers cleared snow from the runways at Munich Airport
> >>> in southern Germany. According to reports, the belated blizzard has
> >>> for two days in a row swept across central...
> >>>     </description>
> >>>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> >>>     <category>Sohu Focus Picture News</category>
> >>>     <author>cms@sohu.com</author>
> >>>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >>>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> >>>  </item>
> >>>
> >>>
> >>
> >>>
> >>
> >>
> >>
> >
>
> ______________________________________________
> Chris A. Mattmann
> Chris.Mattmann@jpl.nasa.gov
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
>
>
>
>


-- 
www.babatu.com