nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 07 Feb 2007 21:10:30 GMT
Doug, Renaud,

 Got it. So the logic behind this is: why bother waiting until the
following fetch to parse (and create ParseData objects from) the RSS items
in the feed? Okay, I get it, assuming that the RSS feed has *all* of the
RSS metadata in it. However, it's perfectly acceptable for a feed's items
to carry only a title, description, and link. I guess that is still
valuable metadata to have. The caveats are the implications of the
proposed change:

1. We won't have cached or fetched copies of the Content represented by
the item links. Therefore, in this model, we won't be able to pull up a
Nutch cache of the page corresponding to the RSS item, because we are
circumventing the fetch step.

2. It sounds like a pretty fundamental API shift in Nutch to support a
single type of content, RSS. Even if there are more content types that
follow this model, as Doug and Renaud both pointed out, there aren't a
multitude of them (perhaps archive files, but can you think of any others?).

The other main thing that comes to mind is that this prevents the Content
behind the RSS item links from providing useful metadata, because that
content is never explicitly fetched. What if we wanted to apply some super
cool metadata extractor X that used word-stemming, HTML design analysis,
and other techniques to extract metadata from the content pointed to by an
RSS item link? In the proposed model, we assume that the RSS XML item tag
already contains all the metadata necessary for indexing, which, in my
mind, limits the model. Does what I am saying make sense? I'm not shooting
down the issue, I'm just trying to brainstorm a bit about it.
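
To make the trade-off concrete, here is a rough Java sketch of the
one-Content-to-many-Parses idea as I understand it. The names FeedItem,
ParseSketch, and parseFeed are made up for illustration; they are not the
actual Nutch API. The point is only that each parse gets its own URL, and
that the only data available is what the feed item itself carries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one <item> in an RSS feed (not a Nutch class).
final class FeedItem {
    final String title;
    final String description;
    final String link;

    FeedItem(String title, String description, String link) {
        this.title = title;
        this.description = description;
        this.link = link;
    }
}

// Hypothetical stand-in for a parse result that is indexed under its own URL.
final class ParseSketch {
    final String url;
    final String title;
    final String text;
    final String sourceFeedUrl;

    ParseSketch(String url, String title, String text, String sourceFeedUrl) {
        this.url = url;
        this.title = title;
        this.text = text;
        this.sourceFeedUrl = sourceFeedUrl;
    }
}

class MultiParseSketch {
    // One fetched feed yields one parse per item, keyed by the item's link.
    static List<ParseSketch> parseFeed(String feedUrl, List<FeedItem> items) {
        List<ParseSketch> parses = new ArrayList<>();
        for (FeedItem item : items) {
            // Only the feed's own fields are used here. The page behind
            // item.link is never fetched, so nothing from it is available.
            parses.add(new ParseSketch(item.link, item.title,
                                       item.description, feedUrl));
        }
        return parses;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
            new FeedItem("First", "Summary one", "http://example.com/a"),
            new FeedItem("Second", "Summary two", "http://example.com/b"));
        for (ParseSketch p : parseFeed("http://example.com/feed.xml", items)) {
            System.out.println(p.url + " <- " + p.sourceFeedUrl);
        }
    }
}
```

Note that ParseSketch has nowhere to put the fetched page for an item,
which is exactly the cached-copy and metadata-extractor concern above.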

Cheers,
  Chris





On 2/7/07 11:11 AM, "Doug Cutting" <cutting@apache.org> wrote:

> Chris Mattmann wrote:
>>  Sorry to be so thick-headed, but could someone explain to me in really
>> simple language what this change is requesting that is different from the
>> current Nutch API? I still don't get it, sorry...
> 
> A Content would no longer generate a single Parse.  Instead, a Content
> could potentially generate many Parses.  For most types of content,
> e.g., HTML, each Content would still generate a single Parse.  But for
> RSS, a Content might generate multiple Parses, each indexed separately
> and each with a distinct URL.
> 
> Another potential application could be processing archives: the parser
> could unpack the archive and each item in it indexed separately rather
> than indexing the archive as a whole.  This only makes sense if each
> item has a distinct URL, which it does in RSS, but it might not in an
> archive.  However some archive file formats do contain URLs, like that
> used by the Internet Archive.
> 
> http://www.archive.org/web/researcher/ArcFileFormat.php
> 
> Does that help?
> 
> Doug


