nutch-dev mailing list archives

From Doug Cutting
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 07 Feb 2007 21:31:21 GMT
Chris Mattmann wrote:
>  Got it. So, the logic behind this is, why bother waiting until the
> following fetch to parse (and create ParseData objects from) the RSS items
> out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
> RSS metadata in it. However, it's perfectly acceptable to have feeds that
> simply have a title, description, and link in it.

Almost.  The feed may have less than the referenced page, but it's also 
a lot easier to parse, since the link could be an anchor within a large 
page, or could be a page that has lots of navigation links, spam 
comments, etc.  So feed entries are generally much more precise than the 
pages they reference, and may make for a higher-quality search experience.
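
To illustrate why feed entries are easy to parse, here is a minimal sketch in Python. It is not Nutch code: the sample feed, the function name parse_feed_items, and the dictionary keys are all hypothetical, and it assumes a plain RSS 2.0 feed without namespaces. Each item yields its own small record, keyed by the item's link.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; both links are anchors within one large page.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/archive.html#first</link>
      <description>Text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/archive.html#second</link>
      <description>Text of the second post.</description>
    </item>
  </channel>
</rss>"""


def parse_feed_items(feed_xml):
    """Return one record per <item>, using only what the feed itself provides."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        entries.append({
            # Missing elements become empty strings rather than None.
            "url": (item.findtext("link") or "").strip(),
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return entries


for entry in parse_feed_items(SAMPLE_FEED):
    print(entry["url"], "|", entry["title"])
```

Note that both records point into the same archive page, yet each carries only its own title and description, with no navigation links or comments mixed in.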

> I guess this is still
> valuable metadata information to have, however, the only caveat is that the
> implication of the proposed change is:
> 1. We won't have cached copies, or fetched copies of the Content represented
> by the item links. Therefore, in this model, we won't be able to pull up a
> Nutch cache of the page corresponding to the RSS item, because we are
> circumventing the fetch step

Good point.  We indeed wouldn't have these URLs in the cache.

> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
> single type of content, RSS. Even if there are more content types that
> follow this model, as Doug and Renaud both pointed out, there aren't a
> multitude of them (perhaps archive files, but can you think of any others)?

Also true.  On the other hand, Nutch provides 98% of an RSS search 
engine.  It'd be a shame to have to re-invent everything else and it 
would be great if Nutch could evolve to support RSS well.

Might image search also benefit from this?  One could generate a 
Parse for each image on a page whose text was from the page.  Product 
search too, perhaps.

> The other main thing that comes to mind about this for me is it prevents the
> fetched Content for the RSS items from being able to provide useful
> metadata, in the sense that it doesn't explicitly fetch the content. What if
> we wanted to apply some super cool metadata extractor X that used
> word-stemming, HTML design analysis, and other techniques to extract
> metadata from the content pointed to by an RSS item link? In the proposed
> model, we assume that the RSS xml item tag already contains all necessary
> metadata for indexing, which in my mind, limits the model. Does what I am
> saying make sense? I'm not shooting down the issue, I'm just trying to
> brainstorm a bit here about the issue.

Sure, the RSS feed may contain less than the page it references, but 
that might be all that one wishes to index.  Otherwise, if, e.g., a blog 
includes titles from other recent posts, you're going to get lots of 
false positives.  Ideally Nutch should support various options: 
searching the feed only, searching the referenced page only, or perhaps 
searching both.
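
The three options could be sketched roughly as follows, again in Python rather than as Nutch code. The names IndexSource and text_to_index are hypothetical, and the sketch assumes the referenced page's text may be unavailable (None) when the page was never fetched.

```python
from enum import Enum


class IndexSource(Enum):
    """Hypothetical setting for which text gets indexed for a feed item."""
    FEED_ONLY = "feed"
    PAGE_ONLY = "page"
    BOTH = "both"


def text_to_index(feed_text, page_text, source):
    """Pick the text to index for one feed item.

    feed_text: text taken from the feed entry itself.
    page_text: text of the referenced page, or None if it was not fetched.
    """
    if source is IndexSource.FEED_ONLY:
        return feed_text
    if source is IndexSource.PAGE_ONLY:
        return page_text or ""
    # BOTH: concatenate whatever is available, skipping empty parts.
    parts = [feed_text, page_text or ""]
    return "\n".join(p for p in parts if p)


print(text_to_index("Entry summary", "Full page text", IndexSource.BOTH))
```

With FEED_ONLY, no fetch of the referenced page is needed at all, which is the case discussed above; the other two modes still require the usual fetch step.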

