nutch-dev mailing list archives

From Chris Mattmann <chris.mattm...@jpl.nasa.gov>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Thu, 08 Feb 2007 15:34:25 GMT
Hi Doug,

  Okay, I see your points. It seems like this would be really useful for
some current folks, and for Nutch going forward. I see that there has been
some initial work today on preparing patches. I'd be happy to shepherd this
into the sources. I will begin reviewing what's required, and contacting the
folks who've begun work on this issue.

Thanks!

Cheers,
  Chris



On 2/7/07 1:31 PM, "Doug Cutting" <cutting@apache.org> wrote:

> Chris Mattmann wrote:
>>  Got it. So, the logic behind this is, why bother waiting until the
>> following fetch to parse (and create ParseData objects from) the RSS items
>> out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
>> RSS metadata in it. However, it's perfectly acceptable to have feeds that
>> simply have a title, description, and link in them.
> 
> Almost.  The feed may have less than the referenced page, but it's also
> a lot easier to parse, since the link could be an anchor within a large
> page, or could be a page that has lots of navigation links, spam
> comments, etc.  So feed entries are generally much more precise than the
> pages they reference, and may make for a higher-quality search experience.
> 
>> I guess this is still
>> valuable metadata information to have, however, the only caveat is that the
>> implication of the proposed change is:
>> 
>> 1. We won't have cached copies, or fetched copies of the Content represented
>> by the item links. Therefore, in this model, we won't be able to pull up a
>> Nutch cache of the page corresponding to the RSS item, because we are
>> circumventing the fetch step
> 
> Good point.  We indeed wouldn't have these URLs in the cache.
> 
>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>> single type of content, RSS. Even if there are more content types that
>> follow this model, as Doug and Renaud both pointed out, there aren't a
>> multitude of them (perhaps archive files, but can you think of any others)?
> 
> Also true.  On the other hand, Nutch provides 98% of an RSS search
> engine.  It'd be a shame to have to re-invent everything else and it
> would be great if Nutch could evolve to support RSS well.
> 
> Might image search also benefit from this?  One could generate a
> Parse for each image on a page whose text was from the page.  Product
> search too, perhaps.
> 
>> The other main thing that comes to mind about this for me is it prevents the
>> fetched Content for the RSS items from being able to provide useful
>> metadata, in the sense that it doesn't explicitly fetch the content. What if
>> we wanted to apply some super cool metadata extractor X that used
>> word-stemming, HTML design analysis, and other techniques to extract
>> metadata from the content pointed to by an RSS item link? In the proposed
>> model, we assume that the RSS xml item tag already contains all necessary
>> metadata for indexing, which in my mind, limits the model. Does what I am
>> saying make sense? I'm not shooting down the issue, I'm just trying to
>> brainstorm a bit here about the issue.
> 
> Sure, the RSS feed may contain less than the page it references, but
> that might be all that one wishes to index.  Otherwise, if, e.g., a blog
> includes titles from other recent posts you're going to get lots of
> false positives.  Ideally Nutch should support various options:
> searching the feed only, searching the referenced page only, or perhaps
> searching both.
> 
> Doug
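
The idea discussed in the quoted thread, producing one indexable entry per
feed item at parse time instead of waiting for a later fetch of each link,
could be sketched roughly as follows. This is plain Java using only the JDK's
XML parser. FeedItemSketch, ItemRecord, and parseItems are hypothetical names
chosen for illustration; they are not Nutch's Parse/ParseData API, and a real
change would have to fit Nutch's parser plugin interfaces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: none of these names are Nutch APIs.
class FeedItemSketch {

    // One indexable record per RSS <item>, keyed by the item's link.
    static final class ItemRecord {
        final String url;
        final String title;
        final String text;

        ItemRecord(String url, String title, String text) {
            this.url = url;
            this.title = title;
            this.text = text;
        }
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Turns a single fetched feed document into many records, one per item.
    static List<ItemRecord> parseItems(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

        NodeList items = doc.getElementsByTagName("item");
        List<ItemRecord> records = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = childText(item, "link");
            if (link.isEmpty()) {
                continue; // without a link there is no URL to index the item under
            }
            // Feeds may carry only title, description, and link; missing fields become "".
            records.add(new ItemRecord(link,
                    childText(item, "title"),
                    childText(item, "description")));
        }
        return records;
    }
}
```

Note that this sketch never fetches the linked pages, so it has the same
trade-off raised in the thread: there is no cached copy of the page behind
each item, and only the metadata present in the feed is available.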


