nutch-dev mailing list archives

From Renaud Richardet <...@oslutions.com>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Tue, 06 Feb 2007 23:17:20 GMT
Doug Cutting wrote:
> Renaud Richardet wrote:
>> The use case is that you index RSS feeds, but your users can search 
>> each feed entry as a single document. Does that make sense?
>
> But each feed item also contains a link whose content will be indexed 
> and that's generally a superset of the item.  
Agreed
> So should there be two urls indexed per item?  
I don't think so
> In many cases, the best thing to do is to index only the linked page, 
> not the feed item at all.  In some (rare?) cases, there might be items 
> without a link, whose only content is directly in the feed, or where 
> the content in the feed is complementary to that in the linked page.  
> In these cases it might be useful to combine the two (the feed item 
> and the linked content), indexing both.  The proposed change might 
> permit that.  Is that the case you're concerned about?
I see. I was thinking that I could index the feed items without having 
to fetch them individually.

More fundamentally, I want to index only the blog-entry text, and not 
the elements around it (header, menus, ads, ...), so as to improve the 
search results.

Here's my use case; the steps marked (*) are what the proposed changes 
would allow me to do:

1) parse feeds:

for each (feedentry : feed) do
|
|  if (the entry has full text) then
|   |  index the feed entry as a single document;
|   |  blog header and menus are not indexed. *
|  else
|   |  create a "special outlink" for the feed entry, which includes
|   |  metadata (content, time, etc.)
|  endif
|
done
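
As a rough illustration of step 1, here is a minimal Python sketch. It is 
not Nutch code: the names FeedEntry, ParseResult and parse_feed are 
hypothetical, and deciding that an entry is "full-text" by a length 
threshold is only an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FeedEntry:
    """One item of a parsed feed (hypothetical structure)."""
    link: str
    title: str
    content: str = ""    # text carried by the feed itself, possibly a summary
    published: str = ""


@dataclass
class ParseResult:
    documents: list = field(default_factory=list)         # indexed directly
    special_outlinks: list = field(default_factory=list)  # to fetch later
    indexed_links: set = field(default_factory=set)       # links to skip later


def parse_feed(entries, min_full_text_chars=200):
    """Split feed entries into directly indexable documents and special outlinks.

    Assumption: an entry counts as full-text when its content has at least
    min_full_text_chars characters.
    """
    result = ParseResult()
    for entry in entries:
        if len(entry.content) >= min_full_text_chars:
            # Index the entry itself; no blog header or menus are involved.
            result.documents.append(
                {"url": entry.link, "title": entry.title, "text": entry.content}
            )
            result.indexed_links.add(entry.link)
        else:
            # Keep the feed metadata so it can be combined with the page later.
            result.special_outlinks.append(
                {
                    "url": entry.link,
                    "metadata": {
                        "title": entry.title,
                        "content": entry.content,
                        "time": entry.published,
                    },
                }
            )
    return result
```

The set of indexed links is kept so that a later fetch loop can recognise 
links whose feed entry was already indexed.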

2) on the next fetch loop:

for each (link) do
|
|  if (this is a normal link) then
|    |  fetch it and index it normally
|  else if (this link comes from an already indexed feed entry) then
|    |  stop; do not fetch it *
|  else if (this is a "special outlink") then
|    |  guess which DOM nodes hold the post content
|    |  index it; blog header and menus are not indexed.
|  endif
|
done
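
The fetch-loop decision and the "guess which DOM nodes hold the post 
content" step could look roughly like the Python sketch below. Again, this 
is not Nutch code: decide, guess_post_text and PostTextGuesser are 
hypothetical names, a "normal" link is assumed to be any link that is in 
neither of the two other groups, and picking the container with the most 
direct text is just one simple heuristic among many.

```python
from html.parser import HTMLParser


def decide(url, indexed_feed_links, special_outlinks):
    """Return the action for one link: 'skip', 'extract' or 'normal'."""
    if url in indexed_feed_links:
        return "skip"      # its feed entry is already indexed; do not fetch
    if url in special_outlinks:
        return "extract"   # fetch, then index only the guessed post content
    return "normal"        # fetch and index the whole page as usual


class PostTextGuesser(HTMLParser):
    """Remember the container element holding the most text of its own."""

    CONTAINERS = {"div", "article", "section"}

    def __init__(self):
        super().__init__()
        self.stack = []   # open containers as (tag, list of text parts)
        self.best = ""    # longest container text seen so far

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTAINERS:
            self.stack.append((tag, []))

    def handle_data(self, data):
        # Text is credited to the innermost open container only, so a
        # page-wide wrapper does not win just because it encloses everything.
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if tag in self.CONTAINERS and self.stack and self.stack[-1][0] == tag:
            _, parts = self.stack.pop()
            text = " ".join(parts)
            if len(text) > len(self.best):
                self.best = text


def guess_post_text(html):
    """Return the text of the container most likely to hold the post."""
    parser = PostTextGuesser()
    parser.feed(html)
    parser.close()
    return parser.best
```

For example, decide(url, indexed, special) would be called once per link, 
and guess_post_text(page_html) only for links where it returns 'extract'.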


Thanks,
Renaud
