nutch-dev mailing list archives

From Doğacan Güney <dogacan.gu...@agmlab.com>
Subject Re: RSS-fecter and index individul-how can i realize this function
Date Wed, 07 Feb 2007 07:50:51 GMT
Renaud Richardet wrote:
> Doug Cutting wrote:
>> Renaud Richardet wrote:
>>> The use case is that you index RSS feeds, but your users can search
>>> each feed entry as a single document. Does it make sense?
>>
>> But each feed item also contains a link whose content will be indexed
>> and that's generally a superset of the item.  
> Agreed
>> So should there be two urls indexed per item?  
> I don't think so
>> In many cases, the best thing to do is to index only the linked page,
>> not the feed item at all.  In some (rare?) cases, there might be
>> items without a link, whose only content is directly in the feed, or
>> where the content in the feed is complementary to that in the linked
>> page.  In these cases it might be useful to combine the two (the feed
>> item and the linked content), indexing both.  The proposed change
>> might permit that.  Is that the case you're concerned about?
> I see. I was thinking that I could index the feed items without having
> to fetch them individually.
>
> More fundamentally, I want to index only the blog-entry text, and not
> the elements around it (header, menus, ads, ...), so as to improve the
> search results.
>
> Here's my case, the proposed changes would allow me to do (*)
>
> 1) parse feeds:
>
> for each (feedentry : feed) do
> |
> |  if (full-text entries) then
> |   |  index each feed entry as a single document; blog header, menus are not indexed. *
> |  else
> |   |  create a "special outlink" for each feed entry, which includes metadata (content, time, etc)
> |  endif
> |
> done
>
> 2) on a next fetch loop:
>
> for each (link) do
> |
> |  if (this is a normal link)
> |    |  fetch it and index it normally
> |  else if (this link comes from an already indexed feed entry) then
> |    |  end, do not fetch it *
> |  else if (this is a "special outlink")
> |    |  guess which DOM nodes hold the post content
> |    |  index it; blog header, menus are not indexed.
> |  endif
> |
> done
>
I agree with Renaud Richardet.
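
Roughly, the two passes Renaud describes could look like the sketch
below. This is only an illustration, not Nutch code: FeedEntry, Outlink,
parse_feed, fetch_loop, and the fetch/extract_post callbacks are all
made-up names, and the index is just a dict. The second pass checks the
"already indexed from the feed" case first so that no fetch is wasted.

```python
from dataclasses import dataclass, field


@dataclass
class FeedEntry:
    link: str
    content: str = ""
    full_text: bool = False  # True if the feed carries the whole post


@dataclass
class Outlink:
    url: str
    special: bool = False  # hypothetical "special outlink" flag
    metadata: dict = field(default_factory=dict)


def parse_feed(entries, index):
    """Pass 1: index full-text entries, emit special outlinks for the rest."""
    outlinks = []
    for entry in entries:
        if entry.full_text:
            # The entry is the document; blog header and menus never appear.
            index[entry.link] = entry.content
        else:
            outlinks.append(
                Outlink(entry.link, special=True,
                        metadata={"summary": entry.content})
            )
    return outlinks


def fetch_loop(links, index, fetch, extract_post):
    """Pass 2: fetch what is still missing from the index."""
    for link in links:
        if link.url in index:
            continue  # already indexed from a feed entry, do not fetch
        page = fetch(link.url)
        if link.special:
            # Keep only the part of the page guessed to be the post.
            index[link.url] = extract_post(page)
        else:
            index[link.url] = page  # normal link, indexed as usual
```

The hard part (guessing which DOM nodes hold the post) is hidden behind
extract_post here and is not addressed by this sketch.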

Also, I think it all boils down to speed. If you are building a blog
search engine, you want it to update feeds as fast as it can. Crawling
two depths (one for the RSS feeds, one for their outlinks) will slow
it down.

Besides that, many blog crawlers (like
http://help.yahoo.com/help/us/ysearch/crawling/crawling-02.html) set
crawl-delay to 1, so I guess most web servers are OK with that for
RSS feeds, but not necessarily for HTML pages. (So you would do depth 1,
the RSS feeds, very fast with a 1 second delay, and then fetch the
items with a 5 second delay.)
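
To put rough numbers on the two delays, here is a back-of-the-envelope
sketch. The 1 and 5 second delays are the ones mentioned above; the
feed and item counts are made up, everything is assumed to sit on one
host, and download time is ignored.

```python
def crawl_seconds(n_requests, delay):
    """Waiting time for n sequential requests to one host,
    pausing `delay` seconds between consecutive requests."""
    return max(n_requests - 1, 0) * delay


def two_depth_seconds(n_feeds, items_per_feed,
                      feed_delay=1.0, page_delay=5.0):
    """Return (feed pass, item pass) waiting times in seconds."""
    feed_time = crawl_seconds(n_feeds, feed_delay)
    page_time = crawl_seconds(n_feeds * items_per_feed, page_delay)
    return feed_time, page_time


# Hypothetical host with 10 feeds of 10 items each.
feed_time, page_time = two_depth_seconds(10, 10)
print(feed_time, page_time)  # 9.0 495.0
```

With these made-up counts the second depth dominates the crawl, which
is why indexing full-text entries straight from the feed is attractive.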

(I hope it is not stupid to point out Yahoo's crawler to someone who
works at Yahoo :)

--
Doğacan Güney

>
> Thanks,
> Renaud
>
>
>

