nutch-dev mailing list archives

From Santiago Pérez <>
Subject Creating an alternative Linkdb with part of the outlinks
Date Fri, 18 Dec 2009 11:50:50 GMT


I am using Nutch for indexing websites and it is working well (most of the

I've checked that Nutch extracts the outlinks from the raw HTML code of each
parsed site in order to expand the crawling process.

I would like to keep this behaviour, but I would also like to extract the
outlinks from a specific part of the web page (for example, only from the
body of a news article) in order to build an alternative LinkDB as well, so
that I can see how news articles link to, and are linked from, other news
articles within their content.
Can anybody give me an idea of where and how I can add that new
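
To make the idea concrete, a rough sketch of the kind of extraction I mean
might look like this, in plain Java using only the JDK's DOM parser. It is
not Nutch code: the class name SectionOutlinks, the method extract and the
id value used to pick the section are made up for illustration, and it
assumes well-formed XHTML input rather than real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Nutch): collects link targets only from
// the element whose "id" attribute equals sectionId, ignoring the rest of
// the page (menus, footers, and so on).
final class SectionOutlinks {

    static List<String> extract(String xhtml, String sectionId) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            // Assumption: the caller passes well-formed XHTML.
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }

        List<String> links = new ArrayList<>();
        Element section = findById(doc.getDocumentElement(), sectionId);
        if (section != null) {
            // Only anchors that are descendants of the chosen section.
            NodeList anchors = section.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                String href = ((Element) anchors.item(i)).getAttribute("href");
                if (!href.isEmpty()) {
                    links.add(href);
                }
            }
        }
        return links;
    }

    // Depth-first search for the first element carrying the given id value.
    private static Element findById(Element element, String id) {
        if (id.equals(element.getAttribute("id"))) {
            return element;
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                Element found = findById((Element) child, id);
                if (found != null) {
                    return found;
                }
            }
        }
        return null;
    }
}
```

Called on a page whose menu and article body sit in separate div elements,
with the id of the article body as sectionId, this would return only the
links found inside the article body. Those are the links I would like to
feed into the alternative LinkDB.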

Thanks in advance from a newbie ;)