nutch-user mailing list archives

From karthik085
Subject Re: Ignore Robots meta tag
Date Fri, 27 Apr 2007 19:35:12 GMT

Oops... I meant IndexSegment, not PruneIndexTool.
PruneIndexTool prunes unwanted content from existing Nutch indexes. :-)

karthik085 wrote:
> Hi,
> I am trying to index a website. Its HTML pages contain
>   <meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>
> To remove this, the site owners would have to remove it from all their
> pages, and they don't want to regenerate these pages from the database.
> I have already crawled this website. Is there any way I can make Nutch ignore
> the tag above and index the pages anyway?
> One way I can think of is:
> a) Retrieve HTML from segments
> b) Remove that line
> c) Write back
> d) Re-index
> Does anyone have a better solution? Can I use PruneIndexTool?
> If the above is the way to go, how do I do it? I mean, which commands do
> I need to issue, and which classes do I need to call and modify?
> Any help is appreciated. Thanks.
> Karthik
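
For what it's worth, step b) in the quoted message (removing the robots meta line) can be sketched as plain text processing. This is only an illustration in Python: it works on an HTML string that has already been extracted, and it does not read or write Nutch segments. The name `strip_robots_meta` is made up for this example, and the regex is an assumption about how the tag is written, not a full HTML parser.

```python
import re

# Matches a <meta> tag whose name attribute is "robots" (any letter case,
# single, double, or no quotes), plus the line break that follows it.
# Assumption: the tag contains no literal '>' inside its attribute values.
ROBOTS_META = re.compile(
    r"""<meta\b[^>]*\bname\s*=\s*['"]?robots['"]?[^>]*>[ \t]*\r?\n?""",
    re.IGNORECASE,
)


def strip_robots_meta(html: str) -> str:
    """Return the HTML with every robots meta tag removed (hypothetical helper)."""
    return ROBOTS_META.sub("", html)


if __name__ == "__main__":
    page = (
        "<html><head>\n"
        "<meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>\n"
        "<title>Example</title>\n"
        "</head><body>Hello</body></html>\n"
    )
    print(strip_robots_meta(page))
```

Steps a), c) and d) depend on the Nutch version and its segment format, so they are deliberately left out here.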
