nutch-user mailing list archives

From: Dennis Kubes <ku...@apache.org>
Subject: Re: How to effectively stop indexing javascript pages ending with .js
Date: Tue, 02 Dec 2008 20:31:23 GMT


John Martyniak wrote:
> That is good information.  Because I too have the same issue, I don't 
> want the js files in the index.
> 
> But what if you already have a bunch of .js files in your segments and 
> want to remove them from the index/segments? Is there any way to 
> effectively do that as well?

I believe (but haven't tested) that if you change the URL filters as 
discussed and then run the mergedb and mergesegs commands, giving only a 
single crawldb and a single set of segments as input, those URLs will be 
filtered out.
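
Something along these lines (an untested sketch; the paths are only 
examples, and -filter is the option that tells the merge tools to apply 
the configured URL filters):

   bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter
   bin/nutch mergesegs crawl/segments-filtered -dir crawl/segments -filter

You would then point subsequent fetching and indexing at the new crawldb 
and the merged segment.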

Dennis

> 
> -John
> 
> On Dec 2, 2008, at 12:56 PM, ML mail wrote:
> 
>> Dear Dennis
>>
>> Many thanks for your quick response. Now everything is clear and I 
>> understand why it didn't work...
>>
>> I will still use the urlfilter-regex plugin, as I would like to crawl 
>> only domains from a single top-level domain, but as suggested I have 
>> added the urlfilter-suffix plugin to avoid indexing javascript pages. 
>> I had already deactivated the parse-js plugin in the past.
>>
>> So I am now looking forward to the next crawls being freed of stupid 
>> file formats like js ;-)
>>
>> Greetings
>>
>>
>> --- On Tue, 12/2/08, Dennis Kubes <kubes@apache.org> wrote:
>>
>>> From: Dennis Kubes <kubes@apache.org>
>>> Subject: Re: How to effectively stop indexing javascript pages ending 
>>> with .js
>>> To: nutch-user@lucene.apache.org
>>> Date: Tuesday, December 2, 2008, 8:50 AM
>>> ML mail wrote:
>>>> Hello,
>>>>
>>>> I would definitely like not to index any javascript pages, meaning
>>>> any pages ending with ".js". For this purpose I simply edited the
>>>> crawl-urlfilter.txt file and added the .js extension to the default
>>>> list of suffixes to skip, so that it now looks like this:
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js)$
>>>
>>> The easiest way IMO is to use the prefix and suffix urlfilters
>>> instead of the regex urlfilter.  Change plugin.includes to replace
>>> urlfilter-regex with urlfilter-(prefix|suffix).  Then in the
>>> suffix-urlfilter.txt file add .js under .css in the web formats section.
>>>
>>> Also change parse-(text|html|js) in plugin.includes to
>>> parse-(text|html).
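>>>
>>> As a rough sketch (the exact plugin list will differ depending on what
>>> else you have enabled), the plugin.includes property in nutch-site.xml
>>> would end up looking something like:
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
>>> </property>
>>>
>>> and suffix-urlfilter.txt would simply get an extra line:
>>>
>>>   .js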
>>>
>>>>
>>>> Unfortunately I noticed that javascript pages are still getting
>>>> indexed. So what exactly does this mean? Is crawl-urlfilter.txt not
>>>> working? Did I maybe miss something?
>>>> I was also wondering what the difference is between these two files:
>>>>
>>>> crawl-urlfilter.txt
>>>> regex-urlfilter.txt
>>>
>>> The crawl-urlfilter.txt file is used by the crawl command.  The
>>> regex, suffix, prefix, and other urlfilter files and plugins are
>>> used when calling the various tools/commands manually.
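>>>
>>> For example (roughly), a one-shot
>>>
>>>   bin/nutch crawl urls -dir crawl -depth 3
>>>
>>> picks up crawl-urlfilter.txt, whereas the individual tools (inject,
>>> generate, updatedb, etc.) pick up regex-urlfilter.txt and the other
>>> configured urlfilter files.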
>>>
>>> Dennis
>>>>
>>>> ?
>>>>
>>>> Many thanks
>>>> Regards
>>>>
>>>>
>>>>
>>
>>
>>
> 
