nutch-user mailing list archives

From Rémy Amouroux <r...@teorem.fr>
Subject Re: URL filtering: crawling time vs. indexing time
Date Fri, 02 Nov 2012 16:29:12 GMT
You still have several possibilities here:
1) Find a way to seed the crawl with the URLs of the pages that link to the leaf pages
(sometimes a simple loop is enough to generate them).
2) Create a regex for each step of the path leading to the leaf page, in order to limit the
crawl to the necessary pages only. Put the $ sign at the end of each regexp to limit what it
matches; without it, a regexp like http://([a-z0-9]*\.)*mysite.com matches every URL on that host.
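
To make both options concrete, here is a rough Python sketch (not Nutch code). It assumes
the filter tries each rule in order, takes the first rule whose pattern is found in the
URL, and drops URLs that match no rule. The level1pattern/level2pattern paths are the
placeholders from this thread, and the "?page={n}" listing URL in the usage lines is purely
hypothetical.

```python
import re

# Hypothetical rules in the spirit of regex-urlfilter.txt: '+' accepts, '-' rejects.
# Every accepting pattern ends with '$', so it matches one step of the path only.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*mysite\.com/?$"),
    ("+", r"^http://www\.mysite\.com/level1pattern/$"),
    ("+", r"^http://www\.mysite\.com/level1pattern/level2pattern/$"),
    ("+", r"^http://www\.mysite\.com/level1pattern/level2pattern/pagepattern\.html$"),
    ("-", r"."),  # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped


def seed_urls(template, pages):
    """Option 1: build seed URLs with a simple loop over page numbers 1..pages."""
    return [template.format(n=n) for n in range(1, pages + 1)]


if __name__ == "__main__":
    # Without '$' the host pattern also matches pages you do not want.
    unanchored = r"^http://([a-z0-9]*\.)*mysite.com"
    print(bool(re.search(unanchored, "http://www.mysite.com/other/page.html")))
    print(accepts("http://www.mysite.com/other/page.html"))
    print(accepts("http://www.mysite.com/level1pattern/level2pattern/pagepattern.html"))
    # Hypothetical listing URL; replace it with whatever the real site uses.
    print(seed_urls("http://www.mysite.com/level1pattern/?page={n}", 3))
```

With rules like these the intermediate pages are still fetched (the crawler needs them to
reach the leaves), but nothing outside the path is.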


On 2 Nov 2012, at 17:22, Joe Zhang <smartagent@gmail.com> wrote:

> The problem is that,
> 
> - if you write regex such as: +^http://([a-z0-9]*\.)*mysite.com, you'll end
> up indexing all the pages on the way, not just the leaf pages.
> - if you write specific regex for
> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html, and you
> start crawling at mysite.com, you'll get zero results, as there is no match.
> 
> On Fri, Nov 2, 2012 at 6:21 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> 
>> -----Original message-----
>>> From:Joe Zhang <smartagent@gmail.com>
>>> Sent: Fri 02-Nov-2012 10:04
>>> To: user@nutch.apache.org
>>> Subject: URL filtering: crawling time vs. indexing time
>>> 
>>> I feel like this is a trivial question, but I just can't get my head
>>> around it.
>>> 
>>> I'm using nutch 1.5.1 and solr 3.6.1 together. Things work fine at the
>>> rudimentary level.
>>> 
>>> If my understanding is correct, the regex-es in
>>> nutch/conf/regex-urlfilter.txt control the crawling behavior, i.e., which
>>> URLs to visit or not in the crawling process.
>> 
>> Yes.
>> 
>>> 
>>> On the other hand, it doesn't seem artificial for us to only want certain
>>> pages to be indexed. I was hoping to write some regular expressions as well
>>> in some config file, but I just can't find the right place. My hunch tells
>>> me that such things should not require into-the-box coding. Can anybody
>>> help?
>> 
>> What exactly do you want? Add your custom regular expressions? The
>> regex-urlfilter.txt is the place to write them to.
>> 
>>> 
>>> Again, the scenario is really rather generic. Let's say we want to crawl
>>> http://www.mysite.com. We can use the regex-urlfilter.txt to skip loops and
>>> unnecessary file types etc., but only expect to index pages with URLs like:
>>> http://www.mysite.com/level1pattern/level2pattern/pagepattern.html.
>> 
>> To do this you must simply make sure your regular expressions can do this.
>> 
>>> 
>>> Am I too naive to expect zero Java coding in this case?
>> 
>> No, you can achieve almost all kinds of exotic filtering with just the URL
>> filters and the regular expressions.
>> 
>> Cheers

