nutch-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: "URLFilterChecker" documentation
Date Tue, 13 Dec 2011 18:24:10 GMT
Here's my output from URLFilterChecker [1]:

lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
Exception in thread "main" java.lang.RuntimeException: Filter urlfilter-regex not found.
	at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
	at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
Checking combination of all URLFilters available
^Z
[10]+  Stopped                 bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker -filterName RegexURLFilter
Exception in thread "main" java.lang.RuntimeException: Filter RegexURLFilter not found.
	at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
	at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)

I'm noticing three things:

1) No reference to a single URL filter seems to work when passed to
the -filterName parameter, e.g. regex-urlfilter, urlfilter-regex,
RegexURLFilter, regex-urlfilter.txt.
2) When a value is passed without the -filterName parameter, e.g.
bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter, the log
output is as follows:
lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter
Checking combination of all URLFilters available
Therefore it seems to incorrectly skip to the checkAll method and then hang!
3) If the -allCombined parameter is passed, the output indicates that
it does the same as 2) above...

Can you please check if you are getting the same behaviour, Markus? Thank you.
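
For comparison, here is a minimal sketch of feeding URLs on stdin, which is how Markus describes the tool being used further down the thread. If that is right, the apparent hang after "Checking combination of all URLFilters available" may simply be the tool waiting for input. The NUTCH variable and the check_urls function are hypothetical names used only for illustration, and the sketch assumes it is run from runtime/local of a built checkout.

```shell
#!/bin/sh
# Assumption: bin/nutch is relative to runtime/local; NUTCH is a
# hypothetical override variable, not something Nutch itself defines.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical helper: writes each argument as one URL per line and pipes
# the lines into the checker, so the tool sees end-of-input when the pipe closes.
check_urls() {
  printf '%s\n' "$@" | "$NUTCH" org.apache.nutch.net.URLFilterChecker -allCombined
}

if [ -x "$NUTCH" ]; then
  check_urls http://www.heraldscotland.com http://www.google.com
else
  echo "$NUTCH not found; run this from runtime/local" >&2
fi
```

Typing a URL followed by Enter after starting the tool interactively should be equivalent, going by the same description. This does not explain the "not found" errors from -filterName in 1).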

[1] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/net/URLFilterChecker.java

On Tue, Dec 13, 2011 at 5:06 PM, Markus Jelsma
<markus.jelsma@openindex.io> wrote:
> i see no log output mate :)
>
> On Tuesday 13 December 2011 17:58:36 you wrote:
>> Thanks Markus.
>>
>> Can you look at my log output and inform where I am going wrong
>> please? It seemed to be playing up for me.
>>
>> Thanks
>>
>> On Tue, Dec 13, 2011 at 4:53 PM, Markus Jelsma
>>
>> <markus.jelsma@openindex.io> wrote:
>> > I've never seen it hanging and use it weekly.
>> >
>> > On Tuesday 13 December 2011 17:45:54 you wrote:
>> >> Hi,
>> >>
>> >> Can anyone confirm if this is an issue?
>> >>
>> >> If so I think we should log it before it goes unnoticed.
>> >>
>> >> Thanks
>> >>
>> >> Lewis
>> >>
>> >> On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
>> >>
>> >> <lewis.mcgibbney@gmail.com> wrote:
>> >> > If you look at the output I posted, even when I specified a particular
>> >> > filter, the checkAll() method is still getting called, as is indicated
>> >> > by the "Checking combination of all URLFilters available" log output.
>> >> > It's not a particularly complex class, so hopefully if we can confirm
>> >> > this is a bug we can fix it quickly.
>> >> >
>> >> > Finally, I must ask, Remi which URL filters have you included in your
>> >> > plugin.includes property in nutch-site.xml after building Nutch?
>> >> >
>> >> > On Fri, Dec 9, 2011 at 3:11 PM, Lewis John Mcgibbney
>> >> >
>> >> > <lewis.mcgibbney@gmail.com> wrote:
>> >> >> Hi Remi & Markus,
>> >> >>
>> >> >> Yeah, I can replicate this, good catch Remi.
>> >> >>
>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter.txt
>> >> >>
>> >> >> Checking combination of all URLFilters available
>> >> >> ^Z
>> >> >> [2]+  Stopped                 bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter.txt
>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>> >> >> -filterName regex-urlfilter
>> >> >>
>> >> >> Checking combination of all URLFilters available
>> >> >>
>> >> >> The first instance was hanging, so was the second. This needs some
>> >> >> further investigation I think. Can someone else please confirm before
>> >> >> we log this in Jira?
>> >> >>
>> >> >> Thanks for reporting
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 9, 2011 at 12:53 PM, remi tassing <tassingremi@gmail.com>
>> >> >>
>> >> >> wrote:
>> >> >>> I fed it a URL but it didn't work:
>> >> >>>
>> >> >>> $ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
>> >> >>> Checking combination of all URLFilters available
>> >> >>>
>> >> >>> Remi
>> >> >>>
>> >> >>> On Fri, Dec 9, 2011 at 2:43 PM, Markus Jelsma
>> >> >>>
>> >> >>> <markus.jelsma@openindex.io> wrote:
>> >> >>> > it reads from stdin so you can either type a url followed by enter
>> >> >>> > or feed from stdin using pipes.
>> >> >>> >
>> >> >>> > On Friday 09 December 2011 13:32:41 remi tassing wrote:
>> >> >>> > > Hello guys,
>> >> >>> > >
>> >> >>> > > how do you use "org.apache.nutch.net.URLFilterChecker"? It's not
>> >> >>> > > documented and it always shows me this "Checking combination of all
>> >> >>> > > URLFilters available" and then gets stuck.
>> >> >>> > >
>> >> >>> > > Remi
>> >> >>> >
>> >> >>> > --
>> >> >>> > Markus Jelsma - CTO - Openindex
>> >> >>>
>> >> >>> --
>> >> >>> Remi Tassing
>> >> >>
>> >> >> --
>> >> >> Lewis
>> >> >
>> >> > --
>> >> > Lewis
>> >
>> > --
>> > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex



-- 
Lewis
