nutch-user mailing list archives

From remi tassing <tassingr...@gmail.com>
Subject Re: "URLFilterChecker" documentation
Date Tue, 13 Dec 2011 19:09:09 GMT
Please check Markus's earlier email on the format. It seems to be working, but
the output is still incorrect for me.

On Tuesday, December 13, 2011, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
> Here's my output from URLFilterChecker [1]
>
> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
> org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
> Exception in thread "main" java.lang.RuntimeException: Filter
> urlfilter-regex not found.
>        at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
>        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined
> Checking combination of all URLFilters available
> ^Z
> [10]+  Stopped                 bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined
> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
> org.apache.nutch.net.URLFilterChecker -filterName RegexURLFilter
> Exception in thread "main" java.lang.RuntimeException: Filter
> RegexURLFilter not found.
>        at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
>        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
>
> I'm noticing three things
>
> 1) NO reference to a single urlfilter seems to work when appended to
> the -filterName parameter e.g. regex-urlfilter, urlfilter-regex,
> RegexURLFilter, regex-urlfilter.txt
> 2) When no -filterName parameter is passed but a value is passed, e.g.
> bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter, the log
> output is as follows:
> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
> org.apache.nutch.net.URLFilterChecker regex-urlfilter
> Checking combination of all URLFilters available
> Therefore it seems to incorrectly skip to the checkAll method then hang!
> 3) If the -allCombined parameter is passed, the output indicates that
> it does the same as 2) above...
>
> Can you please check if you are getting the same behaviour, Markus?
> Thank you
>
> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/net/URLFilterChecker.java
>
> On Tue, Dec 13, 2011 at 5:06 PM, Markus Jelsma
> <markus.jelsma@openindex.io> wrote:
>> i see no log output mate :)
>>
>> On Tuesday 13 December 2011 17:58:36 you wrote:
>>> Thanks Markus.
>>>
>>> Can you look at my log output and inform where I am going wrong
>>> please? It seemed to be playing up for me.
>>>
>>> Thanks
>>>
>>> On Tue, Dec 13, 2011 at 4:53 PM, Markus Jelsma
>>>
>>> <markus.jelsma@openindex.io> wrote:
>>> > I've never seen it hanging and use it weekly.
>>> >
>>> > On Tuesday 13 December 2011 17:45:54 you wrote:
>>> >> Hi,
>>> >>
>>> >> Can anyone confirm if this is an issue?
>>> >>
>>> >> If so I think we should log it before it goes unnoticed.
>>> >>
>>> >> Thanks
>>> >>
>>> >> Lewis
>>> >>
>>> >> On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
>>> >>
>>> >> <lewis.mcgibbney@gmail.com> wrote:
>>> >> > If you look at the output I posted, even when I specified a particular
>>> >> > filter, the checkAll() method is still getting called, as is indicated
>>> >> > by the "Checking combination of all URLFilters available" log output.
>>> >> > It's not a particularly complex class, so hopefully if we can confirm
>>> >> > this is a bug we can fix it quickly.
>>> >> >
>>> >> > Finally, I must ask, Remi, which URL filters have you included in your
>>> >> > plugin.includes property in nutch-site.xml after building Nutch?
>>> >> >
>>> >> > On Fri, Dec 9, 2011 at 3:11 PM, Lewis John Mcgibbney
>>> >> >
>>> >> > <lewis.mcgibbney@gmail.com> wrote:
>>> >> >> Hi Remi & Markus,
>>> >> >>
>>> >> >> Yeah, I can replicate this, good catch Remi.
>>> >> >>
>>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>> >> >> -filterName regex-urlfilter.txt
>>> >> >>
>>> >> >> Checking combination of all URLFilters available
>>> >> >> ^Z
>>> >> >> [2]+  Stopped                 bin/nutch
>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>> >> >> -filterName regex-urlfilter.txt
>>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>> >> >> -filterName regex-urlfilter
>>> >> >>
>>> >> >> Checking combination of all URLFilters available
>>> >> >>
>>> >> >> The first instance was hanging, so was the second. This needs some
>>> >> >> further investigation I think. Can someone else please confirm before
>>> >> >> we log this in Jira?
>>> >> >>
>>> >> >> Thanks for reporting
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Dec 9, 2011 at 12:53 PM, remi tassing <tassingremi@gmail.com>
>>> >> >>
>>> >> >> wrote:
>>> >> >>> I fed it a URL but it didn't work:
>>> >> >>>
>>> >> >>> $ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
>>> >> >>> Checking combination of all URLFilters available
>>> >> >>>
>>> >> >>> Remi
>>> >> >>>
>>> >> >>> On Fri, Dec 9, 2011 at 2:43 PM, Markus Jelsma
>>> >> >>>
>>> >> >>> <markus.jelsma@openindex.io> wrote:
>>> >> >>> > It reads from stdin, so you can either type a URL followed by Enter
>>> >> >>> > or feed it from stdin using pipes.
>>> >> >>> >
>>> >> >>> > On Friday 09 December 2011 13:32:41 remi tassing wrote:
>>> >> >>> > > Hello guys,
>>> >> >>> > >
>>> >> >>> > > How do you use "org.apache.nutch.net.URLFilterChecker"? It's not
>>> >> >>> > > documented, and it always shows me this "Checking combination of all
>>> >> >>> > > URLFilters available" and then gets stuck.
>>> >> >>> > >
>>> >> >>> > > Remi
>>> >> >>> >
>>> >> >>> > --
>>> >> >>> > Markus Jelsma - CTO - Openindex
>>> >> >>>
>>> >> >>> --
>>> >> >>> Remi Tassing
>>> >> >>
>>> >> >> --
>>> >> >> Lewis
>>> >> >
>>> >> > --
>>> >> > Lewis
>>> >
>>> > --
>>> > Markus Jelsma - CTO - Openindex
>>
>> --
>> Markus Jelsma - CTO - Openindex
>
>
>
> --
> Lewis
>
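
Markus's remark that the checker reads from stdin would explain the apparent hang reported throughout this thread: after printing its banner, a tool built that way simply waits for input. The following is a minimal Java sketch of such a read loop. It is not the URLFilterChecker source. The class name StdinFilterSketch, the verdict method, the single regex rule, and the "+"/"-" prefixes are all hypothetical stand-ins chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public class StdinFilterSketch {
    // Hypothetical rule standing in for a real URL filter: accept http(s) URLs only.
    private static final Pattern ACCEPT = Pattern.compile("^https?://\\S+$");

    // Returns the URL prefixed with "+" if accepted and "-" if rejected
    // (an illustrative convention for this sketch).
    static String verdict(String url) {
        return (ACCEPT.matcher(url).matches() ? "+" : "-") + url;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Checking combination of all URLFilters available");
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        // readLine() blocks until a line arrives, so with nothing typed or piped
        // the program sits here silently and looks as if it has hung.
        while ((line = in.readLine()) != null) {
            String url = line.trim();
            if (!url.isEmpty()) {
                System.out.println(verdict(url));
            }
        }
    }
}
```

Run with a URL piped in (for example `echo "http://www.google.com" | java StdinFilterSketch`), the sketch prints the banner followed by one verdict line. Run with no piped input, it waits for typed URLs until end of input. A URL passed as a command-line argument is never read, which matches the behaviour Remi and Lewis describe.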
