nutch-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: "URLFilterChecker" documentation
Date Tue, 13 Dec 2011 19:15:03 GMT
I get it now ... Duh :0)

Output is fine for me. What is wrong with your results Remi?

On Tue, Dec 13, 2011 at 7:09 PM, remi tassing <tassingremi@gmail.com> wrote:
> Pls check Markus's earlier email on the format. It seems to be working, but
> the output is still incorrect for me.
>
> On Tuesday, December 13, 2011, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>> Here's my output from URLFilterChecker [1]
>>
>> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
>> org.apache.nutch.net.URLFilterChecker -filterName urlfilter-regex
>> Exception in thread "main" java.lang.RuntimeException: Filter urlfilter-regex not found.
>>        at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
>>        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
>> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
>> org.apache.nutch.net.URLFilterChecker -allCombined
>> Checking combination of all URLFilters available
>> ^Z
>> [10]+  Stopped                 bin/nutch
>> org.apache.nutch.net.URLFilterChecker -allCombined
>> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
>> org.apache.nutch.net.URLFilterChecker -filterName RegexURLFilter
>> Exception in thread "main" java.lang.RuntimeException: Filter RegexURLFilter not found.
>>        at org.apache.nutch.net.URLFilterChecker.checkOne(URLFilterChecker.java:66)
>>        at org.apache.nutch.net.URLFilterChecker.main(URLFilterChecker.java:126)
>>
>> I'm noticing three things
>>
>> 1) NO reference to a single urlfilter seems to work when appended to
>> the -filterName parameter e.g. regex-urlfilter, urlfilter-regex,
>> RegexURLFilter, regex-urlfilter.txt
>> 2) When no -filterName parameter is passed but a value is passed e.g.
>> bin/nutch org.apache.nutch.net.URLFilterChecker regex-urlfilter log
>> output is as follows
>> lewis@lewis-01:~/ASF/trunk/runtime/local$ bin/nutch
>> org.apache.nutch.net.URLFilterChecker regex-urlfilter
>> Checking combination of all URLFilters available
>> Therefore it seems to incorrectly skip to the checkAll method then hang!
>> 3) If the -allCombined parameter is passed the output indicates that
>> it does the same as 2) above...
>>
>> Can you please check if you are getting the same behaviour Markus? Thank you
>>
>> [1] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/net/URLFilterChecker.java
>>
>> On Tue, Dec 13, 2011 at 5:06 PM, Markus Jelsma
>> <markus.jelsma@openindex.io> wrote:
>>> I see no log output mate :)
>>>
>>> On Tuesday 13 December 2011 17:58:36 you wrote:
>>>> Thanks Markus.
>>>>
>>>> Can you look at my log output and inform where I am going wrong
>>>> please? It seemed to be playing up for me.
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Dec 13, 2011 at 4:53 PM, Markus Jelsma
>>>>
>>>> <markus.jelsma@openindex.io> wrote:
>>>> > I've never seen it hanging and use it weekly.
>>>> >
>>>> > On Tuesday 13 December 2011 17:45:54 you wrote:
>>>> >> Hi,
>>>> >>
>>>> >> Can anyone confirm if this is an issue?
>>>> >>
>>>> >> If so I think we should log it before it goes unnoticed.
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Lewis
>>>> >>
>>>> >> On Fri, Dec 9, 2011 at 3:21 PM, Lewis John Mcgibbney
>>>> >>
>>>> >> <lewis.mcgibbney@gmail.com> wrote:
>>>> >> > If you look at the output I posted, even when I specified a particular
>>>> >> > filter, the checkAll() method is still getting called, as is indicated
>>>> >> > by the "Checking combination of all URLFilters available" log output.
>>>> >> > It's not a particularly complex class, so hopefully if we can confirm
>>>> >> > this is a bug we can fix it quickly.
>>>> >> >
>>>> >> > Finally, I must ask, Remi, which URL filters have you included in your
>>>> >> > plugin.includes property in nutch-site.xml after building Nutch?
>>>> >> >
>>>> >> > On Fri, Dec 9, 2011 at 3:11 PM, Lewis John Mcgibbney
>>>> >> >
>>>> >> > <lewis.mcgibbney@gmail.com> wrote:
>>>> >> >> Hi Remi & Markus,
>>>> >> >>
>>>> >> >> Yeah, I can replicate this, good catch Remi.
>>>> >> >>
>>>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>>> >> >> -filterName regex-urlfilter.txt
>>>> >> >>
>>>> >> >> Checking combination of all URLFilters available
>>>> >> >> ^Z
>>>> >> >> [2]+  Stopped                 bin/nutch
>>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>>> >> >> -filterName regex-urlfilter.txt
>>>> >> >> lewis@lewis-desktop:~/ASF/trunk/runtime/local$ bin/nutch
>>>> >> >> org.apache.nutch.net.URLFilterChecker http://www.heraldscotland.com
>>>> >> >> -filterName regex-urlfilter
>>>> >> >>
>>>> >> >> Checking combination of all URLFilters available
>>>> >> >>
>>>> >> >> The first instance was hanging, so was the second. This needs some
>>>> >> >> further investigation I think. Can someone else please confirm before
>>>> >> >> we log this in Jira?
>>>> >> >>
>>>> >> >> Thanks for reporting
>>>> >> >>
>>>> >> >>
>>>> >> >> On Fri, Dec 9, 2011 at 12:53 PM, remi tassing <tassingremi@gmail.com>
>>>> >> >> wrote:
>>>> >> >>> I fed it a URL but it didn't work:
>>>> >> >>>
>>>> >> >>> $ bin/nutch org.apache.nutch.net.URLFilterChecker http://www.google.com
>>>> >> >>> Checking combination of all URLFilters available
>>>> >> >>>
>>>> >> >>> Remi
>>>> >> >>>
>>>> >> >>> On Fri, Dec 9, 2011 at 2:43 PM, Markus Jelsma
>>>> >> >>>
>>>> >> >>> <markus.jelsma@openindex.io> wrote:
>>>> >> >>> > it reads from stdin so you can either type a url followed by enter
>>>> >> >>> > or feed from stdin using pipes.
>>>> >> >>> >
>>>> >> >>> > On Friday 09 December 2011 13:32:41 remi tassing wrote:
>>>> >> >>> > > Hello guys,
>>>> >> >>> > >
>>>> >> >>> > > how do you use "org.apache.nutch.net.URLFilterChecker"? It's not
>>>> >> >>> > > documented and it always shows me this "Checking combination of all
>>>> >> >>> > > URLFilters available" and then gets stuck.
>>>> >> >>> > >
>>>> >> >>> > > Remi
>>>> >> >>> >
>>>> >> >>> > --
>>>> >> >>> > Markus Jelsma - CTO - Openindex
>>>> >> >>>
>>>> >> >>> --
>>>> >> >>> Remi Tassing
>>>> >> >>
>>>> >> >> --
>>>> >> >> Lewis
>>>> >> >
>>>> >> > --
>>>> >> > Lewis
>>>> >
>>>> > --
>>>> > Markus Jelsma - CTO - Openindex
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>
>>
>>
>> --
>> Lewis
>>



-- 
Lewis
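
Note for readers of this archive: as Markus explains earlier in the thread, the checker reads URLs from stdin. The apparent hang after "Checking combination of all URLFilters available" is therefore consistent with the tool simply waiting for input. The sketch below is not the Nutch source. It is a minimal, self-contained illustration of that read loop. The class name StdinFilterSketch, the DENY pattern, and the "+"/"-" result prefixes are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.function.Consumer;
import java.util.regex.Pattern;

class StdinFilterSketch {
    // Hypothetical deny rule, for illustration only (not from any Nutch config).
    static final Pattern DENY = Pattern.compile("\\.(gif|jpg|png)$");

    // Assumed contract for this sketch: return the URL if accepted, null if rejected.
    static String filter(String url) {
        return DENY.matcher(url).find() ? null : url;
    }

    // Reads one URL per line until end of input. readLine() blocks while no
    // input is available, which looks like a hang on an interactive terminal.
    static void check(BufferedReader in, Consumer<String> sink) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            String out = filter(line);
            // Illustrative convention: "+" marks accepted, "-" marks rejected.
            sink.accept(out != null ? "+" + out : "-" + line);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Reading URLs from stdin, one per line");
        check(new BufferedReader(new InputStreamReader(System.in)), System.out::println);
    }
}
```

Run interactively, this prints its banner and then waits until a URL is typed and Enter is pressed, or until input is piped in. Following Markus's advice, the equivalent for the real tool would presumably be something like `echo "http://www.google.com" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined`, though this thread does not show the resulting output.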
