<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>user@nutch.apache.org Archives</title>
<link rel="self" href="http://mail-archives.apache.org/mod_mbox/nutch-user/?format=atom"/>
<link href="http://mail-archives.apache.org/mod_mbox/nutch-user/"/>
<id>http://mail-archives.apache.org/mod_mbox/nutch-user/</id>
<updated>2013-05-24T10:56:19Z</updated>
<entry>
<title>Re: Nutch 2.1: extension point ParseFilter: doc is null</title>
<author><name>Martin Aesch &lt;martin.aesch@googlemail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c1369389655.9580.10.camel@senf.dw.privat%3e"/>
<id>urn:uuid:%3c1369389655-9580-10-camel@senf-dw-privat%3e</id>
<updated>2013-05-24T10:00:55Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Dear Lewis,&#010;&#010;Thanks for the hints. Yes, I am currently using the neko parser and did some&#010;debugging. It seems that this issue is still open for nutch-2.1:&#010;https://issues.apache.org/jira/browse/NUTCH-1253&#010;(see the stack trace further below).&#010;However, I had no success switching to neko 1.9.6.2 (since it is already&#010;in the maven repo) and have not yet tried the solution suggested in the&#010;issue (neko 1.9.12).&#010;&#010;With tagsoup, the reltag plugin works smoothly.&#010;&#010;Best regards,&#010;Martin&#010;&#010;&#010;&#010;&#010;&#010;java.util.concurrent.ExecutionException: java.lang.AbstractMethodError:&#010;org.cyberneko.html.HTMLScanner.getCharacterOffset()I&#010;        at java.util.concurrent.FutureTask&#010;$Sync.innerGet(FutureTask.java:262)&#010;        at java.util.concurrent.FutureTask.get(FutureTask.java:119)&#010;        at&#010;org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:148)&#010;        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:129)&#010;        at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:176)&#010;        at org.apache.nutch.parse.ParserJob&#010;$ParserMapper.map(ParserJob.java:129)&#010;        at org.apache.nutch.parse.ParserJob&#010;$ParserMapper.map(ParserJob.java:78)&#010;        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)&#010;        at&#010;org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)&#010;        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)&#010;        at org.apache.hadoop.mapred.LocalJobRunner&#010;$Job.run(LocalJobRunner.java:212)&#010;Caused by: java.lang.AbstractMethodError:&#010;org.cyberneko.html.HTMLScanner.getCharacterOffset()I&#010;        at org.apache.xerces.xni.parser.XMLParseException.&lt;init&gt;(Unknown&#010;Source)&#010;        at org.cyberneko.html.HTMLConfiguration&#010;$ErrorReporter.createException(HTMLConfiguration.java:673)&#010;        at 
org.cyberneko.html.HTMLConfiguration&#010;$ErrorReporter.reportError(HTMLConfiguration.java:662)&#010;        at org.cyberneko.html.HTMLScanner&#010;$ContentScanner.scanAttribute(HTMLScanner.java:2468)&#010;        at org.cyberneko.html.HTMLScanner&#010;$ContentScanner.scanAttribute(HTMLScanner.java:2424)&#010;        at org.cyberneko.html.HTMLScanner&#010;$ContentScanner.scanStartElement(HTMLScanner.java:2328)&#010;        at org.cyberneko.html.HTMLScanner&#010;$ContentScanner.scan(HTMLScanner.java:1881)&#010;        at&#010;org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:809)&#010;        at&#010;org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)&#010;        at&#010;org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)&#010;        at&#010;org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)&#010;        at&#010;org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:275)&#010;        at&#010;org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:238)&#010;        at&#010;org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:173)&#010;        at&#010;org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)&#010;        at&#010;org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)&#010;        at java.util.concurrent.FutureTask&#010;$Sync.innerRun(FutureTask.java:334)&#010;        at java.util.concurrent.FutureTask.run(FutureTask.java:166)&#010;        at&#010;java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)&#010;        at java.util.concurrent.ThreadPoolExecutor&#010;$Worker.run(ThreadPoolExecutor.java:615)&#010;        at java.lang.Thread.run(Thread.java:722)&#010;&#010;&#010;&#010;&#010;&#010;&#010;&#010;&#010;&#010;AbstractMethodError&#010;On Thu, 2013-05-23 at 21:16 -0700, Lewis John Mcgibbney wrote:&#010;&gt; Hi Martin,&#010;&gt; I am struggling to understand how the DocumentFragment (populated either 
by&#010;&gt; private methods parseTagSoup or parseNeko depending on your config in&#010;&gt; nutch-site.xml) is null!&#010;&gt; What you don't mention is some problem you are having?&#010;&gt; I can't DEBUG the code tonight but I am interested to see what is up here.&#010;&gt; Lewis&#010;&gt; &#010;&gt; On Thursday, May 23, 2013, Martin Aesch &lt;martin.aesch@googlemail.com&gt; wrote:&#010;&gt; &gt; Dear nutchers,&#010;&gt; &gt;&#010;&gt; &gt; I extended the ParseFilter extension point&#010;&gt; &gt;&#010;&gt; &gt; public Parse filter(String url, WebPage page, Parse parse,&#010;&gt; &gt;     HTMLMetaTags metaTags, DocumentFragment doc) {&#010;&gt; &gt;&#010;&gt; &gt; From what I understand, plugin parse-html should populate the&#010;&gt; &gt; DocumentFragment doc.&#010;&gt; &gt;&#010;&gt; &gt; Unfortunately, doc is always null. I tried this with my own plugin, as&#010;&gt; &gt; well as with the nutch-shipped plugin microformats-reltag, which extends&#010;&gt; &gt; the same point.&#010;&gt; &gt;&#010;&gt; &gt; Both plugins are working, and they are called. I attached my debugger,&#010;&gt; &gt; and both for my own plugin as well as for the reltag-plugin, doc is&#010;&gt; &gt; always null.&#010;&gt; &gt;&#010;&gt; &gt; I checked parse-plugins.xml, yes, parse-html is called and my mime-types&#010;&gt; &gt; are those which call parse-html&#010;&gt; &gt; (extension-id="org.apache.nutch.parse.html.HtmlParser").&#010;&gt; &gt;&#010;&gt; &gt; What am I missing?&#010;&gt; &gt;&#010;&gt; &gt; Thanks,&#010;&gt; &gt; Martin&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &#010;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1: extension point ParseFilter: doc is null</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif3daDyqWNdAZFh7h6QYhTdnec54h9oYqniaNqu8=52Qpg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif3daDyqWNdAZFh7h6QYhTdnec54h9oYqniaNqu8=52Qpg@mail-gmail-com%3e</id>
<updated>2013-05-24T04:16:10Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Martin,&#010;I am struggling to understand how the DocumentFragment (populated either by&#010;private methods parseTagSoup or parseNeko depending on your config in&#010;nutch-site.xml) is null!&#010;What you don't mention is some problem you are having?&#010;I can't DEBUG the code tonight but I am interested to see what is up here.&#010;Lewis&#010;&#010;On Thursday, May 23, 2013, Martin Aesch &lt;martin.aesch@googlemail.com&gt; wrote:&#010;&gt; Dear nutchers,&#010;&gt;&#010;&gt; I extended the ParseFilter extension point&#010;&gt;&#010;&gt; public Parse filter(String url, WebPage page, Parse parse,&#010;&gt;     HTMLMetaTags metaTags, DocumentFragment doc) {&#010;&gt;&#010;&gt; From what I understand, plugin parse-html should populate the&#010;&gt; DocumentFragment doc.&#010;&gt;&#010;&gt; Unfortunately, doc is always null. I tried this with my own plugin, as&#010;&gt; well as with the nutch-shipped plugin microformats-reltag, which extends&#010;&gt; the same point.&#010;&gt;&#010;&gt; Both plugins are working, and they are called. I attached my debugger,&#010;&gt; and both for my own plugin as well as for the reltag-plugin, doc is&#010;&gt; always null.&#010;&gt;&#010;&gt; I checked parse-plugins.xml, yes, parse-html is called and my mime-types&#010;&gt; are those which call parse-html&#010;&gt; (extension-id="org.apache.nutch.parse.html.HtmlParser").&#010;&gt;&#010;&gt; What am I missing?&#010;&gt;&#010;&gt; Thanks,&#010;&gt; Martin&#010;&gt;&#010;&gt;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Explanation of RegexURLFIlterTestBase benchmark's</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif1EYfJed7agVaH_K7-BYTLzhPfJ_FsJjjgV09hYpwnDvQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif1EYfJed7agVaH_K7-BYTLzhPfJ_FsJjjgV09hYpwnDvQ@mail-gmail-com%3e</id>
<updated>2013-05-24T01:51:14Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Kirby,&#010;&#010;&#010;On Thu, May 23, 2013 at 6:36 PM, Kirby Bohling &lt;kirby.bohling@gmail.com&gt; wrote:&#010;&#010;&gt;&#010;&gt; Not that I think you need them in particular, but it seems like Nutch could&#010;&gt; be doing plenty of benchmarking, and micro benchmarking in particular.&#010;&gt;&#010;&#010;I agree with this. It is not my goal to attack this head on but (I think)&#010;it is useful for us to know more about the different components of Nutch&#010;and how they operate; micro benchmarking would certainly be a way of making&#010;this realistic.&#010;This being said, I am quite keen on the idea of third party libraries (such&#010;as dk.brics automaton [0]) being tested in their own environment, by their&#010;own development team. In this case, some comparative *results* (of an older&#010;dk.brics library) can be seen here [1].&#010;Anyone is free to infer from this what they wish, but it gives a bit of an&#010;idea about the gains which can be achieved.&#010;If regex parsing is something which you (I mean this collectively to refer to&#010;anyone) think is a bottleneck for your Nutch deployment, try out the&#010;automaton plugin and hopefully things get better for you. AFAIK we use the&#010;most up-to-date library available here so things should work well.&#010;&#010;Thanks for the post, Kirby.&#010;&#010;[0] http://www.brics.dk/automaton/index.html&#010;[1] http://tusker.org/regex/regex_benchmark.html&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Explanation of RegexURLFIlterTestBase benchmark's</title>
<author><name>Kirby Bohling &lt;kirby.bohling@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCA+bn5ryUEz9-oktGV2rb0ZhDDxUnOpPvDr2MzjkxzxU-4g3zjg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCA+bn5ryUEz9-oktGV2rb0ZhDDxUnOpPvDr2MzjkxzxU-4g3zjg@mail-gmail-com%3e</id>
<updated>2013-05-24T01:36:43Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Re-reading my e-mail, I realize it might be read poorly.  Thanks for giving&#010;me the benefit of the doubt.&#010;&#010;There's a bunch of good material out on the web, and ultimately, the truth&#010;is that micro benchmarks can always be misleading, and the only accurate&#010;benchmark is real workload testing.  That said, micro benchmarks can be&#010;accurate and useful in limited contexts.&#010;&#010;There are several good resources if you want to do micro benchmarking w/&#010;Java:&#010;https://code.google.com/p/caliper/  (Full Disclosure: Written by my&#010;employer.  I've never used it, but the theory/docs are sound)&#010;&#010;Peter Lawrey has a good blog that touches on issues of performance and has&#010;a couple of posts explicitly on mistakes he's made in extremely high&#010;performance benchmarking:&#010;http://mechanical-sympathy.blogspot.com/&#010;&#010;Pretty decent explanations here:&#010;http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java&#010;&#010;Not that I think you need them in particular, but it seems like Nutch could&#010;be doing plenty of benchmarking, and micro benchmarking in particular.&#010; Knowing the pitfalls is valuable.  Lots of very smart people screw this up&#010;regularly and make poorly founded decisions armed with faulty data.  Not&#010;always sure I qualify as smart, but I've done it more than once.&#010;&#010;Anyways, those references have plenty of gory details for folks who are&#010;interested in why things like this happen.&#010;&#010;Kirby&#010;&#010;&#010;&#010;On Thu, May 23, 2013 at 6:48 PM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; You know, this was my suspicion Kirby.&#010;&gt; Thanks for giving the heads up... 
automaton rocks.&#010;&gt; Lewis&#010;&gt;&#010;&gt;&#010;&gt; On Thu, May 23, 2013 at 5:06 PM, Kirby Bohling &lt;kirby.bohling@gmail.com&#010;&gt; &gt;wrote:&#010;&gt;&#010;&gt; &gt; Standard micro-benchmark issues with Java, run the 50 last and it'll run&#010;&gt; &gt; faster.  JVM warmup, and JIT compilation, yadda, yadda, yadda.&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney &lt;&#010;&gt; &gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt; &gt;&#010;&gt; &gt; &gt; Hi All,&#010;&gt; &gt; &gt; A really nice aspect of the regex (urlfilter-automaton and&#010;&gt; &gt; urfilter-regex)&#010;&gt; &gt; &gt; plugin implementation's in Nutch is that there is a small but very&#010;&gt; useful&#010;&gt; &gt; &gt; RegexURLFilterBaseTest [0] which compares benchmarks for simple regex&#010;&gt; &gt; &gt; parsing.&#010;&gt; &gt; &gt; The results we get are as follows&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; urls      automaton      regex&#010;&gt; &gt; &gt; 50        343ms           210ms&#010;&gt; &gt; &gt; 100      48ms             187ms&#010;&gt; &gt; &gt; 200      65ms             363ms&#010;&gt; &gt; &gt; 400      100ms           692ms&#010;&gt; &gt; &gt; 800      165ms           1385ms&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; The problem I have here is understanding why the first (50) bench&#010;&gt; appears&#010;&gt; &gt; &gt; to be more expensive for both implementations?&#010;&gt; &gt; &gt; Additionally, why does this same bench cost much more for automaton?&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; Anyone have a clue?&#010;&gt; &gt; &gt; Thanks&#010;&gt; &gt; &gt; Lewis&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; [0]&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; --&#010;&gt; &gt; &gt; *Lewis*&#010;&gt; &gt; 
&gt;&#010;&gt; &gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; --&#010;&gt; *Lewis*&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Explanation of RegexURLFIlterTestBase benchmark's</title>
<author><name>Tejas Patil &lt;tejas.patil.cs@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAFKhtFyXKYcsng9WmDVX9dyPi6tdmZQgG+mm5GYD-a+3=fnedg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAFKhtFyXKYcsng9WmDVX9dyPi6tdmZQgG+mm5GYD-a+3=fnedg@mail-gmail-com%3e</id>
<updated>2013-05-24T00:58:40Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Just ran the tests twice (to be clear: invoked bench() twice in same run)&#010;to see the timings for regex-urlfilter:&#010;&#010;(inputs) time&#010;*(50) 231ms*&#010;(100) 169ms&#010;(200) 326ms&#010;(400) 683ms&#010;(800) 1420ms&#010;*(50) 109ms*&#010;(100) 188ms&#010;(200) 319ms&#010;(400) 714ms&#010;(800) 1442ms&#010;&#010;Kirby is right.&#010;&#010;&#010;On Thu, May 23, 2013 at 5:48 PM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; You know, this was my suspicion Kirby.&#010;&gt; Thanks for giving the heads up... automaton rocks.&#010;&gt; Lewis&#010;&gt;&#010;&gt;&#010;&gt; On Thu, May 23, 2013 at 5:06 PM, Kirby Bohling &lt;kirby.bohling@gmail.com&#010;&gt; &gt;wrote:&#010;&gt;&#010;&gt; &gt; Standard micro-benchmark issues with Java, run the 50 last and it'll run&#010;&gt; &gt; faster.  JVM warmup, and JIT compilation, yadda, yadda, yadda.&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney &lt;&#010;&gt; &gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt; &gt;&#010;&gt; &gt; &gt; Hi All,&#010;&gt; &gt; &gt; A really nice aspect of the regex (urlfilter-automaton and&#010;&gt; &gt; urfilter-regex)&#010;&gt; &gt; &gt; plugin implementation's in Nutch is that there is a small but very&#010;&gt; useful&#010;&gt; &gt; &gt; RegexURLFilterBaseTest [0] which compares benchmarks for simple regex&#010;&gt; &gt; &gt; parsing.&#010;&gt; &gt; &gt; The results we get are as follows&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; urls      automaton      regex&#010;&gt; &gt; &gt; 50        343ms           210ms&#010;&gt; &gt; &gt; 100      48ms             187ms&#010;&gt; &gt; &gt; 200      65ms             363ms&#010;&gt; &gt; &gt; 400      100ms           692ms&#010;&gt; &gt; &gt; 800      165ms           1385ms&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; The problem I have here is understanding why the first (50) bench&#010;&gt; appears&#010;&gt; &gt; &gt; to be more expensive for both 
implementations?&#010;&gt; &gt; &gt; Additionally, why does this same bench cost much more for automaton?&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; Anyone have a clue?&#010;&gt; &gt; &gt; Thanks&#010;&gt; &gt; &gt; Lewis&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; [0]&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; --&#010;&gt; &gt; &gt; *Lewis*&#010;&gt; &gt; &gt;&#010;&gt; &gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; --&#010;&gt; *Lewis*&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Explanation of RegexURLFIlterTestBase benchmark's</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif1HpORyOAzYHHvG2Eu4D244Ww7dco3xekYhA=Mf_eNmHw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif1HpORyOAzYHHvG2Eu4D244Ww7dco3xekYhA=Mf_eNmHw@mail-gmail-com%3e</id>
<updated>2013-05-24T00:48:27Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
You know, this was my suspicion Kirby.&#010;Thanks for giving the heads up... automaton rocks.&#010;Lewis&#010;&#010;&#010;On Thu, May 23, 2013 at 5:06 PM, Kirby Bohling &lt;kirby.bohling@gmail.com&gt;wrote:&#010;&#010;&gt; Standard micro-benchmark issues with Java, run the 50 last and it'll run&#010;&gt; faster.  JVM warmup, and JIT compilation, yadda, yadda, yadda.&#010;&gt;&#010;&gt;&#010;&gt; On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney &lt;&#010;&gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt;&#010;&gt; &gt; Hi All,&#010;&gt; &gt; A really nice aspect of the regex (urlfilter-automaton and&#010;&gt; urfilter-regex)&#010;&gt; &gt; plugin implementation's in Nutch is that there is a small but very useful&#010;&gt; &gt; RegexURLFilterBaseTest [0] which compares benchmarks for simple regex&#010;&gt; &gt; parsing.&#010;&gt; &gt; The results we get are as follows&#010;&gt; &gt;&#010;&gt; &gt; urls      automaton      regex&#010;&gt; &gt; 50        343ms           210ms&#010;&gt; &gt; 100      48ms             187ms&#010;&gt; &gt; 200      65ms             363ms&#010;&gt; &gt; 400      100ms           692ms&#010;&gt; &gt; 800      165ms           1385ms&#010;&gt; &gt;&#010;&gt; &gt; The problem I have here is understanding why the first (50) bench appears&#010;&gt; &gt; to be more expensive for both implementations?&#010;&gt; &gt; Additionally, why does this same bench cost much more for automaton?&#010;&gt; &gt;&#010;&gt; &gt; Anyone have a clue?&#010;&gt; &gt; Thanks&#010;&gt; &gt; Lewis&#010;&gt; &gt;&#010;&gt; &gt; [0]&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup&#010;&gt; &gt;&#010;&gt; &gt; --&#010;&gt; &gt; *Lewis*&#010;&gt; &gt;&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Explanation of RegexURLFIlterTestBase benchmark's</title>
<author><name>Kirby Bohling &lt;kirby.bohling@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCA+bn5ryVWmKdbr-kyZxxrJH-v2RRU8K0qO5Hn-rLft5R68hC6g@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCA+bn5ryVWmKdbr-kyZxxrJH-v2RRU8K0qO5Hn-rLft5R68hC6g@mail-gmail-com%3e</id>
<updated>2013-05-24T00:06:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Standard micro-benchmark issues with Java, run the 50 last and it'll run&#010;faster.  JVM warmup, and JIT compilation, yadda, yadda, yadda.&#010;&#010;&#010;On Thu, May 23, 2013 at 1:57 PM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; Hi All,&#010;&gt; A really nice aspect of the regex (urlfilter-automaton and urfilter-regex)&#010;&gt; plugin implementation's in Nutch is that there is a small but very useful&#010;&gt; RegexURLFilterBaseTest [0] which compares benchmarks for simple regex&#010;&gt; parsing.&#010;&gt; The results we get are as follows&#010;&gt;&#010;&gt; urls      automaton      regex&#010;&gt; 50        343ms           210ms&#010;&gt; 100      48ms             187ms&#010;&gt; 200      65ms             363ms&#010;&gt; 400      100ms           692ms&#010;&gt; 800      165ms           1385ms&#010;&gt;&#010;&gt; The problem I have here is understanding why the first (50) bench appears&#010;&gt; to be more expensive for both implementations?&#010;&gt; Additionally, why does this same bench cost much more for automaton?&#010;&gt;&#010;&gt; Anyone have a clue?&#010;&gt; Thanks&#010;&gt; Lewis&#010;&gt;&#010;&gt; [0]&#010;&gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup&#010;&gt;&#010;&gt; --&#010;&gt; *Lewis*&#010;&gt;&#010;&#010;
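The warm-up effect described in this message can be reproduced with a
small sketch. This is not the Nutch RegexURLFilterBaseTest: the class
name WarmupBench, the synthetic URLs and the suffix pattern are all
made up for illustration, and only java.util.regex is used. On a
typical JVM the first pass tends to be much slower than a pass taken
after warm-up, but the actual numbers vary from machine to machine.

```java
import java.util.regex.Pattern;

public class WarmupBench {

    // Builds n synthetic URLs; the host name is hypothetical.
    static String[] makeUrls(int n) {
        String[] urls = new String[n];
        for (int i = 0; i != n; i++) {
            urls[i] = "http://example.org/page" + i + (i % 2 == 0 ? ".html" : ".gif");
        }
        return urls;
    }

    // Counts the URLs in which the pattern is found.
    static int countMatches(Pattern pattern, String[] urls) {
        int hits = 0;
        for (String url : urls) {
            if (pattern.matcher(url).find()) {
                hits++;
            }
        }
        return hits;
    }

    // Returns the wall-clock time of one pass in nanoseconds.
    static long timePass(Pattern pattern, String[] urls) {
        long start = System.nanoTime();
        int hits = countMatches(pattern, urls);
        long elapsed = System.nanoTime() - start;
        // Read the result so the loop is not trivially dead code.
        if (hits == -1) {
            System.out.println(hits);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\.(gif|jpg|png)$");
        String[] urls = makeUrls(50);
        // The first pass also pays for class loading and interpreted execution.
        System.out.println("cold pass: " + timePass(pattern, urls) / 1000 + " us");
        // Repeated passes give the JIT a chance to compile the hot path.
        for (int i = 0; i != 2000; i++) {
            timePass(pattern, urls);
        }
        System.out.println("warm pass: " + timePass(pattern, urls) / 1000 + " us");
    }
}
```

Running the smallest input last, or discarding the first few passes,
is the usual way to keep this start-up cost out of the comparison.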
</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 2.1: extension point ParseFilter: doc is null</title>
<author><name>Martin Aesch &lt;martin.aesch@googlemail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c1369344520.6199.7.camel@senf.dw.privat%3e"/>
<id>urn:uuid:%3c1369344520-6199-7-camel@senf-dw-privat%3e</id>
<updated>2013-05-23T21:28:40Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Dear nutchers,&#010;&#010;I extended the ParseFilter extension point&#010;&#010;public Parse filter(String url, WebPage page, Parse parse,&#010;    HTMLMetaTags metaTags, DocumentFragment doc) {&#010;&#010;From what I understand, plugin parse-html should populate the&#010;DocumentFragment doc.&#010;&#010;Unfortunately, doc is always null. I tried this with my own plugin, as&#010;well as with the nutch-shipped plugin microformats-reltag, which extends&#010;the same point.&#010;&#010;Both plugins are working, and they are called. I attached my debugger,&#010;and both for my own plugin as well as for the reltag-plugin, doc is&#010;always null.&#010;&#010;I checked parse-plugins.xml, yes, parse-html is called and my mime-types&#010;are those which call parse-html&#010;(extension-id="org.apache.nutch.parse.html.HtmlParser").&#010;&#010;What am I missing?&#010;&#010;Thanks,&#010;Martin&#010;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 pdf parsing</title>
<author><name>Adriana Farina &lt;adriana.farina23@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqBQ2DgTb0_2Qu5vNwALx1CWCC0Q4Tc72k+kOwKZih+1mUg6w@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqBQ2DgTb0_2Qu5vNwALx1CWCC0Q4Tc72k+kOwKZih+1mUg6w@mail-gmail-com%3e</id>
<updated>2013-05-23T20:31:48Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Lewis,&#010;&#010;thank you very much. I will try your solution.&#010;&#010;&#010;2013/5/23 Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;&#010;&#010;&gt; Hi Adriana,&#010;&gt; If I were you I would switch your logging to DEBUG for the ParserJob&#010;&gt;&#010;&gt; - log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout&#010;&gt; + log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout&#010;&gt;&#010;&gt;&#010;&gt; recompile the code, then look closely at the parse chunk of the log to see&#010;&gt; what parser is being used, and if there are any particular issues flagged&#010;&gt; up @runtime.&#010;&gt;&#010;&gt;&#010;&gt; On Thu, May 23, 2013 at 8:14 AM, Adriana Farina&#010;&gt; &lt;adriana.farina23@gmail.com&gt;wrote:&#010;&gt;&#010;&gt; &gt; Hi,&#010;&gt; &gt;&#010;&gt; &gt; I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with&#010;&gt; HBase&#010;&gt; &gt; 0.90.4 as database.&#010;&gt; &gt;&#010;&gt; &gt; I wrote a Java class from which I run the crawling cycle, the code that&#010;&gt; &gt; implements the crawling cycle is the following:&#010;&gt; &gt;&#010;&gt; &gt;                   for (int i = 0; i &lt; depth; i++) {&#010;&gt; &gt; batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),&#010;&gt; &gt; System.currentTimeMillis(), false, false);&#010;&gt; &gt; fetcher.fetch(batchid, 1, false, -1);&#010;&gt; &gt; parser.parse(batchid, false, true);&#010;&gt; &gt; updater.run(new String[0]);&#010;&gt; &gt;   }&#010;&gt; &gt;&#010;&gt; &gt; The problem is that I'm not able to parse the pdf files, inside HBase I&#010;&gt; got&#010;&gt; &gt; no pdf content. 
The strange thing is that I got one row with the&#010;&gt; following&#010;&gt; &gt; content: column=p:parsestat, timestamp=1369316742871,&#010;&gt; &gt; value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException:&#010;&gt; Unable&#010;&gt; &gt; to successfully parse content\x00.&#010;&gt; &gt;&#010;&gt; &gt; It seems to me that I have configured all nutch property files correctly.&#010;&gt; &gt; Can anybody help me?&#010;&gt; &gt;&#010;&gt; &gt; Thank you very much.&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; --&#010;&gt; &gt; Adriana Farina&#010;&gt; &gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; --&#010;&gt; *Lewis*&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;Adriana Farina&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>alxsss@aim.com</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c8D0260982904A27-4C4-E169@webmail-d254.sysops.aol.com%3e"/>
<id>urn:uuid:%3c8D0260982904A27-4C4-E169@webmail-d254-sysops-aol-com%3e</id>
<updated>2013-05-23T20:16:15Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I do not think that script works in nutch-2.x.&#010;For example, I see this:&#010;$bin/nutch generate $commonOptions $CRAWL_ID/crawldb $CRAWL_ID/segments -topN $sizeFetchlist&#010;-numFetchers $numSlaves -noFilter&#010;&#010;There is no crawldb or segments in nutch-2.x.&#010;&#010;When you use a crawlid in the inject command, it creates a crawlid_webpage table in hbase, and when&#010;you use generate, fetch, etc., it queries the webpage table, which does not exist.&#010;&#010;Alex.&#010;&#010; &#010;&#010; &#010;&#010;-----Original Message-----&#010;From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;To: user &lt;user@nutch.apache.org&gt;&#010;Sent: Wed, May 22, 2013 6:23 pm&#010;Subject: Re: error crawling&#010;&#010;&#010;I'm trying to crawl. I'm just running the script that I pulled from the&#010;nutch site, so I assumed that it would be good to go, like the old&#010;runbot.sh script. I could try removing that part, but I still get the error&#010;farther down in the main body of the loop.&#010;&#010;-- Christopher Gross&#010;Sent from my nexus 7&#010;On May 22, 2013 4:40 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&#010;&gt; what are you trying to achieve? What is the reason running inject with a&#010;&gt; crawlId?&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; -----Original Message-----&#010;&gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt; Sent: Wed, May 22, 2013 12:25 pm&#010;&gt; Subject: Re: error crawling&#010;&gt;&#010;&gt;&#010;&gt; Sure, I'll try.  
I'm also confused about this -- I had it working at one&#010;&gt; point, and it stopped working after migrating to a new box (copied&#010;&gt; everything over but cleared out the HBase).&#010;&gt;&#010;&gt; My hadoop.log for today has:&#010;&gt; store.HBaseStore - Keyclass and nameclass match but mismatching table&#010;&gt; names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;&gt; assuming they are the same.&#010;&gt;&#010;&gt; I have nothing in a config file for a "crawl_webpage".  I ran:&#010;&gt; grep crawl_webpage *&#010;&gt; and got nothing.&#010;&gt; Running:&#010;&gt; grep webpage *&#010;&gt; gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,&#010;&gt; as well as the nutch-default.xml file.&#010;&gt; nutch-default.xml has a "storage.schema.webpage" which has a value of&#010;&gt; "webpage".&#010;&gt;&#010;&gt; Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for&#010;&gt; whatever reason, that is the table that nutch is making is that CRAWL_ID +&#010;&gt; _ + "webpage".&#010;&gt;&#010;&gt; I tried making the gora mapping file use crawl_webpage but then I ended up&#010;&gt; with some crawl_crawl_webpage error messages, so I cleared out the HBase&#010;&gt; (again) and rolled back the file.&#010;&gt;&#010;&gt; Perhaps I'm running on an older one, can you point me in the right&#010;&gt; direction for getting that "crawl" script that replaces the 1.x "runbot.sh"&#010;&gt; script?&#010;&gt;&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&gt;&#010;&gt; On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney &lt;&#010;&gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt;&#010;&gt; &gt; Hi Chris,&#010;&gt; &gt;&#010;&gt; &gt; On Mon, May 20, 2013 at 10:21 AM, Christopher Gross &lt;cogross@gmail.com&#010;&gt; &gt; &gt;wrote:&#010;&gt; &gt;&#010;&gt; &gt; &gt; Lewis --&#010;&gt; &gt; &gt; Is the DEBUG something set in the conf/log4j.properties file?  
I have&#010;&gt; the&#010;&gt; &gt; &gt; rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else&#010;&gt; is&#010;&gt; &gt; &gt; INFO or WARN (no DEBUGs to be found.)&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; Well yes you can set it in the log4j.properties file, however if you are&#010;&gt; &gt; working with anything older than 2.x HEAD then by default the logging is&#010;&gt; &gt; hardcoded as INFO.&#010;&gt; &gt; The DEBUG logging was implemented as of NUTCH-1496 and is now built into&#010;&gt; &gt; 2.x HEAD. An example can be seen here&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&amp;r2=1408271&#010;&gt; &gt;&#010;&gt; &gt; BTW here is the HBase thread which I referred to before&#010;&gt; &gt; http://www.mail-archive.com/user@nutch.apache.org/msg09245.html&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; &gt; I'm still a bit lost on what I need to do for the gora-hbase portion.&#010;&gt;  My&#010;&gt; &gt; &gt; gora-hbase-mapping.xml is unchanged.  
Also, from the nutch-default.xml&#010;&gt; &gt; &gt; file:&#010;&gt; &gt; &gt; &lt;property&gt;&#010;&gt; &gt; &gt;   &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;&gt; &gt; &gt;   &lt;value&gt;webpage&lt;/value&gt;&#010;&gt; &gt; &gt;   &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;&gt; &gt; &gt;   Note that Nutch ignores the value in the gora mapping files, and uses&#010;&gt; &gt; &gt;   this as the webpage schema name.&#010;&gt; &gt; &gt;   &lt;/description&gt;&#010;&gt; &gt; &gt; &lt;/property&gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; So that would lead me to believe that the gora file is just ignored.&#010;&gt; &gt; &gt; If I have the "crawlId" set to "crawlId" -- where do I need to tell&#010;&gt; nutch&#010;&gt; &gt; &gt; to look in the hbase for the "crawlId_webpage"?&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; I am unsure as to what your problem is here Chris. Can you please try&#010;&gt; to&#010;&gt; &gt; explain it in layman's terms for me so I understand what problem you are&#010;&gt; &gt; facing?&#010;&gt; &gt; Thanks&#010;&gt; &gt; Lewis&#010;&gt; &gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&#010; &#010;&#010;
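The table naming discussed in this thread (crawl id, then an
underscore, then the schema name) can be sketched as follows. This is
only an illustration of the behaviour reported above, not the actual
Nutch code: the class and method names are hypothetical, and leaving
the schema name unchanged for an empty crawl id is an assumption.

```java
public class WebPageTableName {

    // Hypothetical helper: derives the table name that a job would look up.
    static String tableName(String crawlId, String schemaName) {
        // Assumption: without a crawl id the plain schema name is used.
        if (crawlId == null || crawlId.isEmpty()) {
            return schemaName;
        }
        return crawlId + "_" + schemaName;
    }

    public static void main(String[] args) {
        // inject with a crawl id of "crawl" and the default schema name
        System.out.println(tableName("crawl", "webpage"));
        // a later step run without a crawl id looks for a different table
        System.out.println(tableName("", "webpage"));
        // renaming the schema in the mapping file doubles the prefix
        System.out.println(tableName("crawl", "crawl_webpage"));
    }
}
```

Under these assumptions, every step of the cycle has to be given the
same crawl id, otherwise it queries a table that inject never created.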
</pre>
</div>
</content>
</entry>
<entry>
<title>Explanation of RegexURLFilterBaseTest benchmarks</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif3CiHMvoo2UteHX7Mh6DVwP8fHsU0Gp=oKw9Jgd2nbjyQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif3CiHMvoo2UteHX7Mh6DVwP8fHsU0Gp=oKw9Jgd2nbjyQ@mail-gmail-com%3e</id>
<updated>2013-05-23T19:57:04Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi All,&#010;A really nice aspect of the regex (urlfilter-automaton and urlfilter-regex)&#010;plugin implementations in Nutch is that there is a small but very useful&#010;RegexURLFilterBaseTest [0] which compares benchmarks for simple regex&#010;parsing.&#010;The results we get are as follows:&#010;&#010;urls    automaton    regex&#010;50      343ms        210ms&#010;100     48ms         187ms&#010;200     65ms         363ms&#010;400     100ms        692ms&#010;800     165ms        1385ms&#010;&#010;The problem I have here is understanding why the first (50) bench appears&#010;to be more expensive for both implementations.&#010;Additionally, why does this same bench cost much more for automaton?&#010;&#010;Anyone have a clue?&#010;Thanks&#010;Lewis&#010;&#010;[0]&#010;http://svn.apache.org/viewvc/nutch/branches/2.x/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java?view=markup&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 pdf parsing</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif2hdSSyrsZ21dc2=+_AR34YhXm46Tf8Xd5wPzPjQhjvMw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif2hdSSyrsZ21dc2=+_AR34YhXm46Tf8Xd5wPzPjQhjvMw@mail-gmail-com%3e</id>
<updated>2013-05-23T18:09:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Adriana,&#010;If I were you I would switch your logging to DEBUG for the ParserJob&#010;&#010;- log4j.logger.org.apache.nutch.parse.ParserJob=INFO,cmdstdout&#010;+ log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout&#010;&#010;&#010;recompile the code, then look closely at the parse chunk of the log to see&#010;what parser is being used, and if there are any particular issues flagged&#010;up @runtime.&#010;&#010;&#010;On Thu, May 23, 2013 at 8:14 AM, Adriana Farina&#010;&lt;adriana.farina23@gmail.com&gt;wrote:&#010;&#010;&gt; Hi,&#010;&gt;&#010;&gt; I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with HBase&#010;&gt; 0.90.4 as database.&#010;&gt;&#010;&gt; I wrote a Java class from which I run the crawling cycle, the code that&#010;&gt; implements the crawling cycle is the following:&#010;&gt;&#010;&gt;                   for (int i = 0; i &lt; depth; i++) {&#010;&gt; batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),&#010;&gt; System.currentTimeMillis(), false, false);&#010;&gt; fetcher.fetch(batchid, 1, false, -1);&#010;&gt; parser.parse(batchid, false, true);&#010;&gt; updater.run(new String[0]);&#010;&gt;   }&#010;&gt;&#010;&gt; The problem is that I'm not able to parse the pdf files, inside HBase I got&#010;&gt; no pdf content. The strange thing is that I got one row with the following&#010;&gt; content: column=p:parsestat, timestamp=1369316742871,&#010;&gt; value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException: Unable&#010;&gt; to successfully parse content\x00.&#010;&gt;&#010;&gt; It seems to me that I have configured all nutch property files correctly.&#010;&gt; Can anybody help me?&#010;&gt;&#010;&gt; Thank you very much.&#010;&gt;&#010;&gt;&#010;&gt; --&#010;&gt; Adriana Farina&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 2.1 pdf parsing</title>
<author><name>Adriana Farina &lt;adriana.farina23@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqBQ2BLK0Tatf1K34b-ai2jj0FZWqPgbMeoC5r6DsXH3=6otw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqBQ2BLK0Tatf1K34b-ai2jj0FZWqPgbMeoC5r6DsXH3=6otw@mail-gmail-com%3e</id>
<updated>2013-05-23T15:14:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,&#010;&#010;I'm using Nutch 2.1 in distributed mode on top of Hadoop 1.0.4, with HBase&#010;0.90.4 as the database.&#010;&#010;I wrote a Java class from which I run the crawling cycle. The code that&#010;implements the cycle is the following:&#010;&#010;for (int i = 0; i &lt; depth; i++) {&#010;  batchid = generator.generate((Long) args.get(Nutch.ARG_TOPN),&#010;      System.currentTimeMillis(), false, false);&#010;  fetcher.fetch(batchid, 1, false, -1);&#010;  parser.parse(batchid, false, true);&#010;  updater.run(new String[0]);&#010;}&#010;&#010;The problem is that I'm not able to parse PDF files: I get no PDF content&#010;inside HBase. The strange thing is that I got one row with the following&#010;content: column=p:parsestat, timestamp=1369316742871,&#010;value=\x04\x90\x03\x02\x96\x01org.apache.nutch.parse.ParseException: Unable&#010;to successfully parse content\x00.&#010;&#010;It seems to me that I have configured all Nutch property files correctly.&#010;Can anybody help me?&#010;&#010;Thank you very much.&#010;&#010;&#010;-- &#010;Adriana Farina&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 - Unauthorized</title>
<author><name>Tobias Marx &lt;tmarx@uni-wuppertal.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c6F2026DE-BFF5-44FD-9429-C06964AD2F5D@uni-wuppertal.de%3e"/>
<id>urn:uuid:%3c6F2026DE-BFF5-44FD-9429-C06964AD2F5D@uni-wuppertal-de%3e</id>
<updated>2013-05-23T09:56:40Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,&#010;&#010;I think he is referring to this issue:&#010;&#010;https://issues.apache.org/jira/browse/NUTCH-1575&#010;&#010;BR,&#010;Tobias&#010;&#010;Am 22.05.2013 um 18:14 schrieb Lewis John Mcgibbney:&#010;&#010;&gt; Hi Feng,&#010;&gt; Where is the patch please?&#010;&gt; Thank you very much&#010;&gt; Lewis&#010;&gt; &#010;&gt; On Wednesday, May 22, 2013, feng lu &lt;amuseme.lu@gmail.com&gt; wrote:&#010;&gt;&gt; Hi Daniel&#010;&gt;&gt; &#010;&gt;&gt; Now Nutch 2.x can not support solr authentication, I have already open an&#010;&gt;&gt; issue and add a patch , you can patch this and try again.&#010;&gt;&gt; &#010;&gt;&gt; Thanks&#010;&gt;&gt; &#010;&gt;&gt; &#010;&gt;&gt; On Wed, May 22, 2013 at 7:34 PM, Daniel Hüsch &lt;huesch@uni-wuppertal.de&#010;&gt;&gt; wrote:&#010;&gt;&gt; &#010;&gt;&gt;&gt; Hi,&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; we used Nutch 1.x with an authentication for tomcat.&#010;&gt;&gt;&gt; We used this:&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; &lt;role rolename="solr_admin"/&gt;&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; &lt;user username="USER"&#010;&gt;&gt;&gt;       password="PASSWORD"&#010;&gt;&gt;&gt;       roles="solr_admin"/&gt;&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; in /conf/tomcat-users.xml&#010;&gt;&gt;&gt; And in configuration of Nutch 1.x we used "solr.auth" for the connection&#010;&gt;&gt;&gt; between nutch and solr.&#010;&gt;&gt;&gt; But this function (solr.auth) is not available in Nutch 2.1.&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; So we got an error on hadoop.log at indexing:&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; 2013-05-22 12:50:26,581 ERROR solr.SolrIndexerJob - SolrIndexerJob:&#010;&gt;&gt;&gt; org.apache.solr.common.**SolrException: Unauthorized&#010;&gt;&gt;&gt; Unauthorized&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; How can we indexing with an authentication and Nutch 2.1?&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; Thanks,&#010;&gt;&gt;&gt; &#010;&gt;&gt;&gt; Daniel&#010;&gt;&gt;&gt; &#010;&gt;&gt; &#010;&gt;&gt; &#010;&gt;&gt; 
&#010;&gt;&gt; --&#010;&gt;&gt; Don't Grow Old, Grow Up... :-)&#010;&gt;&gt; &#010;&gt; &#010;&gt; -- &#010;&gt; *Lewis*&#010;&#010;-- &#010;Tobias Marx&#010;&#010;Zentrum für Informations- und Medienverarbeitung - ZIM&#010;&#010;Bergische Universität Wuppertal&#010;&#010;Büro: T.11.08&#010;++49 202 439 2237&#010;tmarx@uni-wuppertal.de&#010;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>OutOfMemoryError for bin/nutch elasticindex ocpnutch -all</title>
<author><name>Nicholas W &lt;4407@log1.net&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAL6CHUEXkuK=UtoxyqFfwO=YU+a5xx8zp0sfrnezrDKisi+W0g@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAL6CHUEXkuK=UtoxyqFfwO=YU+a5xx8zp0sfrnezrDKisi+W0g@mail-gmail-com%3e</id>
<updated>2013-05-23T08:47:50Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Dear List,&#010; I have been following the instructions at&#010;http://wiki.apache.org/nutch/Nutch2Tutorial to see if I can get a nutch&#010;installation running with ElasticSearch. I have successfully done a crawl&#010;with no real issues, but then when I try and load the results into&#010;elasticsearch I run into trouble.&#010;&#010;I issue the command:&#010;&#010;bin/nutch elasticindex ocpnutch -all&#010;And it waits around for a long time and then comes back with an error:&#010;Exception in thread "main" java.lang.RuntimeException: job failed:&#010;name=elastic-index [ocpnutch], jobid=job_local_0001&#010;&#010;If I look in the logs at:&#010;&#010;~/apache-nutch-2.1/runtime/local/logs/hadoop.log&#010;I see several errors like this:&#010;Exception caught on netty layer [[id: 0x569764bd, /192.168.17.39:52554 =&gt; /&#010;192.168.17.60:9300]]&#010;java.lang.OutOfMemoryError: Java heap space&#010;&#010;There is nothing in the logs on the elastic search.&#010;&#010;I have tried changing:&#010;elastic.max.bulk.docs and elastic.max.bulk.size to small sizes&#010;and allocating large amounts of GB to nutch, but to no avail.&#010;&#010;The jvm is:&#010;Java(TM) SE Runtime Environment (build 1.7.0_21-b11)&#010;&#010;Does anyone have any idea what I am doing wrong - what other diagnostic&#010;information would be helpful to solve this problem?&#010;&#010;Thanks a lot,&#010;Regards,&#010;Nicholas W.&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Christopher Gross &lt;cogross@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqyJJBTn2WEvw837hNPieSXGxPBUFDSnTqrr-vh-eVLMPxnVQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqyJJBTn2WEvw837hNPieSXGxPBUFDSnTqrr-vh-eVLMPxnVQ@mail-gmail-com%3e</id>
<updated>2013-05-23T01:22:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I'm trying to crawl. I'm just running the script that I pulled from the&#010;nutch site, so I assumed that it would be good to go, like the old&#010;runbot.sh script. I could try removing that part, but I still get the error&#010;farther down in the main body of the loop.&#010;&#010;-- Christopher Gross&#010;Sent from my nexus 7&#010;On May 22, 2013 4:40 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&#010;&gt; what are you trying to achieve? What is the reason running inject with a&#010;&gt; crawlIId?&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; -----Original Message-----&#010;&gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt; Sent: Wed, May 22, 2013 12:25 pm&#010;&gt; Subject: Re: error crawling&#010;&gt;&#010;&gt;&#010;&gt; Sure, I'll try.  I'm also confused about this -- I had it working at one&#010;&gt; point, and it stopped working after migrating to a new box (copied&#010;&gt; everything over but cleared out the HBase).&#010;&gt;&#010;&gt; My hadoop.log for today has:&#010;&gt; store.HBaseStore - Keyclass and nameclass match but mismatching table&#010;&gt; names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;&gt; assuming they are the same.&#010;&gt;&#010;&gt; I have nothing in a config file for a "crawl_webpage".  
I ran:&#010;&gt; grep crawl_webpage *&#010;&gt; and got nothing.&#010;&gt; Running:&#010;&gt; grep webpage *&#010;&gt; gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,&#010;&gt; as well as the nutch-default.xml file.&#010;&gt; nutch-default.xml has a "storage.schema.webpage" which has a value of&#010;&gt; "webpage".&#010;&gt;&#010;&gt; Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for&#010;&gt; whatever reason, that is the table that nutch is making is that CRAWL_ID +&#010;&gt; _ + "webpage".&#010;&gt;&#010;&gt; I tried making the gora mapping file use crawl_webpage but then I ended up&#010;&gt; with some crawl_crawl_webpage error messages, so I cleared out the HBase&#010;&gt; (again) and rolled back the file.&#010;&gt;&#010;&gt; Perhaps I'm running on an older one, can you point me in the right&#010;&gt; direction for getting that "crawl" script that replaces the 1.x "runbot.sh"&#010;&gt; script?&#010;&gt;&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&gt;&#010;&gt; On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney &lt;&#010;&gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt;&#010;&gt; &gt; Hi Chris,&#010;&gt; &gt;&#010;&gt; &gt; On Mon, May 20, 2013 at 10:21 AM, Christopher Gross &lt;cogross@gmail.com&#010;&gt; &gt; &gt;wrote:&#010;&gt; &gt;&#010;&gt; &gt; &gt; Lewis --&#010;&gt; &gt; &gt; Is the DEBUG something set in the conf/log4j.properties file?  I have&#010;&gt; the&#010;&gt; &gt; &gt; rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else&#010;&gt; is&#010;&gt; &gt; &gt; INFO or WARN (no DEBUGs to be found.)&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; Well yes you can set it in the log4j.properties file, however if you are&#010;&gt; &gt; working with anything older than 2.x HEAD then by default the logging is&#010;&gt; &gt; hardcoded as INFO.&#010;&gt; &gt; The DEBUG logging was implemented as of NUTCH-1496 and is now built into&#010;&gt; &gt; 2.x HEAD. 
An example can be seen here&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&amp;r2=1408271&#010;&gt; &gt;&#010;&gt; &gt; BTW here is the HBase thread which I referred to before&#010;&gt; &gt; http://www.mail-archive.com/user@nutch.apache.org/msg09245.html&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; &gt; I'm still a bit lost on what I need to do for the gora-hbase portion.&#010;&gt;  My&#010;&gt; &gt; &gt; gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml&#010;&gt; &gt; &gt; file:&#010;&gt; &gt; &gt; &lt;property&gt;&#010;&gt; &gt; &gt;   &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;&gt; &gt; &gt;   &lt;value&gt;webpage&lt;/value&gt;&#010;&gt; &gt; &gt;   &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;&gt; &gt; &gt;   Note that Nutch ignores the value in the gora mapping files, and uses&#010;&gt; &gt; &gt;   this as the webpage schema name.&#010;&gt; &gt; &gt;   &lt;/description&gt;&#010;&gt; &gt; &gt; &lt;/property&gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; So that would lead me to believe that the gora file is just ignored.&#010;&gt; &gt; &gt; If I have the "crawlId" set to "crawlId" -- where do I need to tell&#010;&gt; nutch&#010;&gt; &gt; &gt; to look in the hbase for the "crawlId_webpage"?&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; I am unsure as to what your problem is here Chris. Can you please try&#010;&gt; to&#010;&gt; &gt; explain it in layman's terms for me so I understand what problem you are&#010;&gt; &gt; facing?&#010;&gt; &gt; Thanks&#010;&gt; &gt; Lewis&#010;&gt; &gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>alxsss@aim.com</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c8D02543A8216213-4C4-5A0D@webmail-d254.sysops.aol.com%3e"/>
<id>urn:uuid:%3c8D02543A8216213-4C4-5A0D@webmail-d254-sysops-aol-com%3e</id>
<updated>2013-05-22T20:39:57Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
what are you trying to achieve? What is the reason running inject with a crawlIId?&#010; &#010;&#010; &#010;&#010; &#010;&#010;-----Original Message-----&#010;From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;To: user &lt;user@nutch.apache.org&gt;&#010;Sent: Wed, May 22, 2013 12:25 pm&#010;Subject: Re: error crawling&#010;&#010;&#010;Sure, I'll try.  I'm also confused about this -- I had it working at one&#010;point, and it stopped working after migrating to a new box (copied&#010;everything over but cleared out the HBase).&#010;&#010;My hadoop.log for today has:&#010;store.HBaseStore - Keyclass and nameclass match but mismatching table&#010;names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;assuming they are the same.&#010;&#010;I have nothing in a config file for a "crawl_webpage".  I ran:&#010;grep crawl_webpage *&#010;and got nothing.&#010;Running:&#010;grep webpage *&#010;gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,&#010;as well as the nutch-default.xml file.&#010;nutch-default.xml has a "storage.schema.webpage" which has a value of&#010;"webpage".&#010;&#010;Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for&#010;whatever reason, that is the table that nutch is making is that CRAWL_ID +&#010;_ + "webpage".&#010;&#010;I tried making the gora mapping file use crawl_webpage but then I ended up&#010;with some crawl_crawl_webpage error messages, so I cleared out the HBase&#010;(again) and rolled back the file.&#010;&#010;Perhaps I'm running on an older one, can you point me in the right&#010;direction for getting that "crawl" script that replaces the 1.x "runbot.sh"&#010;script?&#010;&#010;&#010;-- Chris&#010;&#010;&#010;On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; Hi Chris,&#010;&gt;&#010;&gt; On Mon, May 20, 2013 at 10:21 AM, Christopher Gross &lt;cogross@gmail.com&#010;&gt; &gt;wrote:&#010;&gt;&#010;&gt; &gt; 
Lewis --&#010;&gt; &gt; Is the DEBUG something set in the conf/log4j.properties file?  I have the&#010;&gt; &gt; rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else is&#010;&gt; &gt; INFO or WARN (no DEBUGs to be found.)&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; Well yes you can set it in the log4j.properties file, however if you are&#010;&gt; working with anything older than 2.x HEAD then by default the logging is&#010;&gt; hardcoded as INFO.&#010;&gt; The DEBUG logging was implemented as of NUTCH-1496 and is now built into&#010;&gt; 2.x HEAD. An example can be seen here&#010;&gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&amp;r2=1408271&#010;&gt;&#010;&gt; BTW here is the HBase thread which I referred to before&#010;&gt; http://www.mail-archive.com/user@nutch.apache.org/msg09245.html&#010;&gt;&#010;&gt;&#010;&gt; &gt; I'm still a bit lost on what I need to do for the gora-hbase portion.  My&#010;&gt; &gt; gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml&#010;&gt; &gt; file:&#010;&gt; &gt; &lt;property&gt;&#010;&gt; &gt;   &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;&gt; &gt;   &lt;value&gt;webpage&lt;/value&gt;&#010;&gt; &gt;   &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;&gt; &gt;   Note that Nutch ignores the value in the gora mapping files, and uses&#010;&gt; &gt;   this as the webpage schema name.&#010;&gt; &gt;   &lt;/description&gt;&#010;&gt; &gt; &lt;/property&gt;&#010;&gt; &gt;&#010;&gt; &gt; So that would lead me to believe that the gora file is just ignored.&#010;&gt; &gt; If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch&#010;&gt; &gt; to look in the hbase for the "crawlId_webpage"?&#010;&gt; &gt;&#010;&gt; &gt; I am unsure as to what your problem is here Chris. 
Can you please try to&#010;&gt; explain it in layman's terms for me so I understand what problem you are&#010;&gt; facing?&#010;&gt; Thanks&#010;&gt; Lewis&#010;&gt;&#010;&#010; &#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Christopher Gross &lt;cogross@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqyJJB2CEQ-s1ZUrgcbOv63O-+C8KvSfmgma60vhyvUMZFA0g@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqyJJB2CEQ-s1ZUrgcbOv63O-+C8KvSfmgma60vhyvUMZFA0g@mail-gmail-com%3e</id>
<updated>2013-05-22T19:17:14Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Sure, I'll try.  I'm also confused about this -- I had it working at one&#010;point, and it stopped working after migrating to a new box (copied&#010;everything over but cleared out the HBase).&#010;&#010;My hadoop.log for today has:&#010;store.HBaseStore - Keyclass and nameclass match but mismatching table&#010;names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;assuming they are the same.&#010;&#010;I have nothing in a config file for a "crawl_webpage".  I ran:&#010;grep crawl_webpage *&#010;and got nothing.&#010;Running:&#010;grep webpage *&#010;gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,&#010;as well as the nutch-default.xml file.&#010;nutch-default.xml has a "storage.schema.webpage" which has a value of&#010;"webpage".&#010;&#010;Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for&#010;whatever reason, that is the table that nutch is making is that CRAWL_ID +&#010;_ + "webpage".&#010;&#010;I tried making the gora mapping file use crawl_webpage but then I ended up&#010;with some crawl_crawl_webpage error messages, so I cleared out the HBase&#010;(again) and rolled back the file.&#010;&#010;Perhaps I'm running on an older one, can you point me in the right&#010;direction for getting that "crawl" script that replaces the 1.x "runbot.sh"&#010;script?&#010;&#010;&#010;-- Chris&#010;&#010;&#010;On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; Hi Chris,&#010;&gt;&#010;&gt; On Mon, May 20, 2013 at 10:21 AM, Christopher Gross &lt;cogross@gmail.com&#010;&gt; &gt;wrote:&#010;&gt;&#010;&gt; &gt; Lewis --&#010;&gt; &gt; Is the DEBUG something set in the conf/log4j.properties file?  I have the&#010;&gt; &gt; rootLogger set to INFO,DRFA and the threshold is ALL.  
Everything else is&#010;&gt; &gt; INFO or WARN (no DEBUGs to be found.)&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; Well yes you can set it in the log4j.properties file, however if you are&#010;&gt; working with anything older than 2.x HEAD then by default the logging is&#010;&gt; hardcoded as INFO.&#010;&gt; The DEBUG logging was implemented as of NUTCH-1496 and is now built into&#010;&gt; 2.x HEAD. An example can be seen here&#010;&gt;&#010;&gt; http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&amp;r2=1408271&#010;&gt;&#010;&gt; BTW here is the HBase thread which I referred to before&#010;&gt; http://www.mail-archive.com/user@nutch.apache.org/msg09245.html&#010;&gt;&#010;&gt;&#010;&gt; &gt; I'm still a bit lost on what I need to do for the gora-hbase portion.  My&#010;&gt; &gt; gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml&#010;&gt; &gt; file:&#010;&gt; &gt; &lt;property&gt;&#010;&gt; &gt;   &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;&gt; &gt;   &lt;value&gt;webpage&lt;/value&gt;&#010;&gt; &gt;   &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;&gt; &gt;   Note that Nutch ignores the value in the gora mapping files, and uses&#010;&gt; &gt;   this as the webpage schema name.&#010;&gt; &gt;   &lt;/description&gt;&#010;&gt; &gt; &lt;/property&gt;&#010;&gt; &gt;&#010;&gt; &gt; So that would lead me to believe that the gora file is just ignored.&#010;&gt; &gt; If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch&#010;&gt; &gt; to look in the hbase for the "crawlId_webpage"?&#010;&gt; &gt;&#010;&gt; &gt; I am unsure as to what your problem is here Chris. Can you please try to&#010;&gt; explain it in layman's terms for me so I understand what problem you are&#010;&gt; facing?&#010;&gt; Thanks&#010;&gt; Lewis&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 - Unauthorized</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif1oAP2692+sZ1Ejgy=MLn2W1U7dBF-MYqfZo60ZgeFoYA@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif1oAP2692+sZ1Ejgy=MLn2W1U7dBF-MYqfZo60ZgeFoYA@mail-gmail-com%3e</id>
<updated>2013-05-22T16:14:57Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Feng,&#010;Where is the patch please?&#010;Thank you very much&#010;Lewis&#010;&#010;On Wednesday, May 22, 2013, feng lu &lt;amuseme.lu@gmail.com&gt; wrote:&#010;&gt; Hi Daniel&#010;&gt;&#010;&gt; Now Nutch 2.x can not support solr authentication, I have already open an&#010;&gt; issue and add a patch , you can patch this and try again.&#010;&gt;&#010;&gt; Thanks&#010;&gt;&#010;&gt;&#010;&gt; On Wed, May 22, 2013 at 7:34 PM, Daniel Hüsch &lt;huesch@uni-wuppertal.de&#010;&gt;wrote:&#010;&gt;&#010;&gt;&gt; Hi,&#010;&gt;&gt;&#010;&gt;&gt; we used Nutch 1.x with an authentication for tomcat.&#010;&gt;&gt; We used this:&#010;&gt;&gt;&#010;&gt;&gt; &lt;role rolename="solr_admin"/&gt;&#010;&gt;&gt;&#010;&gt;&gt;  &lt;user username="USER"&#010;&gt;&gt;        password="PASSWORD"&#010;&gt;&gt;        roles="solr_admin"/&gt;&#010;&gt;&gt;&#010;&gt;&gt; in /conf/tomcat-users.xml&#010;&gt;&gt; And in configuration of Nutch 1.x we used "solr.auth" for the connection&#010;&gt;&gt; between nutch and solr.&#010;&gt;&gt; But this function (solr.auth) is not available in Nutch 2.1.&#010;&gt;&gt;&#010;&gt;&gt; So we got an error on hadoop.log at indexing:&#010;&gt;&gt;&#010;&gt;&gt; 2013-05-22 12:50:26,581 ERROR solr.SolrIndexerJob - SolrIndexerJob:&#010;&gt;&gt; org.apache.solr.common.**SolrException: Unauthorized&#010;&gt;&gt; Unauthorized&#010;&gt;&gt;&#010;&gt;&gt;&#010;&gt;&gt; How can we indexing with an authentication and Nutch 2.1?&#010;&gt;&gt;&#010;&gt;&gt; Thanks,&#010;&gt;&gt;&#010;&gt;&gt; Daniel&#010;&gt;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; --&#010;&gt; Don't Grow Old, Grow Up... :-)&#010;&gt;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 - Unauthorized</title>
<author><name>feng lu &lt;amuseme.lu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAOeWMMrfW_tC=qb1JZE3F8=E9MMDbWNr=sAyy=R2t8b2wKWqsg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAOeWMMrfW_tC=qb1JZE3F8=E9MMDbWNr=sAyy=R2t8b2wKWqsg@mail-gmail-com%3e</id>
<updated>2013-05-22T15:36:31Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Daniel,&#010;&#010;Nutch 2.x does not currently support Solr authentication. I have already&#010;opened an issue and added a patch; you can apply it and try again.&#010;&#010;Thanks&#010;&#010;&#010;On Wed, May 22, 2013 at 7:34 PM, Daniel Hüsch &lt;huesch@uni-wuppertal.de&gt; wrote:&#010;&#010;&gt; Hi,&#010;&gt;&#010;&gt; we used Nutch 1.x with an authentication for tomcat.&#010;&gt; We used this:&#010;&gt;&#010;&gt; &lt;role rolename="solr_admin"/&gt;&#010;&gt;&#010;&gt;  &lt;user username="USER"&#010;&gt;        password="PASSWORD"&#010;&gt;        roles="solr_admin"/&gt;&#010;&gt;&#010;&gt; in /conf/tomcat-users.xml&#010;&gt; And in configuration of Nutch 1.x we used "solr.auth" for the connection&#010;&gt; between nutch and solr.&#010;&gt; But this function (solr.auth) is not available in Nutch 2.1.&#010;&gt;&#010;&gt; So we got an error on hadoop.log at indexing:&#010;&gt;&#010;&gt; 2013-05-22 12:50:26,581 ERROR solr.SolrIndexerJob - SolrIndexerJob:&#010;&gt; org.apache.solr.common.SolrException: Unauthorized&#010;&gt; Unauthorized&#010;&gt;&#010;&gt;&#010;&gt; How can we indexing with an authentication and Nutch 2.1?&#010;&gt;&#010;&gt; Thanks,&#010;&gt;&#010;&gt; Daniel&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;Don't Grow Old, Grow Up... :-)&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 2.1 - Unauthorized</title>
<author><name>Daniel Hüsch &lt;huesch@uni-wuppertal.de&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c519CAD42.9060302@uni-wuppertal.de%3e"/>
<id>urn:uuid:%3c519CAD42-9060302@uni-wuppertal-de%3e</id>
<updated>2013-05-22T11:34:26Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,&#010;&#010;we used Nutch 1.x with authentication for Tomcat.&#010;We used this:&#010;&#010;&lt;role rolename="solr_admin"/&gt;&#010;&#010;&lt;user username="USER"&#010;      password="PASSWORD"&#010;      roles="solr_admin"/&gt;&#010;&#010;in /conf/tomcat-users.xml&#010;And in the configuration of Nutch 1.x we used "solr.auth" for the connection&#010;between Nutch and Solr.&#010;But this option (solr.auth) is not available in Nutch 2.1.&#010;&#010;So we got an error in hadoop.log at indexing:&#010;&#010;2013-05-22 12:50:26,581 ERROR solr.SolrIndexerJob - SolrIndexerJob:&#010;org.apache.solr.common.SolrException: Unauthorized&#010;Unauthorized&#010;&#010;&#010;How can we index with authentication in Nutch 2.1?&#010;&#010;Thanks,&#010;&#010;Daniel&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Using Nutch and Hive together</title>
<author><name>&quot;Yves S. Garret&quot; &lt;yoursurrogategod@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAJ=2b04bX9=vjn0awW+=T2ugj0q6E=YZFM0P7zdP3CwiPCgBGw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAJ=2b04bX9=vjn0awW+=T2ugj0q6E=YZFM0P7zdP3CwiPCgBGw@mail-gmail-com%3e</id>
<updated>2013-05-22T01:22:17Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I have another question.  As a side-solution (as in, something that&#010;needs to be done soon and can be quick and dirty), would it be possible&#010;to pipe the Nutch output somewhere and then using a cron job (or some&#010;timed process) to import it into Hive?  Has anyone done this?&#010;&#010;&#010;On Wed, May 1, 2013 at 11:16 PM, Renato Marroquín Mogrovejo &lt;&#010;renatoj.marroquin@gmail.com&gt; wrote:&#010;&#010;&gt; Hi Yves,&#010;&gt;&#010;&gt; Just get your head around hadoop and start playing around with it.&#010;&gt; Nutch is a great place to start getting familiarized with Gora.&#010;&gt; We will help you any time you need it and then you can help us push&#010;&gt; Gora forward (:&#010;&gt;&#010;&gt;&#010;&gt; Renato M.&#010;&gt;&#010;&gt; 2013/5/1 Yves S. Garret &lt;yoursurrogategod@gmail.com&gt;:&#010;&gt; &gt; Hi Renato,&#010;&gt; &gt;&#010;&gt; &gt; Sounds kinda fun :) .  I'll start reading here about Gora in&#010;&gt; &gt; order to better understand.  I'm also reading Hadoop: The&#010;&gt; &gt; Definitive Guide.&#010;&gt; &gt;&#010;&gt; &gt; http://gora.apache.org/&#010;&gt; &gt;&#010;&gt; &gt; Should I look into anything else in order to learn more about&#010;&gt; &gt; Gora?  I'll have to admit, I'm new to Hadoop and don't have&#010;&gt; &gt; the firmest grasp of the internals.  I'm not sure how useful&#010;&gt; &gt; I'll be at this moment.&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; On Tue, Apr 30, 2013 at 6:21 PM, Renato Marroquín Mogrovejo &lt;&#010;&gt; &gt; renatoj.marroquin@gmail.com&gt; wrote:&#010;&gt; &gt;&#010;&gt; &gt;&gt; Hi Yves,&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; Apache Gora does not support Apache Hive just yet, but we have it on&#010;&gt; &gt;&gt; our future plans. If you were willing to dive into an adventure with&#010;&gt; &gt;&gt; Gora we would be happy to help you out with that.&#010;&gt; &gt;&gt; There is a Pig-Gora adapter patch on JIRA, maybe you would like to&#010;&gt; &gt;&gt; give it a look? 
Although there is a bit of work involved in that one&#010;&gt; &gt;&gt; as well.&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; Renato M.&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; 2013/4/30 Yves S. Garret &lt;yoursurrogategod@gmail.com&gt;:&#010;&gt; &gt;&gt; &gt; Ok.  I think I understand.  So there's an adapter involved here.&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; On Tue, Apr 30, 2013 at 5:04 PM, Tejas Patil &lt;&#010;&gt; tejas.patil.cs@gmail.com&#010;&gt; &gt;&gt; &gt;wrote:&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&gt; Nutch 2.x series is built on Gora which offers storage abstraction.&#010;&gt; From&#010;&gt; &gt;&gt; &gt;&gt; the Gora project main page, I think gora has adapters for accessing&#010;&gt; the&#010;&gt; &gt;&gt; &gt;&gt; data and making analysis through Apache Hive but it wont support&#010;&gt; &gt;&gt; storing of&#010;&gt; &gt;&gt; &gt;&gt; nutch data into hive.&#010;&gt; &gt;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt;&gt; There are Gora experts on this group who can answer better.&#010;&gt; &gt;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt;&gt; On Tue, Apr 30, 2013 at 1:46 PM, Yves S. Garret&#010;&gt; &gt;&gt; &gt;&gt; &lt;yoursurrogategod@gmail.com&gt;wrote:&#010;&gt; &gt;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt;&gt; &gt; Hello,&#010;&gt; &gt;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&gt; &gt; I'm curious.  
If I wanted to store the URLs for Nutch (version&#010;&gt; 2.1) in&#010;&gt; &gt;&gt; &gt;&gt; Hive&#010;&gt; &gt;&gt; &gt;&gt; &gt; (version&#010;&gt; &gt;&gt; &gt;&gt; &gt; 0.9.0) and then store the output from Nutch in Hive, how would&#010;I do&#010;&gt; &gt;&gt; that?&#010;&gt; &gt;&gt; &gt;&gt; &gt; Any&#010;&gt; &gt;&gt; &gt;&gt; &gt; pointers?&#010;&gt; &gt;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&gt; &gt; I've googled for "nutch hive" (maybe there's a better term?),&#010;but&#010;&gt; &gt;&gt; haven't&#010;&gt; &gt;&gt; &gt;&gt; &gt; found&#010;&gt; &gt;&gt; &gt;&gt; &gt; anything specific or very helpful.  I'll keep looking and&#010;&gt; &gt;&gt; experimenting.&#010;&gt; &gt;&gt; &gt;&gt; &gt; Your help is&#010;&gt; &gt;&gt; &gt;&gt; &gt; appreciated.&#010;&gt; &gt;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&gt; &gt; --Yves&#010;&gt; &gt;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Nutch 2.1 generate: how to get multiple maps in deploy-mode</title>
<author><name>feng lu &lt;amuseme.lu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAOeWMMqOiOH5MHPxVDW88+C=J6sUfd-q7Nc7nin0g+0ncFdCSg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAOeWMMqOiOH5MHPxVDW88+C=J6sUfd-q7Nc7nin0g+0ncFdCSg@mail-gmail-com%3e</id>
<updated>2013-05-21T14:34:51Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Nutch 2.1 uses Apache Gora to access the Cassandra database, and Gora&#010;implements the Hadoop InputSplit interface to generate input splits from&#010;Cassandra. So you can look for documentation on the gora-cassandra model.&#010;&#010;&#010;&#010;On May 21, 2013 7:45 PM, "Martin Aesch" &lt;martin.aesch@googlemail.com&gt; wrote:&#010;&#010;&gt; Hi,&#010;&gt;&#010;&gt; I'm running nutch-2.1 (on top of cassandra) on a single-node hadoop&#010;&gt; "cluster". Sorry in case my question is noobish or more a hadoop issue.&#010;&gt;&#010;&gt; In short: How can I force nutch generate to provide a filesplitsize&gt;1&#010;&gt; which seems to be necessary to run multiple map jobs?&#010;&gt;&#010;&gt; I am seeing that only one input split is generated for nutch generate:&#010;&gt; For&#010;&gt; ./bin/nutch generate -topN 1000000 -noFilter&#010;&gt;&#010;&gt;&#010;&gt; 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:&#010;&gt; Input size for job job_201305211335_0001 = 0. Number of splits = 1&#010;&gt; 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:&#010;&gt; job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0&#010;&gt; 2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job&#010;&gt; job_201305211335_0001 initialized successfully with 1 map tasks and 2&#010;&gt; reduce tasks.&#010;&gt; 2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding&#010;&gt; task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip&#010;&gt; task_201305211335_0001_m_000002, for tracker&#010;&gt; 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'&#010;&gt; 2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:&#010;&gt; Task 'attempt_201305211335_0001_m_000002_0' has completed&#010;&gt; task_201305211335_0001_m_000002 successfully.&#010;&gt; 2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:&#010;&gt; Choosing a non-local task task_201305211335_0001_m_000000&#010;&gt; 2013-05-21 
13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding&#010;&gt; task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip&#010;&gt; task_201305211335_0001_m_000000, for tracker&#010;&gt; 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'&#010;&gt;&#010;&gt;&#010;&gt; System load did not reach its maximum by far, neither in terms of CPU&#010;&gt; nor in terms of i/o-waiting. It took 100 minutes, which seems very fair&#010;&gt; for one single map, since I have about 50M webpages in my database.&#010;&gt; Jobtracker says max map tasks is 2 and max reduce tasks is 2, but only 1&#010;&gt; map task is running.&#010;&gt;&#010;&gt; This is my mapred-site.xml, which should be ok, in particular I&#010;&gt; overwrite mapred.job.tracker not to be "local":&#010;&gt;&#010;&gt; &lt;?xml version="1.0"?&gt;&#010;&gt; &lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;&#010;&gt; &lt;!-- Put site-specific property overrides in this file. --&gt;&#010;&gt;&#010;&gt; &lt;configuration&gt;&#010;&gt;      &lt;property&gt;&#010;&gt;          &lt;name&gt;mapred.job.tracker&lt;/name&gt;&#010;&gt;          &lt;value&gt;localhost:9001&lt;/value&gt;&#010;&gt;      &lt;/property&gt;&#010;&gt;      &lt;property&gt;&#010;&gt;         &lt;name&gt;mapreduce.jobtracker.staging.root.dir&lt;/name&gt;&#010;&gt;         &lt;value&gt;/user&lt;/value&gt;&#010;&gt;      &lt;/property&gt;&#010;&gt;      &lt;property&gt;&#010;&gt;         &lt;name&gt;mapred.map.tasks&lt;/name&gt;&#010;&gt;         &lt;value&gt;2&lt;/value&gt;&#010;&gt;         &lt;description&gt;&#010;&gt;         define mapred.map tasks to be number of slave hosts&#010;&gt;         &lt;/description&gt;&#010;&gt;      &lt;/property&gt;&#010;&gt; &lt;property&gt;&#010;&gt;   &lt;name&gt;mapred.reduce.tasks&lt;/name&gt;&#010;&gt;   &lt;value&gt;2&lt;/value&gt;&#010;&gt;   &lt;description&gt;&#010;&gt;     define mapred.reduce tasks to be number of slave hosts&#010;&gt;   &lt;/description&gt;&#010;&gt; 
&lt;/property&gt;&#010;&gt;&#010;&gt; &lt;/configuration&gt;&#010;&gt;&#010;&gt;&#010;&gt; Thanks and best regards,&#010;&gt; Martin&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Nutch 2.1 generate: how to get multiple maps in deploy-mode</title>
<author><name>Martin Aesch &lt;martin.aesch@googlemail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c1369136687.6333.3.camel@senf.dw.privat%3e"/>
<id>urn:uuid:%3c1369136687-6333-3-camel@senf-dw-privat%3e</id>
<updated>2013-05-21T11:44:47Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi,&#010;&#010;I'm running nutch-2.1 (on top of cassandra) on a single-node hadoop&#010;"cluster". Sorry in case my question is noobish or more a hadoop issue.&#010;&#010;In short: How can I force nutch generate to provide a filesplitsize&gt;1&#010;which seems to be necessary to run multiple map jobs?&#010;&#010;I am seeing that only one input split is generated for nutch generate:&#010;For &#010;./bin/nutch generate -topN 1000000 -noFilter&#010;&#010;&#010;2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:&#010;Input size for job job_201305211335_0001 = 0. Number of splits = 1&#010;2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:&#010;job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0&#010;2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job&#010;job_201305211335_0001 initialized successfully with 1 map tasks and 2&#010;reduce tasks.&#010;2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding&#010;task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip&#010;task_201305211335_0001_m_000002, for tracker&#010;'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'&#010;2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:&#010;Task 'attempt_201305211335_0001_m_000002_0' has completed&#010;task_201305211335_0001_m_000002 successfully.&#010;2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:&#010;Choosing a non-local task task_201305211335_0001_m_000000&#010;2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding&#010;task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip&#010;task_201305211335_0001_m_000000, for tracker&#010;'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'&#010;&#010;&#010;System load did not reach its maximum by far, neither in terms of CPU&#010;nor in terms of i/o-waiting. 
It took 100 minutes, which seems very fair&#010;for one single map, since I have about 50M webpages in my database.&#010;Jobtracker says max map tasks is 2 and max reduce tasks is 2, but only 1&#010;map task is running.&#010;&#010;This is my mapred-site.xml, which should be ok, in particular I&#010;overwrite mapred.job.tracker not to be "local":&#010;&#010;&lt;?xml version="1.0"?&gt;&#010;&lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&gt;&#010;&lt;!-- Put site-specific property overrides in this file. --&gt;&#010;&#010;&lt;configuration&gt;&#010;     &lt;property&gt;&#010;         &lt;name&gt;mapred.job.tracker&lt;/name&gt;&#010;         &lt;value&gt;localhost:9001&lt;/value&gt;&#010;     &lt;/property&gt;&#010;     &lt;property&gt;&#010;        &lt;name&gt;mapreduce.jobtracker.staging.root.dir&lt;/name&gt;&#010;        &lt;value&gt;/user&lt;/value&gt;&#010;     &lt;/property&gt;&#010;     &lt;property&gt; &#010;        &lt;name&gt;mapred.map.tasks&lt;/name&gt;&#010;        &lt;value&gt;2&lt;/value&gt;&#010;        &lt;description&gt;&#010;        define mapred.map tasks to be number of slave hosts&#010;        &lt;/description&gt; &#010;     &lt;/property&gt; &#010;&lt;property&gt; &#010;  &lt;name&gt;mapred.reduce.tasks&lt;/name&gt;&#010;  &lt;value&gt;2&lt;/value&gt;&#010;  &lt;description&gt;&#010;    define mapred.reduce tasks to be number of slave hosts&#010;  &lt;/description&gt; &#010;&lt;/property&gt; &#010;&#010;&lt;/configuration&gt;&#010;&#010;&#010;Thanks and best regards,&#010;Martin&#010;&#010;&#010;&#010;&#010;&#010;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Using Nutch and Hive together</title>
<author><name>&quot;Yves S. Garret&quot; &lt;yoursurrogategod@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAJ=2b06mq+ER5hZ=UO1AkTXg7u5S+C3ZwoHjrR4h1bg79hGFVg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAJ=2b06mq+ER5hZ=UO1AkTXg7u5S+C3ZwoHjrR4h1bg79hGFVg@mail-gmail-com%3e</id>
<updated>2013-05-20T21:50:11Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi, sorry for not writing back earlier.  But yes, I would like to take a&#010;look,&#010;where should I begin?&#010;&#010;&#010;On Tue, Apr 30, 2013 at 6:21 PM, Renato Marroquín Mogrovejo &lt;&#010;renatoj.marroquin@gmail.com&gt; wrote:&#010;&#010;&gt; Hi Yves,&#010;&gt;&#010;&gt; Apache Gora does not support Apache Hive just yet, but we have it on&#010;&gt; our future plans. If you were willing to dive into an adventure with&#010;&gt; Gora we would be happy to help you out with that.&#010;&gt; There is a Pig-Gora adapter patch on JIRA, maybe you would like to&#010;&gt; give it a look? Although there is a bit of work involved in that one&#010;&gt; as well.&#010;&gt;&#010;&gt;&#010;&gt; Renato M.&#010;&gt;&#010;&gt; 2013/4/30 Yves S. Garret &lt;yoursurrogategod@gmail.com&gt;:&#010;&gt; &gt; Ok.  I think I understand.  So there's an adapter involved here.&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; On Tue, Apr 30, 2013 at 5:04 PM, Tejas Patil &lt;tejas.patil.cs@gmail.com&#010;&gt; &gt;wrote:&#010;&gt; &gt;&#010;&gt; &gt;&gt; Nutch 2.x series is built on Gora which offers storage abstraction. From&#010;&gt; &gt;&gt; the Gora project main page, I think gora has adapters for accessing the&#010;&gt; &gt;&gt; data and making analysis through Apache Hive but it wont support&#010;&gt; storing of&#010;&gt; &gt;&gt; nutch data into hive.&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; There are Gora experts on this group who can answer better.&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; On Tue, Apr 30, 2013 at 1:46 PM, Yves S. Garret&#010;&gt; &gt;&gt; &lt;yoursurrogategod@gmail.com&gt;wrote:&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt; Hello,&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; I'm curious.  
If I wanted to store the URLs for Nutch (version 2.1) in&#010;&gt; &gt;&gt; Hive&#010;&gt; &gt;&gt; &gt; (version&#010;&gt; &gt;&gt; &gt; 0.9.0) and then store the output from Nutch in Hive, how would I do&#010;&gt; that?&#010;&gt; &gt;&gt; &gt; Any&#010;&gt; &gt;&gt; &gt; pointers?&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; I've googled for "nutch hive" (maybe there's a better term?), but&#010;&gt; haven't&#010;&gt; &gt;&gt; &gt; found&#010;&gt; &gt;&gt; &gt; anything specific or very helpful.  I'll keep looking and&#010;&gt; experimenting.&#010;&gt; &gt;&gt; &gt; Your help is&#010;&gt; &gt;&gt; &gt; appreciated.&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; --Yves&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif0Jik7cOOSTF3PB8EULax8xk0yZEM-_4YO7-Kyz3o0PZQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif0Jik7cOOSTF3PB8EULax8xk0yZEM-_4YO7-Kyz3o0PZQ@mail-gmail-com%3e</id>
<updated>2013-05-20T17:55:00Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Chris,&#010;&#010;On Mon, May 20, 2013 at 10:21 AM, Christopher Gross &lt;cogross@gmail.com&gt; wrote:&#010;&#010;&gt; Lewis --&#010;&gt; Is the DEBUG something set in the conf/log4j.properties file?  I have the&#010;&gt; rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else is&#010;&gt; INFO or WARN (no DEBUGs to be found.)&#010;&gt;&#010;&#010;Well yes, you can set it in the log4j.properties file; however, if you are&#010;working with anything older than 2.x HEAD then by default the logging is&#010;hardcoded as INFO.&#010;The DEBUG logging was implemented as of NUTCH-1496 and is now built into&#010;2.x HEAD. An example can be seen here:&#010;http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&amp;r2=1408271&#010;&#010;BTW here is the HBase thread which I referred to before:&#010;http://www.mail-archive.com/user@nutch.apache.org/msg09245.html&#010;&#010;&#010;&gt; I'm still a bit lost on what I need to do for the gora-hbase portion.  My&#010;&gt; gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml&#010;&gt; file:&#010;&gt; &lt;property&gt;&#010;&gt;   &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;&gt;   &lt;value&gt;webpage&lt;/value&gt;&#010;&gt;   &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;&gt;   Note that Nutch ignores the value in the gora mapping files, and uses&#010;&gt;   this as the webpage schema name.&#010;&gt;   &lt;/description&gt;&#010;&gt; &lt;/property&gt;&#010;&gt;&#010;&gt; So that would lead me to believe that the gora file is just ignored.&#010;&gt; If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch&#010;&gt; to look in the hbase for the "crawlId_webpage"?&#010;&gt;&#010;&#010;I am unsure as to what your problem is here, Chris. Can you please try to&#010;explain it in layman's terms for me so I understand what problem you are&#010;facing?&#010;Thanks&#010;Lewis&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Christopher Gross &lt;cogross@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqyJJAdJmjMQhQ14re=quFeSjhSgPZfYT8XJr99uS-_Bouypw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqyJJAdJmjMQhQ14re=quFeSjhSgPZfYT8XJr99uS-_Bouypw@mail-gmail-com%3e</id>
<updated>2013-05-20T17:21:55Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Lewis --&#010;Is the DEBUG something set in the conf/log4j.properties file?  I have the&#010;rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else is&#010;INFO or WARN (no DEBUGs to be found.)&#010;&#010;Is there something I should set elsewhere that would be causing this?&#010;&#010;I'm still a bit lost on what I need to do for the gora-hbase portion.  My&#010;gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml file:&#010;&lt;property&gt;&#010;  &lt;name&gt;storage.schema.webpage&lt;/name&gt;&#010;  &lt;value&gt;webpage&lt;/value&gt;&#010;  &lt;description&gt;This value holds the schema name used for Nutch web db.&#010;  Note that Nutch ignores the value in the gora mapping files, and uses&#010;  this as the webpage schema name.&#010;  &lt;/description&gt;&#010;&lt;/property&gt;&#010;&#010;So that would lead me to believe that the gora file is just ignored.&#010;If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch&#010;to look in the hbase for the "crawlId_webpage"?&#010;&#010;&#010;-- Chris&#010;&#010;&#010;On Mon, May 20, 2013 at 11:56 AM, Lewis John Mcgibbney &lt;&#010;lewis.mcgibbney@gmail.com&gt; wrote:&#010;&#010;&gt; Please search the mailing list for the HBase logging. There was a&#010;&gt; conversation on this reasonably recently.&#010;&gt;&#010;&gt; Please see my other response for the rest.&#010;&gt; hth&#010;&gt; Lewis&#010;&gt;&#010;&gt; On Monday, May 20, 2013, Christopher Gross &lt;cogross@gmail.com&gt; wrote:&#010;&gt; &gt; Ok, so the crawlId isn't like the directories used in the 1.x versions of&#010;&gt; &gt; nutch.&#010;&gt; &gt;&#010;&gt; &gt; Well, changing that line makes that part work.  
I still get the "Skipping&#010;&gt; &gt; &lt;url&gt;; different batch id (null)" error.&#010;&gt; &gt;&#010;&gt; &gt; I'm not sure if this line from the hadoop.log file relates:&#010;&gt; &gt; INFO  store.HBaseStore - Keyclass and nameclass match but mismatching&#010;&gt; table&#010;&gt; &gt; names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;&gt; &gt; assuming they are the same.&#010;&gt; &gt;&#010;&gt; &gt; Any ideas for that one?&#010;&gt; &gt;&#010;&gt; &gt; -- Chris&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; On Fri, May 17, 2013 at 4:32 PM, Tejas Patil &lt;tejas.patil.cs@gmail.com&#010;&gt; &gt;wrote:&#010;&gt; &gt;&#010;&gt; &gt;&gt; The exception speaks about the problem:&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt; &gt;&gt; first&#010;&gt; &gt;&gt; character &lt;46&gt; at 0.&#010;&gt; &gt;&gt; User-space table names can only start with 'word characters': i.e.&#010;&gt; &gt;&gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; The crawlId passed must follow the regex [a-zA-Z_0-9]. 
The one you&#010;&gt; passed&#010;&gt; &gt;&gt; has dot and slash.&#010;&gt; &gt;&gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; Try this:&#010;&gt; &gt;&gt; $ ./bin/nutch inject urls/ -crawlId crawl&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; On Fri, May 17, 2013 at 12:47 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&gt; &gt;&gt;&#010;&gt; &gt;&gt; &gt; What if you do bin/nutch inject urls/ ?&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; -----Original Message-----&#010;&gt; &gt;&gt; &gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt; &gt;&gt; &gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt; &gt;&gt; &gt; Sent: Fri, May 17, 2013 11:26 am&#010;&gt; &gt;&gt; &gt; Subject: error crawling&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; I'm having trouble getting my nutch working.  I had it on another&#010;&gt; server&#010;&gt; &gt;&gt; &gt; and it was working fine.  I migrated it to a new server, and I've been&#010;&gt; &gt;&gt; &gt; getting nothing but problems.  My old script wasn't working right&#010;&gt; &gt;&gt; (getting&#010;&gt; &gt;&gt; &gt; a lot of "skipping" on the parser saying that the crawl id was null [a&#010;&gt; &gt;&gt; &gt; separate point of frustration]), so now I'm trying the 'newer' crawl&#010;&gt; &gt;&gt; &gt; script.  This one is worse, since I can't even get the inject to work.&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; urls contains a "seed.txt" file that worked previously and contains a&#010;&gt; &gt;&gt; bunch&#010;&gt; &gt;&gt; &gt; of urls.  
crawl is empty.&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; from my $NUTCH_HOME directory:&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt; &gt;&gt; &gt; InjectorJob: starting&#010;&gt; &gt;&gt; &gt; InjectorJob: urlDir: urls&#010;&gt; &gt;&gt; &gt; InjectorJob: org.apache.gora.util.GoraException:&#010;&gt; &gt;&gt; &gt; java.lang.RuntimeException: java.lang.IllegalArgumentException:&#010;&gt; Illegal&#010;&gt; &gt;&gt; &gt; first character &lt;46&gt; at 0. User-space table names can only start&#010;with&#010;&gt; &gt;&gt; 'word&#010;&gt; &gt;&gt; &gt; characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt; org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)&#010;&gt; &gt;&gt; &gt;         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)&#010;&gt; &gt;&gt; &gt; Caused by: java.lang.RuntimeException:&#010;&gt; &gt;&gt; java.lang.IllegalArgumentException:&#010;&gt; &gt;&gt; &gt; Illegal first character &lt;46&gt; at 0. 
User-space table names can only&#010;&gt; start&#010;&gt; &gt;&gt; &gt; with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:125)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt;&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)&#010;&gt; &gt;&gt; &gt;         ... 7 more&#010;&gt; &gt;&gt; &gt; Caused by: java.lang.IllegalArgumentException: Illegal first character&#010;&gt; &gt;&gt; &lt;46&gt;&#010;&gt; &gt;&gt; &gt; at 0. User-space table names can only start with 'word characters':&#010;&gt; i.e.&#010;&gt; &gt;&gt; &gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;&gt; &gt;         at&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; &gt;&#010;&gt; &gt;&gt; org.apache.hadoop.hbase.HTableDescriptor.&#010;&gt;&#010;&gt; --&#010;&gt; *Lewis*&#010;&gt;&#010;&#010;
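The crawlId rule quoted above can be sketched in a few lines. This is only an
illustration in Python, not code from Nutch, Gora or HBase: the function names
is_valid_crawl_id and table_name are hypothetical, the check follows the
conservative advice in the thread (every character must be a word character,
[a-zA-Z_0-9]), and the crawlId_schema naming is inferred from the
'crawl_webpage' and './crawl/_webpage' names that appear in the logs.

```python
import re

# Assumption: accept a crawlId only if every character is a word character,
# [a-zA-Z_0-9], as advised in the thread.
WORD_CHARS = re.compile(r"[a-zA-Z_0-9]+")


def is_valid_crawl_id(crawl_id):
    """Return True if crawl_id consists solely of word characters."""
    return WORD_CHARS.fullmatch(crawl_id) is not None


def table_name(crawl_id, schema="webpage"):
    """Build the table name as crawlId + "_" + schema (inferred from the logs)."""
    if not is_valid_crawl_id(crawl_id):
        raise ValueError("illegal crawlId: " + repr(crawl_id))
    return crawl_id + "_" + schema


if __name__ == "__main__":
    print(is_valid_crawl_id("./crawl/"))  # the crawlId that failed in the thread
    print(table_name("crawl"))            # the crawlId suggested as a fix
```

Under these assumptions "./crawl/" is rejected because of the dot and the
slashes, while "crawl" maps to the table name reported in hadoop.log.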
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif2VDdDowVgpWfeHi7U4xYmk8-D=cAqHqwjjDDMuAejsyA@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif2VDdDowVgpWfeHi7U4xYmk8-D=cAqHqwjjDDMuAejsyA@mail-gmail-com%3e</id>
<updated>2013-05-20T15:56:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Please search the mailing list for the HBase logging. There was a&#010;conversation on this reasonably recently.&#010;&#010;Please see my other response for the rest.&#010;hth&#010;Lewis&#010;&#010;On Monday, May 20, 2013, Christopher Gross &lt;cogross@gmail.com&gt; wrote:&#010;&gt; Ok, so the crawlId isn't like the directories used in the 1.x versions of&#010;&gt; nutch.&#010;&gt;&#010;&gt; Well, changing that line makes that part work.  I still get the "Skipping&#010;&gt; &lt;url&gt;; different batch id (null)" error.&#010;&gt;&#010;&gt; I'm not sure if this line from the hadoop.log file relates:&#010;&gt; INFO  store.HBaseStore - Keyclass and nameclass match but mismatching&#010;table&#010;&gt; names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;&gt; assuming they are the same.&#010;&gt;&#010;&gt; Any ideas for that one?&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&gt;&#010;&gt; On Fri, May 17, 2013 at 4:32 PM, Tejas Patil &lt;tejas.patil.cs@gmail.com&#010;&gt;wrote:&#010;&gt;&#010;&gt;&gt; The exception speaks about the problem:&#010;&gt;&gt;&#010;&gt;&gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt;&gt; first&#010;&gt;&gt; character &lt;46&gt; at 0.&#010;&gt;&gt; User-space table names can only start with 'word characters': i.e.&#010;&gt;&gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;&gt;&#010;&gt;&gt; The crawlId passed must follow the regex [a-zA-Z_0-9]. 
The one you passed&#010;&gt;&gt; has dot and slash.&#010;&gt;&gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt;&gt;&#010;&gt;&gt; Try this:&#010;&gt;&gt; $ ./bin/nutch inject urls/ -crawlId crawl&#010;&gt;&gt;&#010;&gt;&gt;&#010;&gt;&gt;&#010;&gt;&gt; On Fri, May 17, 2013 at 12:47 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&gt;&gt;&#010;&gt;&gt; &gt; What if you do bin/nutch inject urls/ ?&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt; -----Original Message-----&#010;&gt;&gt; &gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt;&gt; &gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt;&gt; &gt; Sent: Fri, May 17, 2013 11:26 am&#010;&gt;&gt; &gt; Subject: error crawling&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt; I'm having trouble getting my nutch working.  I had it on another&#010;server&#010;&gt;&gt; &gt; and it was working fine.  I migrated it to a new server, and I've been&#010;&gt;&gt; &gt; getting nothing but problems.  My old script wasn't working right&#010;&gt;&gt; (getting&#010;&gt;&gt; &gt; a lot of "skipping" on the parser saying that the crawl id was null [a&#010;&gt;&gt; &gt; separate point of frustration]), so now I'm trying the 'newer' crawl&#010;&gt;&gt; &gt; script.  This one is worse, since I can't even get the inject to work.&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt; urls contains a "seed.txt" file that worked previously and contains a&#010;&gt;&gt; bunch&#010;&gt;&gt; &gt; of urls.  
crawl is empty.&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt; from my $NUTCH_HOME directory:&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt;&gt; &gt; InjectorJob: starting&#010;&gt;&gt; &gt; InjectorJob: urlDir: urls&#010;&gt;&gt; &gt; InjectorJob: org.apache.gora.util.GoraException:&#010;&gt;&gt; &gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt;&gt; &gt; first character &lt;46&gt; at 0. User-space table names can only start with&#010;&gt;&gt; 'word&#010;&gt;&gt; &gt; characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt;&#010;org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt;&#010;org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt;&#010;org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)&#010;&gt;&gt; &gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)&#010;&gt;&gt; &gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)&#010;&gt;&gt; &gt;         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;&gt;&gt; &gt;         at&#010;org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)&#010;&gt;&gt; &gt; Caused by: java.lang.RuntimeException:&#010;&gt;&gt; java.lang.IllegalArgumentException:&#010;&gt;&gt; &gt; Illegal first character &lt;46&gt; at 0. User-space table names can only&#010;start&#010;&gt;&gt; &gt; with 'word characters': i.e. 
[a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:125)&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt;&#010;org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt;&#010;org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)&#010;&gt;&gt; &gt;         ... 7 more&#010;&gt;&gt; &gt; Caused by: java.lang.IllegalArgumentException: Illegal first character&#010;&gt;&gt; &lt;46&gt;&#010;&gt;&gt; &gt; at 0. User-space table names can only start with 'word characters':&#010;i.e.&#010;&gt;&gt; &gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;&gt; &gt;         at&#010;&gt;&gt; &gt;&#010;&gt;&gt; &gt;&#010;&gt;&gt; org.apache.hadoop.hbase.HTableDescriptor.&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: nutch crawl</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif3iSYiRV90H_j7VkSPMm67hEnOjo-j4bprNk=V6hFuuXw@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif3iSYiRV90H_j7VkSPMm67hEnOjo-j4bprNk=V6hFuuXw@mail-gmail-com%3e</id>
<updated>2013-05-20T15:55:21Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Chris,&#010;&#010;Please see the documentation I put up on the wiki for this phenomenon:&#010;&#010;http://wiki.apache.org/nutch/ErrorMessagesInNutch2#Nutch_logging_shows_Skipping_http:.2F.2FmyurlForParsing.com.3B_different_batch_id_.28null.29&#010;&#010;Also, please search the mailing list for a recent discussion on the topic.&#010;&#010;Finally, I logged an issue in Jira to improve logging for this scenario. I&#010;don't agree with the logging of Marks as opposed to the identification of&#010;the batchIds to which those Marks should belong.&#010;If we know the batchId(s) then we can at least attempt to generate (I use&#010;the term generate not to specifically relate to the GeneratorJob, however&#010;this is one of the tools that generates Marks) Marks for the specific&#010;WebPage.&#010;&#010;Right now there is a bit of work to be done here as this has come up&#010;several times and is still not quite fixed.&#010;&#010;It just occurred to me that as of now ALL of this logging has been silenced&#010;to DEBUG level... I am not sure that this is useful enough for obtaining&#010;metrics on how many URLs are skipped due to various Marks being absent.&#010;&#010;https://issues.apache.org/jira/browse/NUTCH-1567&#010;&#010;&#010;On Monday, May 20, 2013, Christopher Gross &lt;cogross@gmail.com&gt; wrote:&#010;&gt; I'm attempting to get a crawl working using scripts, but I've been getting&#010;&gt; a "Skipping &lt;url&gt;; different batch id (null)" error and then nothing new in&#010;&gt; Solr.  So I've reverted back to trying out the "crawl" for the nutch&#010;&gt; script:&#010;&gt;&#010;&gt; ./nutch crawl ../urls/ -solr "http://localhost/nutchsolr" -threads 5 -depth 3 -topN 100&#010;&gt;&#010;&gt; urls has the "seed.txt" file with some sites.  
It definitely is able to get&#010;&gt; pages (finding other hostnames in the lists scrolling through the screen),&#010;&gt; but then it is still skipping with the "batch id (null)" message for&#010;&gt; everything it finds.&#010;&gt;&#010;&gt; Any guidance/advice would be appreciated.&#010;&gt;&#010;&gt; Thanks!&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: nutch crawl</title>
<author><name>feng lu &lt;amuseme.lu@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAOeWMMrQuZHSuCMo=bjDDyAgQF2+Xdu7MiigK-+OF7S7qEOwGA@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAOeWMMrQuZHSuCMo=bjDDyAgQF2+Xdu7MiigK-+OF7S7qEOwGA@mail-gmail-com%3e</id>
<updated>2013-05-20T14:28:07Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Christopher,&#010;&#010;Nutch checks the update db mark when indexing, but here the update db mark&#010;is null, so it skips the URL. Maybe this URL was not parsed successfully; you can&#010;check the log to see what happened.&#010;&#010;&#010;On Mon, May 20, 2013 at 9:44 PM, Christopher Gross &lt;cogross@gmail.com&gt; wrote:&#010;&#010;&gt; I'm attempting to get a crawl working using scripts, but I've been getting&#010;&gt; a "Skipping &lt;url&gt;; different batch id (null)" error and then nothing new in&#010;&gt; Solr.  So I've reverted back to trying out the "crawl" for the nutch&#010;&gt; script:&#010;&gt;&#010;&gt; ./nutch crawl ../urls/ -solr "http://localhost/nutchsolr" -threads 5&#010;&gt; -depth&#010;&gt; 3 -topN 100&#010;&gt;&#010;&gt; urls has the "seed.txt" file with some sites.  It definitely is able to get&#010;&gt; pages (finding other hostnames in the lists scrolling through the screen),&#010;&gt; but then it is still skipping with the "batch id (null)" message for&#010;&gt; everything it finds.&#010;&gt;&#010;&gt; Any guidance/advice would be appreciated.&#010;&gt;&#010;&gt; Thanks!&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;Don't Grow Old, Grow Up... :-)&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>nutch crawl</title>
<author><name>Christopher Gross &lt;cogross@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqyJJD5eOjF0aqf+HTtE5cqCRVktR2VdJyGrbDxTKxjMGVFYg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqyJJD5eOjF0aqf+HTtE5cqCRVktR2VdJyGrbDxTKxjMGVFYg@mail-gmail-com%3e</id>
<updated>2013-05-20T13:44:06Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
I'm attempting to get a crawl working using scripts, but I've been getting&#010;a "Skipping &lt;url&gt;; different batch id (null)" error and then nothing new in&#010;Solr.  So I've reverted back to trying out the "crawl" for the nutch script:&#010;&#010;./nutch crawl ../urls/ -solr "http://localhost/nutchsolr" -threads 5 -depth&#010;3 -topN 100&#010;&#010;urls has the "seed.txt" file with some sites.  It definitely is able to get&#010;pages (finding other hostnames in the lists scrolling through the screen),&#010;but then it is still skipping with the "batch id (null)" message for&#010;everything it finds.&#010;&#010;Any guidance/advice would be appreciated.&#010;&#010;Thanks!&#010;&#010;-- Chris&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Christopher Gross &lt;cogross@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGqyJJB7njUetJeLMi220gXhMe+ctRGTmyZoF6LE4407PqvRJg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGqyJJB7njUetJeLMi220gXhMe+ctRGTmyZoF6LE4407PqvRJg@mail-gmail-com%3e</id>
<updated>2013-05-20T13:05:39Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Ok, so the crawlId isn't like the directories used in the 1.x versions of&#010;nutch.&#010;&#010;Well, changing that line makes that part work.  I still get the "Skipping&#010;&lt;url&gt;; different batch id (null)" error.&#010;&#010;I'm not sure if this line from the hadoop.log file relates:&#010;INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table&#010;names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,&#010;assuming they are the same.&#010;&#010;Any ideas for that one?&#010;&#010;-- Chris&#010;&#010;&#010;On Fri, May 17, 2013 at 4:32 PM, Tejas Patil &lt;tejas.patil.cs@gmail.com&gt;wrote:&#010;&#010;&gt; The exception speaks about the problem:&#010;&gt;&#010;&gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt; first&#010;&gt; character &lt;46&gt; at 0.&#010;&gt; User-space table names can only start with 'word characters': i.e.&#010;&gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;&#010;&gt; The crawlId passed must follow the regex [a-zA-Z_0-9]. The one you passed&#010;&gt; has dot and slash.&#010;&gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt;&#010;&gt; Try this:&#010;&gt; $ ./bin/nutch inject urls/ -crawlId crawl&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; On Fri, May 17, 2013 at 12:47 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&gt;&#010;&gt; &gt; What if you do bin/nutch inject urls/ ?&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; -----Original Message-----&#010;&gt; &gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt; &gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt; &gt; Sent: Fri, May 17, 2013 11:26 am&#010;&gt; &gt; Subject: error crawling&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; I'm having trouble getting my nutch working.  I had it on another server&#010;&gt; &gt; and it was working fine.  I migrated it to a new server, and I've been&#010;&gt; &gt; getting nothing but problems.  
My old script wasn't working right&#010;&gt; (getting&#010;&gt; &gt; a lot of "skipping" on the parser saying that the crawl id was null [a&#010;&gt; &gt; separate point of frustration]), so now I'm trying the 'newer' crawl&#010;&gt; &gt; script.  This one is worse, since I can't even get the inject to work.&#010;&gt; &gt;&#010;&gt; &gt; urls contains a "seed.txt" file that worked previously and contains a&#010;&gt; bunch&#010;&gt; &gt; of urls.  crawl is empty.&#010;&gt; &gt;&#010;&gt; &gt; from my $NUTCH_HOME directory:&#010;&gt; &gt;&#010;&gt; &gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt; &gt; InjectorJob: starting&#010;&gt; &gt; InjectorJob: urlDir: urls&#010;&gt; &gt; InjectorJob: org.apache.gora.util.GoraException:&#010;&gt; &gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt; &gt; first character &lt;46&gt; at 0. User-space table names can only start with&#010;&gt; 'word&#010;&gt; &gt; characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)&#010;&gt; &gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)&#010;&gt; &gt;         at&#010;&gt; org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)&#010;&gt; &gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)&#010;&gt; &gt;         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;&gt; &gt;         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)&#010;&gt; &gt; Caused by: java.lang.RuntimeException:&#010;&gt; java.lang.IllegalArgumentException:&#010;&gt; &gt; Illegal first 
character &lt;46&gt; at 0. User-space table names can only start&#010;&gt; &gt; with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;         at&#010;&gt; &gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:125)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)&#010;&gt; &gt;         ... 7 more&#010;&gt; &gt; Caused by: java.lang.IllegalArgumentException: Illegal first character&#010;&gt; &lt;46&gt;&#010;&gt; &gt; at 0. User-space table names can only start with 'word characters': i.e.&#010;&gt; &gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.isLegalTableName(HTableDescriptor.java:280)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.&lt;init&gt;(HTableDescriptor.java:172)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.&lt;init&gt;(HTableDescriptor.java:158)&#010;&gt; &gt;         at&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; org.apache.gora.hbase.store.HBaseMapping$HBaseMappingBuilder.build(HBaseMapping.java:171)&#010;&gt; &gt;         at&#010;&gt; &gt; org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:592)&#010;&gt; &gt;         at&#010;&gt; &gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:111)&#010;&gt; &gt;         ... 9 more&#010;&gt; &gt;&#010;&gt; &gt; Where is the "_webpage" coming from?  Am I just missing something?&#010;&gt; &gt;&#010;&gt; &gt; Any help/ideas/references would be appreciated.&#010;&gt; &gt;&#010;&gt; &gt; Thanks!&#010;&gt; &gt;&#010;&gt; &gt; -- Chris&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt;&#010;&#010;
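The crawlId rule from the quoted reply can be sketched as a small pre-flight check to run before calling inject. This is only an illustration: the helper name is_valid_crawl_id is hypothetical and not part of Nutch, and requiring every character (not only the first) to be in [a-zA-Z_0-9] is an assumption based on the advice in this thread rather than on the HBase source.

```python
import re

# Assumption: every character must be in [a-zA-Z_0-9], as advised in this thread.
CRAWL_ID_PATTERN = re.compile(r"[a-zA-Z_0-9]+")


def is_valid_crawl_id(crawl_id):
    """Return True if crawl_id consists only of word characters."""
    return CRAWL_ID_PATTERN.fullmatch(crawl_id) is not None


# The two values tried in this thread: the rejected one and the suggested one.
for candidate in ("./crawl/", "crawl"):
    print(candidate, is_valid_crawl_id(candidate))
```

The trace shows the table name being built as the crawlId followed by _webpage (./crawl/_webpage), which matches the 'crawl_webpage' schema name in the HBaseStore log line above once the crawlId is changed to crawl.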
</pre>
</div>
</content>
</entry>
<entry>
<title>[REQUEST] (NUTCH-1569) Upgrade 2.x to Gora 0.3</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif0AhdZHuoQr2v3nB0HSHH7hBRUxQ+CqFgMEGdnMrtJMXQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif0AhdZHuoQr2v3nB0HSHH7hBRUxQ+CqFgMEGdnMrtJMXQ@mail-gmail-com%3e</id>
<updated>2013-05-19T21:39:16Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi All,&#010;I submitted a patch to upgrade the Nutch 2.x Branch codebase to the newly&#010;released Gora 0.3.&#010;The patch can be found here [0].&#010;It would be excellent if folks could please test this patch and provide&#010;feedback to the dev@ list.&#010;The feedback will be very helpful in allowing us to progress towards a&#010;Nutch 2.2 Release.&#010;Thank you very much.&#010;Lewis&#010;&#010;[0] https://issues.apache.org/jira/browse/NUTCH-1569&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Status of Elasticsearch indexer?</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif2yLeSnig4YExLj4PVfMqhP0TqevHCgdkRAzjwAOgErzg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif2yLeSnig4YExLj4PVfMqhP0TqevHCgdkRAzjwAOgErzg@mail-gmail-com%3e</id>
<updated>2013-05-19T02:09:59Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Chris,&#010;Thanks for getting on the list and discussing these aspects of development&#010;:0)&#010;From my perspective there are a number of observations:&#010;&#010;BRANCH 2.x&#010;* NUTCH-1568 [0] is ripe for development. My sole justification for not&#010;addressing this is that we wish to push Nutch 2.2 and it is safe to say&#010;that there will not be enough testing to push the code and mark it as&#010;stable!&#010;* NUTCH-1486 [1] is ready for testing (I know this is not Elasticsearch&#010;but I thought I'd throw it in there)&#010;* Ferdy committed NUTCH-1445 [2], which enables you to index 2.x data to&#010;Elasticsearch, but it is not pluggable so to speak. This will most likely&#010;happen once we make the 2.x architecture pluggable in 2.3 development.&#010;&#010;TRUNK&#010;&#010;* Since Julien committed NUTCH-1047 [3], trunk is pluggable to the tune of Solr&#010;3.X; however, Sebastian submitted a patch for a CSV indexer [4] and it would&#010;not be very hard to get the MongoDB patch ported to the pluggable architecture&#010;either, I would imagine.&#010;* Porting of Elasticsearch from 2.x to pluggable trunk will most likely&#010;happen in the 1.8 development drive.&#010;&#010;I think that wraps it up from me. Most likely there is something I've&#010;missed out though!&#010;It would be really great if you were able to chip in on any of the above...&#010;we are always in need of porting stuff... 
and actually most critically&#010;reviewing the mountain of patches we have in Jira :0)&#010;hth&#010;Lewis&#010;&#010;[0] https://issues.apache.org/jira/browse/NUTCH-1568&#010;[1] https://issues.apache.org/jira/browse/NUTCH-1486&#010;[2] https://issues.apache.org/jira/browse/NUTCH-1445&#010;[3] https://issues.apache.org/jira/browse/NUTCH-1047&#010;[4] https://issues.apache.org/jira/browse/NUTCH-1541&#010;&#010;&#010;&#010;On Fri, May 17, 2013 at 12:55 PM, Chris Hairfield &lt;&#010;chairfield@latitudegeo.com&gt; wrote:&#010;&#010;&gt; Hello everyone,&#010;&gt;&#010;&gt; I've been eagerly awaiting some of the functionality slated for 2.x,&#010;&gt; especially around your work integrating with Elasticsearch. If possible,&#010;&gt; could you give any additional status on pluggable indexing (NUTCH-1568) and&#010;&gt; the nutch-elasticsearch-indexer (NUTCH-1527)?&#010;&gt;&#010;&gt; It's been a wonderful experience diving into Nutch for the last month and&#010;&gt; watching you guys do pretty awesome work. Now that I can finally say I no&#010;&gt; longer feel completely overwhelmed, I'd like to throw in my support for&#010;&gt; these items. Further, if there is work that still needs to be done, I might&#010;&gt; like to try helping out myself :)&#010;&gt;&#010;&gt; Thanks!&#010;&gt; Chris&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Getting error while running nutch in eclips in window environment</title>
<author><name>Lewis John Mcgibbney &lt;lewis.mcgibbney@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAGaRif3LxN2u=i4RR8PzkT_MK1d1M4krwBo9sq-ws2-dAKBAAg@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAGaRif3LxN2u=i4RR8PzkT_MK1d1M4krwBo9sq-ws2-dAKBAAg@mail-gmail-com%3e</id>
<updated>2013-05-18T19:20:38Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
You need to follow the tutorial here&#010;http://wiki.apache.org/nutch/RunNutchInEclipse&#010;If after reading this thoroughly you have some problems please let us know&#010;about them.&#010;Thank you&#010;Lewis&#010;&#010;&#010;On Thu, May 16, 2013 at 12:07 PM, harsh yadav &lt;harsh.mca9@gmail.com&gt; wrote:&#010;&#010;&gt; 2013-05-17 00:33:09,376 WARN  crawl.Crawl (Crawl.java:run(97)) - solrUrl is&#010;&gt; not set, indexing will be skipped...&#010;&gt; 2013-05-17 00:33:09,522 INFO  crawl.Crawl (Crawl.java:run(108)) - crawl&#010;&gt; started in: crawl&#010;&gt; 2013-05-17 00:33:09,523 INFO  crawl.Crawl (Crawl.java:run(109)) -&#010;&gt; rootUrlDir = urls&#010;&gt; 2013-05-17 00:33:09,523 INFO  crawl.Crawl (Crawl.java:run(110)) - threads =&#010;&gt; 10&#010;&gt; 2013-05-17 00:33:09,523 INFO  crawl.Crawl (Crawl.java:run(111)) - depth = 3&#010;&gt; 2013-05-17 00:33:09,523 INFO  crawl.Crawl (Crawl.java:run(112)) -&#010;&gt; solrUrl=null&#010;&gt; 2013-05-17 00:33:09,524 INFO  crawl.Crawl (Crawl.java:run(114)) - topN = 50&#010;&gt; 2013-05-17 00:33:09,534 INFO  crawl.Injector (Injector.java:inject(257)) -&#010;&gt; Injector: starting at 2013-05-17 00:33:09&#010;&gt; 2013-05-17 00:33:09,535 INFO  crawl.Injector (Injector.java:inject(258)) -&#010;&gt; Injector: crawlDb: crawl/crawldb&#010;&gt; 2013-05-17 00:33:09,535 INFO  crawl.Injector (Injector.java:inject(259)) -&#010;&gt; Injector: urlDir: urls&#010;&gt; 2013-05-17 00:33:09,583 INFO  crawl.Injector (Injector.java:inject(269)) -&#010;&gt; Injector: Converting injected urls to crawl db entries.&#010;&gt; 2013-05-17 00:33:09,610 INFO  jvm.JvmMetrics (JvmMetrics.java:init(71)) -&#010;&gt; Initializing JVM Metrics with processName=JobTracker, sessionId=&#010;&gt; 2013-05-17 00:33:09,663 WARN  mapred.JobClient&#010;&gt; (JobClient.java:configureCommandLineOptions(661)) - No job jar file set.&#010;&gt;  User classes may not be found. 
See JobConf(Class) or&#010;&gt; JobConf#setJar(String).&#010;&gt; 2013-05-17 00:33:09,678 INFO  mapred.FileInputFormat&#010;&gt; (FileInputFormat.java:listStatus(192)) - Total input paths to process : 1&#010;&gt; 2013-05-17 00:33:10,158 INFO  mapred.JobClient&#010;&gt; (JobClient.java:monitorAndPrintJob(1275)) - Running job: job_local_0001&#010;&gt; 2013-05-17 00:33:10,161 INFO  mapred.FileInputFormat&#010;&gt; (FileInputFormat.java:listStatus(192)) - Total input paths to process : 1&#010;&gt; 2013-05-17 00:33:10,212 INFO  mapred.MapTask&#010;&gt; (MapTask.java:runOldMapper(347)) - numReduceTasks: 1&#010;&gt; 2013-05-17 00:33:10,217 INFO  mapred.MapTask (MapTask.java:&lt;init&gt;(776)) -&#010;&gt; io.sort.mb = 100&#010;&gt; 2013-05-17 00:33:10,241 INFO  mapred.MapTask (MapTask.java:&lt;init&gt;(788)) -&#010;&gt; data buffer = 79691776/99614720&#010;&gt; 2013-05-17 00:33:10,241 INFO  mapred.MapTask (MapTask.java:&lt;init&gt;(789)) -&#010;&gt; record buffer = 262144/327680&#010;&gt; 2013-05-17 00:33:10,251 WARN  plugin.PluginRepository&#010;&gt; (PluginManifestParser.java:getPluginFolder(123)) - Plugins: directory not&#010;&gt; found: plugins&#010;&gt; 2013-05-17 00:33:10,252 INFO  plugin.PluginRepository&#010;&gt; (PluginRepository.java:displayStatus(313)) - Plugin Auto-activation mode:&#010;&gt; [true]&#010;&gt; 2013-05-17 00:33:10,252 INFO  plugin.PluginRepository&#010;&gt; (PluginRepository.java:displayStatus(314)) - Registered Plugins:&#010;&gt; 2013-05-17 00:33:10,252 INFO  plugin.PluginRepository&#010;&gt; (PluginRepository.java:displayStatus(317)) - NONE&#010;&gt; 2013-05-17 00:33:10,252 INFO  plugin.PluginRepository&#010;&gt; (PluginRepository.java:displayStatus(324)) - Registered Extension-Points:&#010;&gt; 2013-05-17 00:33:10,253 INFO  plugin.PluginRepository&#010;&gt; (PluginRepository.java:displayStatus(326)) - NONE&#010;&gt; 2013-05-17 00:33:10,255 WARN  mapred.LocalJobRunner&#010;&gt; (LocalJobRunner.java:run(256)) - job_local_0001&#010;&gt; 
java.lang.RuntimeException: Error in configuring object&#010;&gt; at&#010;&gt; org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)&#010;&gt; at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)&#010;&gt; at&#010;&gt;&#010;&gt; org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)&#010;&gt; at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)&#010;&gt; at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)&#010;&gt; at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)&#010;&gt; Caused by: java.lang.reflect.InvocationTargetException&#010;&gt; at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)&#010;&gt; at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)&#010;&gt; at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)&#010;&gt; at java.lang.reflect.Method.invoke(Unknown Source)&#010;&gt; at&#010;&gt; org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)&#010;&gt; ... 5 more&#010;&gt; Caused by: java.lang.RuntimeException: Error in configuring object&#010;&gt; at&#010;&gt; org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)&#010;&gt; at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)&#010;&gt; at&#010;&gt;&#010;&gt; org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)&#010;&gt; at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)&#010;&gt; ... 10 more&#010;&gt; Caused by: java.lang.reflect.InvocationTargetException&#010;&gt; at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)&#010;&gt; at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)&#010;&gt; at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)&#010;&gt; at java.lang.reflect.Method.invoke(Unknown Source)&#010;&gt; at&#010;&gt; org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)&#010;&gt; ... 
13 more&#010;&gt; Caused by: java.lang.RuntimeException: x point&#010;&gt; org.apache.nutch.net.URLNormalizer not found.&#010;&gt; at org.apache.nutch.net.URLNormalizers.&lt;init&gt;(URLNormalizers.java:123)&#010;&gt; at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:74)&#010;&gt; ... 18 more&#010;&gt; 2013-05-17 00:33:11,160 INFO  mapred.JobClient&#010;&gt; (JobClient.java:monitorAndPrintJob(1288)) -  map 0% reduce 0%&#010;&gt; 2013-05-17 00:33:11,163 INFO  mapred.JobClient&#010;&gt; (JobClient.java:monitorAndPrintJob(1343)) - Job complete: job_local_0001&#010;&gt; 2013-05-17 00:33:11,164 INFO  mapred.JobClient (Counters.java:log(514)) -&#010;&gt; Counters: 0&#010;&gt; Exception in thread "main" java.io.IOException: Job failed!&#010;&gt; at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)&#010;&gt; at org.apache.nutch.crawl.Injector.inject(Injector.java:281)&#010;&gt; at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)&#010;&gt; at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;&gt; at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; On Fri, May 17, 2013 at 12:37 AM, harsh yadav &lt;harsh.mca9@gmail.com&gt;&#010;&gt; wrote:&#010;&gt;&#010;&gt; &gt; Hello,&#010;&gt; &gt;&#010;&gt; &gt; I am running nutch 1.6 with hadoop 0.20.2 but not able to crawl in eclips&#010;&gt; &gt; every time getting error:-&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt;&#010;&#010;&#010;&#010;-- &#010;*Lewis*&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>[Nutch-newbie] Installation error</title>
<author><name>&quot;Shah, Nishant&quot; &lt;nishans@amazon.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c21578B1B471840408D48D3C3201B6481B42145@ex10-mbx-9003.ant.amazon.com%3e"/>
<id>urn:uuid:%3c21578B1B471840408D48D3C3201B6481B42145@ex10-mbx-9003-ant-amazon-com%3e</id>
<updated>2013-05-18T00:36:21Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi everyone,&#010;&#010;This is my first post so apologies if this is not the correct question to ask.&#010;&#010;I have followed the wiki tutorial and I am getting the below error. I am running in the local&#010;mode and don't have hadoop installed. Can you please help as I have no clue what's going wrong.&#010;&#010;Thanks.&#010;Nishant&#010;&#010;The Error:&#010;log4j:ERROR setFile(null,true) call failed.&#010;java.io.FileNotFoundException: /home/local/ANT/nishans/nutch-1.6/apache-nutch-1.6/logs/hadoop.log&#010;(No such file or directory)&#010;at java.io.FileOutputStream.openAppend(Native Method)&#010;at java.io.FileOutputStream.&lt;init&gt;(FileOutputStream.java:207)&#010;at java.io.FileOutputStream.&lt;init&gt;(FileOutputStream.java:131)&#010;at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)&#010;at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)&#010;at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)&#010;at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)&#010;at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)&#010;at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)&#010;at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)&#010;at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)&#010;at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)&#010;at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)&#010;at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)&#010;at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)&#010;at org.apache.log4j.LogManager.&lt;clinit&gt;(LogManager.java:125)&#010;at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)&#010;at 
org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)&#010;at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)&#010;at org.apache.nutch.crawl.Injector.&lt;clinit&gt;(Injector.java:53)&#010;log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].&#010;Injector: starting at 2013-05-17 17:22:22&#010;Injector: crawlDb: TestCrawl/crawldb&#010;Injector: urlDir: urls/seed.txt&#010;Injector: Converting injected urls to crawl db entries.&#010;Injector: java.io.IOException: Job failed!&#010;at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)&#010;at org.apache.nutch.crawl.Injector.inject(Injector.java:281)&#010;at org.apache.nutch.crawl.Injector.run(Injector.java:318)&#010;at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;at org.apache.nutch.crawl.Injector.main(Injector.java:308)&#010;&#010;nishans@ua41f725d6547517ff08c:~/nutch-1.6/apache-nutch-1.6$ clear&#010;&#010;nishans@ua41f725d6547517ff08c:~/nutch-1.6/apache-nutch-1.6$ bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2&#010;log4j:ERROR setFile(null,true) call failed.&#010;java.io.FileNotFoundException: /home/local/ANT/nishans/nutch-1.6/apache-nutch-1.6/logs/hadoop.log&#010;(No such file or directory)&#010;at java.io.FileOutputStream.openAppend(Native Method)&#010;at java.io.FileOutputStream.&lt;init&gt;(FileOutputStream.java:207)&#010;at java.io.FileOutputStream.&lt;init&gt;(FileOutputStream.java:131)&#010;at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)&#010;at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)&#010;at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)&#010;at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)&#010;at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)&#010;at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)&#010;at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)&#010;at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)&#010;at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)&#010;at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)&#010;at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)&#010;at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)&#010;at org.apache.log4j.LogManager.&lt;clinit&gt;(LogManager.java:125)&#010;at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)&#010;at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:242)&#010;at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:254)&#010;at org.apache.nutch.crawl.Injector.&lt;clinit&gt;(Injector.java:53)&#010;log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].&#010;Injector: starting at 2013-05-17 17:30:07&#010;Injector: crawlDb: TestCrawl/crawldb&#010;Injector: urlDir: urls/seed.txt&#010;Injector: Converting injected urls to crawl db entries.&#010;Injector: java.io.IOException: Job failed!&#010;at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)&#010;at org.apache.nutch.crawl.Injector.inject(Injector.java:281)&#010;at org.apache.nutch.crawl.Injector.run(Injector.java:318)&#010;at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;at org.apache.nutch.crawl.Injector.main(Injector.java:308)&#010;&#010;&#010;&#010;&#010;
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: Example crawl script Nutch 2.1</title>
<author><name>Tejas Patil &lt;tejas.patil.cs@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAFKhtFzk9wXmxNXG_+4MBSBHXfw7PACs-15=WQfqBf0UxRL6nQ@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAFKhtFzk9wXmxNXG_+4MBSBHXfw7PACs-15=WQfqBf0UxRL6nQ@mail-gmail-com%3e</id>
<updated>2013-05-17T20:40:55Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hi Bai Shen,&#010;&#010;Thanks for your comments. Can you kindly add those to the relevant jira [0]&#010;so that it gets tracked ?&#010;&#010;[0] https://issues.apache.org/jira/browse/NUTCH-1545&#010;&#010;Thanks,&#010;Tejas&#010;&#010;&#010;On Fri, May 17, 2013 at 5:36 AM, Bai Shen &lt;baishen.lists@gmail.com&gt; wrote:&#010;&#010;&gt; I just tested the GeneratorJob portion and it works fine.  I have two&#010;&gt; comments, though.&#010;&gt;&#010;&gt; 1.  I added braces around the -batchId arg if statement.  I don't like if's&#010;&gt; without them.&#010;&gt; 2.  BatchIds never get cleared.  So if you use the same batchId for&#010;&gt; multiple crawl cycles your urls per batch will continue to grow.  There&#010;&gt; should probably be some sort of note in the help printout.&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; On Tue, Apr 30, 2013 at 10:37 AM, Lewis John Mcgibbney &lt;&#010;&gt; lewis.mcgibbney@gmail.com&gt; wrote:&#010;&gt;&#010;&gt; &gt; Hi James,&#010;&gt; &gt; Please look for NUTCH-1545 capture batchid...&#010;&gt; &gt; If you could review and use this patch it would be very very helpful.&#010;&gt; &gt; thank you&#010;&gt; &gt; lewis&#010;&gt; &gt;&#010;&gt; &gt; On Tuesday, April 30, 2013, James Ford &lt;simon.forsb@gmail.com&gt; wrote:&#010;&gt; &gt; &gt; Thanks for your answer!&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; I think I will create my own modified crawlscript then. But I am pretty&#010;&gt; &gt; &gt; confused of how to get a generated batchId? 
Should I just parse the id&#010;&gt; &gt; from&#010;&gt; &gt; &gt; the output:&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; GeneratorJob: generated batch id: 1367327604-149897259&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; Or should I get the newly generated batchId from the datastore in my&#010;&gt; &gt; script?&#010;&gt; &gt; &gt; Any best practices?&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; Thanks&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt;&#010;&gt; &gt; &gt; --&#010;&gt; &gt; &gt; View this message in context:&#010;&gt; &gt;&#010;&gt; &gt;&#010;&gt; http://lucene.472066.n3.nabble.com/Example-crawl-script-Nutch-2-1-tp4059960p4059985.html&#010;&gt; &gt; &gt; Sent from the Nutch - User mailing list archive at Nabble.com.&#010;&gt; &gt; &gt;&#010;&gt; &gt;&#010;&gt; &gt; --&#010;&gt; &gt; *Lewis*&#010;&gt; &gt;&#010;&gt;&#010;&#010;
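The first option raised in the quoted question, parsing the batch id out of the GeneratorJob console output, might look like the sketch below. It assumes the line format shown in this thread ("GeneratorJob: generated batch id: ..."); the function name extract_batch_id is hypothetical and not part of Nutch.

```python
import re

# Assumed line format, copied from the console output quoted in this thread.
BATCH_ID_LINE = re.compile(r"GeneratorJob: generated batch id: (\S+)")


def extract_batch_id(output):
    """Return the first batch id found in the generator output, or None."""
    match = BATCH_ID_LINE.search(output)
    return match.group(1) if match else None


sample = "GeneratorJob: generated batch id: 1367327604-149897259"
print(extract_batch_id(sample))
```

Returning None when no matching line is present lets a wrapper script stop the cycle early instead of passing an empty batch id to the later jobs.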
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: error crawling</title>
<author><name>Tejas Patil &lt;tejas.patil.cs@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAFKhtFz301FWxctN8+-wzOFSzobk3xLgHt_BJD63-NruxdLL3w@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAFKhtFz301FWxctN8+-wzOFSzobk3xLgHt_BJD63-NruxdLL3w@mail-gmail-com%3e</id>
<updated>2013-05-17T20:32:24Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
The exception speaks about the problem:&#010;&#010;java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal first&#010;character &lt;46&gt; at 0.&#010;User-space table names can only start with 'word characters': i.e.&#010;[a-zA-Z_0-9]: ./crawl/_webpage&#010;&#010;The crawlId passed must follow the regex [a-zA-Z_0-9]. The one you passed&#010;has dot and slash.&#010;$ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&#010;Try this:&#010;$ ./bin/nutch inject urls/ -crawlId crawl&#010;&#010;&#010;&#010;On Fri, May 17, 2013 at 12:47 PM, &lt;alxsss@aim.com&gt; wrote:&#010;&#010;&gt; What if you do bin/nutch inject urls/ ?&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt;&#010;&gt; -----Original Message-----&#010;&gt; From: Christopher Gross &lt;cogross@gmail.com&gt;&#010;&gt; To: user &lt;user@nutch.apache.org&gt;&#010;&gt; Sent: Fri, May 17, 2013 11:26 am&#010;&gt; Subject: error crawling&#010;&gt;&#010;&gt;&#010;&gt; I'm having trouble getting my nutch working.  I had it on another server&#010;&gt; and it was working fine.  I migrated it to a new server, and I've been&#010;&gt; getting nothing but problems.  My old script wasn't working right (getting&#010;&gt; a lot of "skipping" on the parser saying that the crawl id was null [a&#010;&gt; separate point of frustration]), so now I'm trying the 'newer' crawl&#010;&gt; script.  This one is worse, since I can't even get the inject to work.&#010;&gt;&#010;&gt; urls contains a "seed.txt" file that worked previously and contains a bunch&#010;&gt; of urls.  crawl is empty.&#010;&gt;&#010;&gt; from my $NUTCH_HOME directory:&#010;&gt;&#010;&gt; $ ./bin/nutch inject urls/ -crawlId ./crawl/&#010;&gt; InjectorJob: starting&#010;&gt; InjectorJob: urlDir: urls&#010;&gt; InjectorJob: org.apache.gora.util.GoraException:&#010;&gt; java.lang.RuntimeException: java.lang.IllegalArgumentException: Illegal&#010;&gt; first character &lt;46&gt; at 0. 
User-space table names can only start with 'word&#010;&gt; characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)&#010;&gt;         at&#010;&gt; org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)&#010;&gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:214)&#010;&gt;         at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:228)&#010;&gt;         at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:248)&#010;&gt;         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)&#010;&gt;         at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:258)&#010;&gt; Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException:&#010;&gt; Illegal first character &lt;46&gt; at 0. User-space table names can only start&#010;&gt; with 'word characters': i.e. [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;         at&#010;&gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:125)&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)&#010;&gt;         ... 7 more&#010;&gt; Caused by: java.lang.IllegalArgumentException: Illegal first character &lt;46&gt;&#010;&gt; at 0. 
User-space table names can only start with 'word characters': i.e.&#010;&gt; [a-zA-Z_0-9]: ./crawl/_webpage&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.isLegalTableName(HTableDescriptor.java:280)&#010;&gt;         at&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.&lt;init&gt;(HTableDescriptor.java:172)&#010;&gt;         at&#010;&gt; org.apache.hadoop.hbase.HTableDescriptor.&lt;init&gt;(HTableDescriptor.java:158)&#010;&gt;         at&#010;&gt;&#010;&gt; org.apache.gora.hbase.store.HBaseMapping$HBaseMappingBuilder.build(HBaseMapping.java:171)&#010;&gt;         at&#010;&gt; org.apache.gora.hbase.store.HBaseStore.readMapping(HBaseStore.java:592)&#010;&gt;         at&#010;&gt; org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:111)&#010;&gt;         ... 9 more&#010;&gt;&#010;&gt; Where is the "_webpage" coming from?  Am I just missing something?&#010;&gt;&#010;&gt; Any help/ideas/references would be appreciated.&#010;&gt;&#010;&gt; Thanks!&#010;&gt;&#010;&gt; -- Chris&#010;&gt;&#010;&gt;&#010;&gt;&#010;&#010;
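A minimal sketch of checking a crawl id before passing it to inject, assuming the id should be limited to the characters named in the exception text, [a-zA-Z_0-9]. HBase may accept a wider set after the first character, so this check is deliberately strict. CrawlIdCheck and isSafeCrawlId are hypothetical names, not part of Nutch or Gora.

```java
import java.util.regex.Pattern;

public class CrawlIdCheck {
    // Assumption: allow only the characters listed in the exception message.
    // The real HBase rule may be more permissive after the first character.
    private static final Pattern WORD_CHARS = Pattern.compile("[a-zA-Z_0-9]+");

    // Hypothetical helper: true when the id is non-empty and made of word characters only.
    static boolean isSafeCrawlId(String crawlId) {
        if (crawlId == null) {
            return false;
        }
        return WORD_CHARS.matcher(crawlId).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSafeCrawlId("crawl"));    // true
        System.out.println(isSafeCrawlId("./crawl/")); // false
    }
}
```

With a check like this in a wrapper script or tool, ./crawl/ is rejected before any table is created, while crawl passes.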
</pre>
</div>
</content>
</entry>
<entry>
<title>Re: crawl stopping randomly before the specified depth</title>
<author><name>Tejas Patil &lt;tejas.patil.cs@gmail.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3cCAFKhtFxrE3C_kAjT4MtM455V1vgNQAVWXd2XoFsuAG0M_QCu7A@mail.gmail.com%3e"/>
<id>urn:uuid:%3cCAFKhtFxrE3C_kAjT4MtM455V1vgNQAVWXd2XoFsuAG0M_QCu7A@mail-gmail-com%3e</id>
<updated>2013-05-17T20:24:01Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
On Fri, May 17, 2013 at 1:39 AM, Sourajit Basak &lt;sourajit.basac@gmail.com&gt; wrote:&#010;&#010;&gt; In order to control the number of part files generated, we made a minor&#010;&gt; change to handle 'numFetchers' argument in the one-step crawl command.&#010;&gt; Relevant portions of the code are given below.&#010;&gt;&#010;&gt; Although I set the depth to 3, the crawl stops at the first or second&#010;&gt; pass. It's random in nature.&#010;&gt;&#010;&gt; From the logs, I find&#010;&gt; *Generator: 0 records selected for fetching, exiting ...*&#010;&gt;&#010;&#010;I executed the crawl with your code change on the latest codebase (trunk) and&#010;didn't see this happening.&#010;&#010;&#010;&gt;     ....................&#010;&gt;     int depth = 5;&#010;&gt;     int numFetchers = -1;&#010;&gt;     long topN = Long.MAX_VALUE;&#010;&gt;     .............&#010;&gt;&#010;&gt;     for (int i = 0; i &lt; args.length; i++) {&#010;&gt;       if ("-dir".equals(args[i])) {&#010;&gt;         dir = new Path(args[i+1]);&#010;&gt;         i++;&#010;&gt;       } else if ("-threads".equals(args[i])) {&#010;&gt;         threads = Integer.parseInt(args[i+1]);&#010;&gt;         i++;&#010;&gt;       } else if ("-depth".equals(args[i])) {&#010;&gt;         depth = Integer.parseInt(args[i+1]);&#010;&gt;         i++;&#010;&gt;       } else if ("-topN".equals(args[i])) {&#010;&gt;           topN = Integer.parseInt(args[i+1]);&#010;&gt;           i++;&#010;&gt;       } else if ("-solr".equals(args[i])) {&#010;&gt;         solrUrl = args[i + 1];&#010;&gt;         i++;&#010;&gt;       }* else if ("-numFetchers".equals(args[i])) {*&#010;&gt; *       numFetchers = Integer.parseInt(args[i+1]);*&#010;&gt; *       i++;*&#010;&gt; *      }* else if (args[i] != null) {&#010;&gt;         rootUrlDir = new Path(args[i]);&#010;&gt;       }&#010;&gt;     }&#010;&gt;&#010;&gt;     JobConf job = new NutchJob(getConf());&#010;&gt;&#010;&gt;    ..........................&#010;&gt;    ...................&#010;&gt; 
   .............&#010;&gt;&#010;&gt;     // initialize crawlDb&#010;&gt;     injector.inject(crawlDb, rootUrlDir);&#010;&gt;     int i;&#010;&gt;     for (i = 0; i &lt; depth; i++) {             // generate new segment&#010;&gt; *      Path[] segs = generator.generate(crawlDb, segments, numFetchers,&#010;&gt; topN, System*&#010;&gt; *          .currentTimeMillis());*&#010;&gt;       if (segs == null) {&#010;&gt;         LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");&#010;&gt;         break;&#010;&gt;       }&#010;&gt;       fetcher.fetch(segs[0], threads);  // fetch it&#010;&gt;       if (!Fetcher.isParsing(job)) {&#010;&gt;         parseSegment.parse(segs[0]);    // parse it, if needed&#010;&gt;       }&#010;&gt;       crawlDbTool.update(crawlDb, segs, true, true); // update crawldb&#010;&gt;     }&#010;&gt;&#010;&#010;The code change is correct.&#010;&#010;&#010;&gt; From Generator source, I think that the 'status' is empty, because the&#010;&gt; flow doesn't reach partitionSegments(....&#010;&gt;&#010;&gt;     FileStatus[] status = fs.listStatus(tempDir);&#010;&gt;     try {&#010;&gt;       for (FileStatus stat : status) {&#010;&gt;         Path subfetchlist = stat.getPath();&#010;&gt;         if (!subfetchlist.getName().startsWith("fetchlist-")) continue;&#010;&gt;         // start a new partition job for this segment&#010;&gt;         Path newSeg = partitionSegment(fs, segments, subfetchlist,&#010;&gt; numLists);&#010;&gt;         generatedSegments.add(newSeg);&#010;&gt;       }&#010;&gt;&#010;&#010;There is a catch block below this which logs a different message:&#010;&#010;    } catch (Exception e) {&#010;      LOG.warn("Generator: exception while partitioning segments, exiting ...");&#010;      fs.delete(tempDir, true);&#010;      return null;&#010;    }&#010;&#010;Had status been null, it would not have said "0 records ...." but would have given the&#010;message above instead. 
So, in my opinion, 'status' was not null.&#010;If it is empty, that can be verified by putting a log message inside the&#010;for loop. When I did that, I could see that some valid path was logged.&#010;&#010;&gt; What can be the cause for this behavior ?&#010;&gt;&#010;&#010;Can you set topN to a bigger value? (I believe that you have already&#010;verified that the urls are not getting filtered out by any filters.) Since you&#010;said that it was sporadic, I will retry it when I get free time.&#010;&#010;&gt; Thanks,&#010;&gt; Sourajit&#010;&gt;&#010;&#010;
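The idea of putting a log message inside the for loop can be sketched outside Hadoop as follows. This is only a stand-in: it takes plain names instead of FileStatus objects and prints to standard output instead of using LOG, and FetchlistProbe, countFetchlists and the sample names are hypothetical. In the real Generator the same message would go next to the startsWith("fetchlist-") test.

```java
public class FetchlistProbe {
    // Hypothetical helper: reports every entry it sees and counts those the
    // Generator loop would accept, so an empty or filtered listing is easy to spot.
    static int countFetchlists(String[] names) {
        int accepted = 0;
        for (String name : names) {
            boolean isFetchlist = name.startsWith("fetchlist-");
            System.out.println("Generator probe: saw " + name + ", fetchlist=" + isFetchlist);
            if (isFetchlist) {
                accepted++;
            }
        }
        System.out.println("Generator probe: accepted " + accepted + " of " + names.length + " entries");
        return accepted;
    }

    public static void main(String[] args) {
        // Sample names are made up for illustration only.
        countFetchlists(new String[] {"fetchlist-1", "_logs"});
    }
}
```

If the summary line reports zero entries, the listing itself was empty. If it reports entries but none accepted, the names did not start with fetchlist-.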
</pre>
</div>
</content>
</entry>
<entry>
<title>Status of Elasticsearch indexer?</title>
<author><name>Chris Hairfield &lt;chairfield@latitudegeo.com&gt;</name></author>
<link rel="alternate" href="http://mail-archives.apache.org/mod_mbox/nutch-user/201305.mbox/%3c432B340F85E62946BED2317E8E64505E3C19FECE@Neon.latitudegeo.com%3e"/>
<id>urn:uuid:%3c432B340F85E62946BED2317E8E64505E3C19FECE@Neon-latitudegeo-com%3e</id>
<updated>2013-05-17T19:55:31Z</updated>
<content type="xhtml">
<div xmlns="http://www.w3.org/1999/xhtml">
<pre>
Hello everyone,&#010;&#010;I've been eagerly awaiting some of the functionality slated for 2.x, especially around your&#010;work integrating with Elasticsearch. If possible, could you give any additional status on&#010;pluggable indexing (NUTCH-1568) and the nutch-elasticsearch-indexer (NUTCH-1527)?&#010;&#010;It's been a wonderful experience diving into Nutch for the last month and watching you guys&#010;do pretty awesome work. Now that I can finally say I no longer feel completely overwhelmed,&#010;I'd like to throw in my support for these items. Further, if there is work that still needs&#010;to be done, I might like to try helping out myself :)&#010;&#010;Thanks!&#010;Chris&#010;&#010;
</pre>
</div>
</content>
</entry>
</feed>
