nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: HTTP error 400
Date Tue, 15 May 2012 11:05:36 GMT
Please follow the step-by-step tutorial, it's explained there:
http://wiki.apache.org/nutch/NutchTutorial
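
For context, a minimal sketch of the "separate crawl cycle commands" the tutorial describes, assuming a Nutch 1.4 local runtime and a Solr instance at http://localhost:8983/solr/ (the directory names and the three-round loop are illustrative, not an exact copy of the wiki script):

  #!/bin/sh
  # Sketch of one crawl cycle per round: generate -> fetch -> parse -> updatedb,
  # then invert links and push to Solr once at the end. There is no solrdedup
  # step here; deduplication is left to Solr (see further down this thread).
  SOLR=http://localhost:8983/solr/
  bin/nutch inject crawl/crawldb urls
  for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 100
    # pick the segment the generate step just created
    SEGMENT=crawl/segments/`ls crawl/segments | tail -1`
    bin/nutch fetch $SEGMENT
    bin/nutch parse $SEGMENT
    bin/nutch updatedb crawl/crawldb $SEGMENT
  done
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex $SOLR crawl/crawldb -linkdb crawl/linkdb crawl/segments/*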

On Tuesday 15 May 2012 13:40:26 Tolga wrote:
> I'm a little confused. How can I avoid the crawl command and run the
> separate crawl cycle commands instead?
> 
> Regards,
> 
> On 5/11/12 9:40 AM, Markus Jelsma wrote:
> > Ah, that means don't use the crawl command and do a little shell
> > scripting to execute the separate crawl cycle commands, see the nutch
> > wiki for examples. And don't do solrdedup. Search the Solr wiki for
> > deduplication.
> > 
> > cheers
> > 
> > On Fri, 11 May 2012 07:39:36 +0300, Tolga <tolga@ozses.net> wrote:
> >> Hi,
> >> 
> >> How exactly do I "omit solrdedup and use Solr's internal
> >> deduplication" instead? I don't even know what any of that means :D
> >> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
> >> -depth 3 -topN 100 to get the error. Do I have to use all the steps?
> >> 
> >> Regards,
> >> 
> >> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
> >>> thanks
> >>> 
> >>> This is a known issue:
> >>> https://issues.apache.org/jira/browse/NUTCH-1100
> >>> 
> >>> I have not been able to find the bug, nor do I know how to reproduce
> >>> it from scratch. If you have a public site with which we can
> >>> reproduce it, please comment on the Jira ticket. Make sure you use
> >>> the default config or close to it, a seed URL, and the exact crawl &
> >>> dedup steps to reproduce.
> >>> 
> >>> If you find it we might fix it. In any case we need to replace the
> >>> dedup command with a more scalable tool; the current one is not.
> >>> 
> >>> In the meantime you can omit solrdedup and use Solr's internal
> >>> deduplication instead; it works similarly and uses the same signature
> >>> algorithm as Nutch. Please consult the Solr wiki page on
> >>> deduplication.
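
For reference, Solr-side deduplication is configured in solrconfig.xml with an update processor chain. Below is a minimal sketch along the lines of the Solr wiki example, using TextProfileSignature, the same algorithm Nutch uses for near-duplicate signatures; the signatureField and fields values are illustrative and must exist in your schema:

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <!-- field that receives the computed signature; define it in schema.xml -->
      <str name="signatureField">signatureField</str>
      <!-- delete/overwrite documents that produce the same signature -->
      <bool name="overwriteDupes">true</bool>
      <!-- fields the signature is computed from -->
      <str name="fields">content</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

The chain then has to be attached to the /update request handler (update.chain=dedupe) so it runs on every add; the Solr wiki page on deduplication covers the details.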
> >>> 
> >>> Good luck
> >>> 
> >>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <tolga@ozses.net> wrote:
> >>>> Hi Markus,
> >>>> 
> >>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <tolga@ozses.net> wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> This will sound like a duplicate, but actually it differs from the
> >>>>>> other one. Please bear with me. Following
> >>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the
> >>>>>> command
> >>>>>> 
> >>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
> >>>>>> 
> >>>>>> Then when I got the message
> >>>>>> 
> >>>>>> Exception in thread "main" java.io.IOException: Job failed!
> >>>>>> 
> >>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
> >>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
> >>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
> >>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> >>>>> 
> >>>>> Please include the relevant part of the log. This can be a known
> >>>>> issue.
> >>>> 
> >>>> This is an excerpt from hadoop.log:
> >>>> 
> >>>> 2012-05-10 22:26:30,349 INFO  crawl.Crawl - crawl started in:
> >>>> crawl-20120510222629
> >>>> 2012-05-10 22:26:30,350 INFO  crawl.Crawl - rootUrlDir = urls
> >>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - threads = 10
> >>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - depth = 3
> >>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl -
> >>>> solrUrl=http://localhost:8983/solr/
> >>>> 2012-05-10 22:26:30,351 INFO  crawl.Crawl - topN = 100
> >>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: starting at
> >>>> 2012-05-10 22:26:30
> >>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: crawlDb:
> >>>> crawl-20120510222629/crawldb
> >>>> 2012-05-10 22:26:30,750 INFO  crawl.Injector - Injector: urlDir: urls
> >>>> 2012-05-10 22:26:30,809 INFO  crawl.Injector - Injector: Converting
> >>>> injected urls to crawl db entries.
> >>>> 2012-05-10 22:26:34,173 INFO  plugin.PluginRepository - Plugins:
> >>>> looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Plugin
> >>>> Auto-activation mode: [true]
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered
> >>>> Plugins:
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     the nutch
> >>>> core extension points (nutch-extensionpoints)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Basic URL
> >>>> Normalizer (urlnormalizer-basic)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Html
> >>>> Parse Plug-in (parse-html)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Basic
> >>>> Indexing Filter (index-basic)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     HTTP
> >>>> Framework (lib-http)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -
> >>>> Pass-through URL Normalizer (urlnormalizer-pass)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
> >>>> Filter (urlfilter-regex)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Http
> >>>> Protocol Plug-in (protocol-http)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
> >>>> Normalizer (urlnormalizer-regex)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Tika
> >>>> Parser Plug-in (parse-tika)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     OPIC
> >>>> Scoring Plug-in (scoring-opic)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     CyberNeko
> >>>> HTML Parser (lib-nekohtml)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Anchor
> >>>> Indexing Filter (index-anchor)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository -     Regex URL
> >>>> Filter Framework (lib-regex-filter)
> >>>> 2012-05-10 22:26:34,962 INFO  plugin.PluginRepository - Registered
> >>>> Extension-Points:
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch URL
> >>>> Normalizer (org.apache.nutch.net.URLNormalizer)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
> >>>> Protocol (org.apache.nutch.protocol.Protocol)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
> >>>> Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch URL
> >>>> Filter (org.apache.nutch.net.URLFilter)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
> >>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     HTML
> >>>> Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
> >>>> Content Parser (org.apache.nutch.parse.Parser)
> >>>> 2012-05-10 22:26:34,963 INFO  plugin.PluginRepository -     Nutch
> >>>> Scoring (org.apache.nutch.scoring.ScoringFilter)
> >>>> 2012-05-10 22:26:35,439 INFO  regex.RegexURLNormalizer - can't find
> >>>> rules for scope 'inject', using default
> >>>> 2012-05-10 22:26:36,434 INFO  crawl.Injector - Injector: Merging
> >>>> injected urls into crawl db.
> >>>> 2012-05-10 22:26:36,710 WARN  util.NativeCodeLoader - Unable to load
> >>>> native-hadoop library for your platform... using builtin-java classes
> >>>> where applicable
> >>>> 2012-05-10 22:26:37,542 INFO  crawl.Injector - Injector: finished at
> >>>> 2012-05-10 22:26:37, elapsed: 00:00:06
> >>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: starting
> >>>> at 2012-05-10 22:26:37
> >>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: Selecting
> >>>> best-scoring urls due for fetch.
> >>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator:
> >>>> filtering: true
> >>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator:
> >>>> normalizing: true
> >>>> 2012-05-10 22:26:37,551 INFO  crawl.Generator - Generator: topN: 100
> >>>> 2012-05-10 22:26:37,552 INFO  crawl.Generator - Generator: jobtracker
> >>>> is 'local', generating exactly one partition.
> >>>> 2012-05-10 22:26:37,820 INFO  crawl.FetchScheduleFactory - Using
> >>>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>>> 2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
> >>>> defaultInterval=2592000
> >>>> 2012-05-10 22:26:37,820 INFO  crawl.AbstractFetchSchedule -
> >>>> maxInterval=7776000
> >>>> 2012-05-10 22:26:37,856 INFO  regex.RegexURLNormalizer - can't find
> >>>> rules for scope 'partition', using default
> >>>> ...
> >>>> ...
> >>>> INFO: [] webapp=/solr path=/update params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2} status=0 QTime=221
> >>>> 2012-05-10 22:36:26,336 INFO  solr.SolrIndexer - SolrIndexer:
> >>>> finished at 2012-05-10 22:36:26, elapsed: 00:00:05
> >>>> 2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
> >>>> SolrDeleteDuplicates: starting at 2012-05-10 22:36:26
> >>>> 2012-05-10 22:36:26,339 INFO  solr.SolrDeleteDuplicates -
> >>>> SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=74
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=220 status=0 QTime=0
> >>>> May 10, 2012 10:36:27 PM org.apache.solr.core.SolrCore execute
> >>>> INFO: [] webapp=/solr path=/select params={fl=id,boost,tstamp,digest&start=0&q=id:[*+TO+*]&wt=javabin&rows=220&version=2} hits=220 status=0 QTime=9
> >>>> 2012-05-10 22:36:27,656 WARN  mapred.LocalJobRunner - job_local_0020
> >>>> java.lang.NullPointerException
> >>>> 
> >>>>     at org.apache.hadoop.io.Text.encode(Text.java:388)
> >>>>     at org.apache.hadoop.io.Text.set(Text.java:178)
> >>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
> >>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
> >>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>>>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>>> 
> >>>>>> I issued the commands
> >>>>>> 
> >>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
> >>>>>> 
> >>>>>> and
> >>>>>> 
> >>>>>> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb
> >>>>>> crawldb/linkdb crawldb/segments/*
> >>>>>> 
> >>>>>> separately, after which I got no errors. When I browsed to
> >>>>>> http://localhost:8983/solr/admin and attempted a search, I got the error
> >>>>>> 
> >>>>>>    HTTP ERROR 400
> >>>>>> 
> >>>>>> Problem accessing /solr/select. Reason:
> >>>>>>     undefined field text
> >>>>> 
> >>>>> But this is a Solr issue: you have no field named text. Resolve
> >>>>> this in Solr or ask on the Solr mailing list.
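
A guess at what is going on, with a sketch of two possible fixes: the admin search form queries Solr's default search field, which the stock Solr example schema calls "text", while the schema.xml used for Nutch may not define such a field. In schema.xml you could either point the default at an existing Nutch field such as content, or define a catch-all text field (the field and type names below assume the Nutch 1.4 example schema):

  <!-- Option 1: make an existing Nutch field the default search field. -->
  <defaultSearchField>content</defaultSearchField>

  <!-- Option 2: add a catch-all "text" field and copy content into it. -->
  <field name="text" type="text" stored="false" indexed="true" multiValued="true"/>
  <copyField source="content" dest="text"/>
  <copyField source="title" dest="text"/>

Alternatively, query an explicit field from the admin form, e.g. content:foo instead of foo.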
> >>>>> 
> >>>>>> 
> >>>>>> Powered by Jetty://
> >>>>>> 
> >>>>>> What am I doing wrong?
> >>>>>> 
> >>>>>> Regards,
> >>>> 
> >>>> Regards,
-- 
Markus Jelsma - CTO - Openindex

