nutch-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
Date Wed, 14 Feb 2007 20:03:38 GMT
It may fix the problem; it may not.  There have been many changes in 
hadoop since 0.4; I think they are now on 0.11.x.  If you are upgrading 
an existing dfs installation that already holds content, that is 
something to take into consideration.  That said, the changes in hadoop 
from 0.4 to the present may very well have fixed the error you are 
seeing.  To use the most recent version of hadoop with Nutch you will 
need to apply the NUTCH-437 patch.

Looking at your output below, though, my first thought is that the 
error comes from the PDF parser and not from hadoop.  Nutch uses the 
pdfbox library to parse PDF files, so you may want to take that specific 
file and see whether it parses correctly outside of Nutch using pdfbox.
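
Before running the file through pdfbox, a quick sanity check can rule out a 
missing, unreadable, or mislabeled file.  The sketch below is only an 
illustration: PdfHeaderCheck and its methods are hypothetical names, not part 
of Nutch or pdfbox, and it assumes the "%PDF-" marker sits at the very start 
of the file.  It does not replace an actual parse with pdfbox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Nutch or pdfbox.
class PdfHeaderCheck {
    // Assumption: the "%PDF-" marker is the first five bytes of the file.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes start with the "%PDF-" marker.
    static boolean hasPdfHeader(byte[] leading) {
        if (leading == null || leading.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(leading, MAGIC.length), MAGIC);
    }

    // Reads the first bytes of the file and checks them.
    // Returns false if the file is missing, unreadable, or too short.
    static boolean looksLikePdf(Path file) {
        if (!Files.isReadable(file)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[MAGIC.length];
            int n = 0;
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break; // end of file before the marker was complete
                }
                n += r;
            }
            return n == buf.length && hasPdfHeader(buf);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is treated as a path to check.
        for (String arg : args) {
            boolean ok = looksLikePdf(Paths.get(arg));
            System.out.println(arg + " -> "
                + (ok ? "PDF header found" : "no PDF header, or file unreadable"));
        }
    }
}
```

If the check passes, the file is at least reachable and labeled as a PDF, so 
a failure in pdfbox itself becomes the more likely explanation.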

Dennis Kubes

Armel T. Nene wrote:
> Dennis
> 
> I was wondering if this patch could fix my problem, which is, if not the
> same, very similar to this one. I am using Nutch 0.8.2-dev; I made a
> checkout from SVN a while ago but have never updated it since. I was able
> to crawl 10000 xml files before with no errors whatsoever. These are the
> errors that I get when I'm fetching:
> 
> INFO parser.custom: Custom-parse: Parsing content
> file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
> 07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of
> file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with:
> java.lang.NullPointerException
> 07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0
> pages/s, 0 kb/s, 
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: at
> org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: at
> org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: at
> org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
> 07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher
> caught:java.lang.NullPointerException
> 
> One problem is that my hadoop version is reported as
> hadoop-0.4.0-patched. I don't know whether that means I am running the
> 0.4.0 version, and I find it a little confusing. Once you can clarify that
> for me, I will be able to apply the patch to my version. 
> 
> Best Regards,
> 
> Armel
> 
> -----Original Message-----
> From: Dennis Kubes [mailto:nutch-dev@dragonflymc.com] 
> Sent: 13 February 2007 21:09
> To: nutch-dev@lucene.apache.org
> Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
> 
> Actually I take it back.  I don't think it is the same problem but I do 
> think it is the right solution.
> 
> Dennis Kubes
> 
> Dennis Kubes wrote:
>> This has to do with HADOOP-964.  Replace the jar files in your Nutch 
>> versions with the most recent versions from Hadoop.  You will also need 
>> to apply NUTCH-437 patch to get Nutch to work with the most recent 
>> changes to the Hadoop codebase.
>>
>> Dennis Kubes
>>
>> Gal Nitzan wrote:
>>> Hi,
>>>
>>> Does anybody use Nutch trunk?
>>>
>>> I am running nutch 0.9 and am unable to fetch.
>>>
>>> After 50-60K urls I get an NPE in
>>> org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time.
>>>
>>> I was wondering if anyone has a workaround, or whether something is 
>>> wrong with
>>> my setup.
>>>
>>> I have opened a new issue in jira
>>> http://issues.apache.org/jira/browse/hadoop-1008 for this.
>>>
>>> Any clue?
>>>
>>> Gal
>>>
>>>
> 
