crunch-dev mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
Date Thu, 23 May 2013 18:39:32 GMT
Yep, definitely looks like an improvement!

What was the actual cause of John's issue in the beginning? Is there a physical 
limit (or bug) in the serialization of Configuration values?

- Gabriel

On 23 May 2013, at 20:26, Josh Wills <jwills@cloudera.com> wrote:

> Glorious. That had been on my TODO list for a while, I'm glad we found a
> problem that forced me to fix it. ;-) Will commit to master. We should also
> probably consider a point release (0.6.1) with that fix, esp. due to the
> startup improvements.
> 
> J
> 
> 
> On Thu, May 23, 2013 at 11:00 AM, John Jensen <jensen@richrelevance.com> wrote:
> 
>> 
>> Thanks, Josh. That worked perfectly!
>> 
>> It has the added benefit of dramatically improving the startup time. I
>> assume because we're no longer copying the monstrous jobconfs around.
>> 
>> -- John
>> 
>> ________________________________________
>> From: Josh Wills (JIRA) [jira@apache.org]
>> Sent: Wednesday, May 22, 2013 5:27 PM
>> To: crunch-dev@incubator.apache.org
>> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
>> directory inputs will fail with odd inputsplit exceptions
>> 
>>     [https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>> 
>> Josh Wills updated CRUNCH-209:
>> ------------------------------
>> 
>>    Attachment: CRUNCH-209.patch
>> 
>> A hypothetical fix for John to test out.
>> 
>>> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
>>> ------------------------------------------------------------------------------------
>>> 
>>>                Key: CRUNCH-209
>>>                URL: https://issues.apache.org/jira/browse/CRUNCH-209
>>>            Project: Crunch
>>>         Issue Type: Bug
>>>         Components: Core
>>>   Affects Versions: 0.5.0, 0.6.0
>>>           Reporter: Josh Wills
>>>           Assignee: Josh Wills
>>>        Attachments: CRUNCH-209.patch
>>> 
>>> 
>>> From John Jensen on the user mailing list:
>>> I have a curious problem when running a crunch job on (avro) files in a fairly large set of directories (just slightly less than 100).
>>> After running some fraction of the mappers they start failing with the exception below. Things work fine with a smaller number of directories.
>>> The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI' string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has something to do with deserializing that value, but reading through the code I don't see any obvious way how.
>>> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it would not surprise me if I'm running up against a hadoop limit somewhere.
>>> Stack trace:
>>> java.io.IOException: Split class zdHJp
>>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
>>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>      at javax.security.auth.Subject.doAs(Subject.java:415)
>>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>>>      at org.apache.hadoop.mapred.Child.main(Child.java:262)
>>> Caused by: java.lang.ClassNotFoundException: Class zdHJp
>>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
>>>      at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
>>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
>>>      ... 7 more
>> 
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
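
Archive note on the "magic" string in John's report: it appears to be a slice cut from the middle of Base64 text, which would fit his observation that it also shows up inside the 'crunch.inputs.dir' config entry. The sketch below is only an illustration of how such a fragment can be inspected, not anything taken from Crunch or from the CRUNCH-209 patch. The class and method names (Base64FragmentDecoder, decodeFragment, isPrintableAscii) are made up for this note, and the "try each offset and keep the first printable result" rule is an assumed heuristic for finding the 4-character alignment.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for this note; not part of Crunch or Hadoop.
final class Base64FragmentDecoder {

    // True if every byte is printable ASCII (space through '~').
    static boolean isPrintableAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0x20 || b > 0x7e) {
                return false;
            }
        }
        return true;
    }

    // A fragment cut out of a longer Base64 stream can start anywhere inside
    // a 4-character group, so try dropping 0..3 leading characters and keep
    // the first decoding that is entirely printable ASCII (assumed heuristic).
    // Returns null if no offset yields printable text.
    static String decodeFragment(String fragment) {
        for (int skip = 0; skip < 4 && skip < fragment.length(); skip++) {
            try {
                // java.util.Base64 does not require '=' padding at the end.
                byte[] decoded = Base64.getDecoder().decode(fragment.substring(skip));
                if (isPrintableAscii(decoded)) {
                    return new String(decoded, StandardCharsets.US_ASCII);
                }
            } catch (IllegalArgumentException e) {
                // Not decodable at this offset; try the next one.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The string exactly as quoted in the report above.
        String magic =
            "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";
        System.out.println(decodeFragment(magic));
    }
}
```

With the string from the report, dropping the first character gives a clean decoding: tring"},{"name":"value","type":"string"}]}},"default". That reads like part of a JSON schema with "name" and "value" string fields, so the "split class" named in the exception looks like serialized input data rather than a class name. The note does not attempt to say why the task read that data where it expected a class name.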

