crunch-dev mailing list archives

From "Marshall Bockrath-Vandegrift (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
Date Fri, 22 Nov 2013 16:22:35 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830086#comment-13830086 ]

Marshall Bockrath-Vandegrift commented on CRUNCH-209:
-----------------------------------------------------

Thanks for the response.  There are certainly options for workarounds, but I was hoping to
get to the bottom of the problem in the first place.  The structure of the error suggests
that something in the guts of Hadoop incorrectly serializes splits beyond a certain size.
If that's the case, I'd like to find the associated MAPREDUCE bug, or file one if it doesn't
yet exist.  Oh well -- more code-spelunking, it seems.
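The size-limit hypothesis can be illustrated with a toy sketch.  To be clear, this is not
Hadoop's actual split wire format; the record layout, offsets, and class name below are
purely illustrative.  The point is the mechanism: if a reader's position drifts into a
large base64 payload, the bytes it interprets as a length-prefixed class name are just an
arbitrary window of base64 text, much like the string in the stack trace.

```python
import base64
import io
import struct

# Toy record layout (illustrative only): a length-prefixed class name followed
# by a large base64 payload, similar in spirit to how crunch.inputs.dir carries
# serialized input metadata.
def write_record(buf, class_name, payload):
    name = class_name.encode("utf-8")
    buf.write(struct.pack(">h", len(name)))  # 2-byte big-endian name length
    buf.write(name)
    buf.write(payload)

buf = io.BytesIO()
payload = base64.b64encode(b"x" * 4096)
write_record(buf, "org.apache.crunch.impl.mr.run.CrunchInputSplit", payload)
data = buf.getvalue()

# Suppose the reader's offset is wrong -- e.g. because an earlier, oversized
# record was truncated at some limit.  The two bytes it interprets as the
# class-name length then land inside the base64 payload:
bad_offset = 100  # arbitrary position inside the payload
(name_len,) = struct.unpack_from(">h", data, bad_offset)

# ...and the "class name" it reads is a run of base64 characters, producing
# an error of the shape "Split class zdHJp... not found".
bogus_name = data[bad_offset + 2 : bad_offset + 2 + name_len].decode("ascii")
```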

> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
> ------------------------------------------------------------------------------------
>
>                 Key: CRUNCH-209
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-209
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.7.0
>
>         Attachments: CRUNCH-209.patch
>
>
> From John Jensen on the user mailing list:
> I have a curious problem when running a Crunch job on (Avro) files in a fairly large
> set of directories (just slightly fewer than 100).
> After running some fraction of the mappers, they start failing with the exception below.
> Things work fine with a smaller number of directories.
> The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has
> something to do with deserializing that value, but reading through the code I don't see
> any obvious way that could happen.
> Furthermore, the crunch.inputs.dir config entry is just under 1.5 MB, so it would not
> surprise me if I'm running up against a Hadoop limit somewhere.
> Stack trace:
> java.io.IOException: Split class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> 	at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.ClassNotFoundException: Class zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> 	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> 	at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> 	... 7 more
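The bogus "class name" in the trace above can be checked directly: it is an unaligned
slice of base64 text.  A short sketch shows this (the one-character shift that restores
the 4-character alignment was found by trial, since the fragment starts mid-stream);
the bytes decode to a slice of Avro-schema JSON, consistent with the reporter's guess
that the task is reading into the middle of the serialized crunch.inputs.dir value.

```python
import base64

# The "class name" from the stack trace, with the line-wrap break removed.
frag = "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI"

# The fragment is a mid-stream slice of a longer base64 string, so it is not
# aligned on a 4-character boundary.  Dropping the first character (offset
# found by trial) and padding restores alignment.
decoded = base64.b64decode(frag[1:] + "=")
# decoded == b'tring"},{"name":"value","type":"string"}]}},"default"'
```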



--
This message was sent by Atlassian JIRA
(v6.1#6144)
