crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
Date Fri, 22 Nov 2013 15:50:36 GMT


Josh Wills commented on CRUNCH-209:

Hmm-- this was a while ago, and my initial fix was a hypothesis that just happened to work.
My hypothesis was that we were bumping up against the limits of the key size for an entry
in the job.xml file, so the fix was to shrink the size of our entries by serializing a
lot less data. My guess would be that you are running into a similar issue with Cascading
(and that Crunch would hit it again for a job that had enough input directories).

I don't know a ton about how Cascading works, but if this situation was happening to you with
Crunch, my recommendation would be to split up your input directories into different sources
(pipes?) and then union those sources together. Crunch breaks up the directories for different
sources into different keys in the job.xml file, so that would be a slightly hacky way of
getting around the key size limits.
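To make the principle behind that workaround concrete, here is a small self-contained Java sketch: instead of packing every input directory into one oversized job.xml entry, spread them across one entry per source, mimicking how Crunch gives each source its own key. The `MAX_ENTRY` budget, key names, and paths below are hypothetical illustrations, not actual Hadoop or Crunch constants.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitInputsDemo {
    // Hypothetical per-entry size budget; the real limit (if any) depends
    // on the Hadoop version and XML parser configuration.
    static final int MAX_ENTRY = 100;

    // Pack each source's directories under its own key, so no single
    // entry carries the serialized form of every input directory.
    static Map<String, String> packPerSource(List<List<String>> sources) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < sources.size(); i++) {
            conf.put("crunch.inputs.dir." + i, String.join(",", sources.get(i)));
        }
        return conf;
    }

    public static void main(String[] args) {
        // 100 input directories, split into 10 sources of 10 each.
        List<List<String>> sources = new ArrayList<>();
        List<String> all = new ArrayList<>();
        for (int s = 0; s < 10; s++) {
            List<String> dirs = new ArrayList<>();
            for (int d = 0; d < 10; d++) dirs.add("/in/" + s + "/" + d);
            sources.add(dirs);
            all.addAll(dirs);
        }
        // One giant entry blows past the budget; per-source entries stay under it.
        String oneEntry = String.join(",", all);
        System.out.println("single entry length: " + oneEntry.length());
        for (Map.Entry<String, String> e : packPerSource(sources).entrySet()) {
            System.out.println(e.getKey() + " length: " + e.getValue().length()
                    + (e.getValue().length() <= MAX_ENTRY ? " (ok)" : " (over)"));
        }
    }
}
```

In Crunch terms, each per-source key would correspond to a separate source that you then `union()` back into one PCollection before the rest of the pipeline.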

> Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
> ------------------------------------------------------------------------------------
>                 Key: CRUNCH-209
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>             Fix For: 0.7.0
>         Attachments: CRUNCH-209.patch
> From John Jensen on the user mailing list:
> I have a curious problem when running a crunch job on (avro) files in a fairly large
> set of directories (just slightly less than 100).
> After running some fraction of the mappers they start failing with the exception below.
> Things work fine with a smaller number of directories.
> The magic 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> string shows up in the 'crunch.inputs.dir' entry in the job config, so I assume it has
> something to do with deserializing that value, but reading through the code I don't see
> any obvious way how.
> Furthermore, the crunch.inputs.dir config entry is just under 1.5M, so it would not
> surprise me if I'm running up against a hadoop limit somewhere.
> Stack trace:
> Split class zdHJp
> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> 	at org.apache.hadoop.mapred.MapTask.getSplitDetails(
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(
> 	at
> 	at org.apache.hadoop.mapred.Child$
> 	at Method)
> 	at
> 	at
> 	at org.apache.hadoop.mapred.Child.main(
> Caused by: java.lang.ClassNotFoundException: Class zdHJp
> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI not found
> 	at org.apache.hadoop.conf.Configuration.getClassByName(
> 	at org.apache.hadoop.mapred.MapTask.getSplitDetails(
> 	... 7 more
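The truncation hypothesis above can be illustrated with a tiny self-contained sketch: if an oversized serialized entry loses its field separator, a downstream parser ends up treating a slice of base64 schema text as a class name, which is exactly the shape of the ClassNotFoundException in the trace. The record layout, class name, and truncation point below are all invented for illustration; Hadoop's actual split metadata format is binary and different.

```java
public class TruncationDemo {
    // Invented record layout for illustration: "<className>\t<payload>".
    // This is NOT Hadoop's real split-file format.
    static String readClassName(String record) {
        int tab = record.indexOf('\t');
        // If truncation removed the separator, whatever is left over gets
        // misread as the class name, producing a bogus "class not found".
        return tab >= 0 ? record.substring(0, tab) : record;
    }

    public static void main(String[] args) {
        String schemaText = "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSI"; // base64-ish schema fragment
        String record = "com.example.MySplit\t" + schemaText;   // hypothetical split class
        System.out.println(readClassName(record));              // the intended class name
        // Cut the record so the class name and separator are gone: the
        // remaining schema bytes are misread as a "class name".
        System.out.println(readClassName(record.substring(25)));
    }
}
```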

This message was sent by Atlassian JIRA
