crunch-dev mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
Date Thu, 23 May 2013 19:57:05 GMT
There is a limit in MR1, mapred.user.jobconf.limit (see
http://hadoop.apache.org/docs/stable/mapred-default.html), which caps the
serialized jobconf at 5 MB (but this is enforced at the JT level). I am not
aware of any serialization-time limits and think there are none, as I've
seen Hive use the same code to write enormously large files.

Worth noting that MR2, owing to its service-less architecture, has no such
limit on jobconf size, and the property isn't present in it anymore.
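For reference, the JT-level check amounts to measuring the serialized size
of the job configuration against mapred.user.jobconf.limit (5 MB by
default). A minimal self-contained sketch of that idea, with plain
java.util.Properties standing in for Hadoop's Configuration (the class and
method names below are made up for illustration, not Hadoop's actual API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class JobConfSizeCheck {
    // Default value of mapred.user.jobconf.limit in MR1: 5 MB.
    static final long MAX_JOBCONF_SIZE = 5L * 1024 * 1024;

    // Serialize the properties to XML (roughly what Configuration.writeXml
    // does) and return the size of the result in bytes.
    static long serializedSize(Properties props) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.storeToXML(out, null, "UTF-8");
        return out.size();
    }

    // Mimic the JobTracker-side check: reject configs over the limit.
    static void checkJobConfSize(Properties props) throws IOException {
        long size = serializedSize(props);
        if (size > MAX_JOBCONF_SIZE) {
            throw new IOException("Exceeded max jobconf size: " + size
                    + " limit: " + MAX_JOBCONF_SIZE);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties small = new Properties();
        small.setProperty("crunch.inputs.dir", "/data/input");
        checkJobConfSize(small);  // small config passes the check

        // One ~6 MB value, e.g. a huge serialized input spec, trips it.
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 6 * 1024 * 1024; i++) {
            big.append('x');
        }
        Properties huge = new Properties();
        huge.setProperty("crunch.inputs.dir", big.toString());
        try {
            checkJobConfSize(huge);
        } catch (IOException e) {
            System.out.println("rejected oversize jobconf");
        }
    }
}
```

The point being: the limit applies to the total serialized config, so a
single very large value like crunch.inputs.dir can push a job over it even
when every other setting is ordinary.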


On Fri, May 24, 2013 at 12:13 AM, Josh Wills <josh.wills@gmail.com> wrote:

> On Thu, May 23, 2013 at 11:39 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:
>
> > Yep, definitely looks like an improvement!
> >
> > What was the actual cause of John's issue in the beginning? Is there a
> > physical limit (or bug) in the serialization of Configuration values?
> >
>
> It seems like there must be, although I couldn't figure out where it was
> happening exactly, and Googling around for limits about jobconf
> serialization didn't turn up anything, either.
>
>
> >
> > - Gabriel
> >
> > On 23 May 2013, at 20:26, Josh Wills <jwills@cloudera.com> wrote:
> >
> > > Glorious. That had been on my TODO list for a while, I'm glad we found
> > > a problem that forced me to fix it. ;-) Will commit to master. We
> > > should also probably consider a point release (0.6.1) with that fix,
> > > esp. due to the startup improvements.
> > >
> > > J
> > >
> > >
> > > On Thu, May 23, 2013 at 11:00 AM, John Jensen <jensen@richrelevance.com> wrote:
> > >
> > >>
> > >> Thanks, Josh. That worked perfectly!
> > >>
> > >> It has the added benefit of dramatically improving the startup time.
> > >> I assume that's because we're no longer copying the monstrous
> > >> jobconfs around.
> > >>
> > >> -- John
> > >>
> > >> ________________________________________
> > >> From: Josh Wills (JIRA) [jira@apache.org]
> > >> Sent: Wednesday, May 22, 2013 5:27 PM
> > >> To: crunch-dev@incubator.apache.org
> > >> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
> > >> directory inputs will fail with odd inputsplit exceptions
> > >>
> > >>     [
> > >>
> >
> https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > ]
> > >>
> > >> Josh Wills updated CRUNCH-209:
> > >> ------------------------------
> > >>
> > >>    Attachment: CRUNCH-209.patch
> > >>
> > >> A hypothetical fix for John to test out.
> > >>
> > >>> Jobs with large numbers of directory inputs will fail with odd
> > >>> inputsplit exceptions
> > >>> ------------------------------------------------------------------------------------
> > >>>
> > >>>                Key: CRUNCH-209
> > >>>                URL: https://issues.apache.org/jira/browse/CRUNCH-209
> > >>>            Project: Crunch
> > >>>         Issue Type: Bug
> > >>>         Components: Core
> > >>>   Affects Versions: 0.5.0, 0.6.0
> > >>>           Reporter: Josh Wills
> > >>>           Assignee: Josh Wills
> > >>>        Attachments: CRUNCH-209.patch
> > >>>
> > >>>
> > >>> From John Jensen on the user mailing list:
> > >>> I have a curious problem when running a crunch job on (avro) files
> > >>> in a fairly large set of directories (just slightly less than 100).
> > >>> After running some fraction of the mappers they start failing with
> > >>> the exception below. Things work fine with a smaller number of
> > >>> directories.
> > >>> The magic
> > >>> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> > >>> string shows up in the 'crunch.inputs.dir' entry in the job config,
> > >>> so I assume it has something to do with deserializing that value,
> > >>> but reading through the code I don't see any obvious way how.
> > >>> Furthermore, the crunch.inputs.dir config entry is just under 1.5M,
> > >>> so it would not surprise me if I'm running up against a hadoop limit
> > >>> somewhere.
> > >>> Stack trace:
> > >>> java.io.IOException: Split class
> > >>> zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI
> > >>> not found
> > >>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> > >>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> > >>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> > >>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >>>      at java.security.AccessController.doPrivileged(Native Method)
> > >>>      at javax.security.auth.Subject.doAs(Subject.java:415)
> > >>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > >>>      at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > >>> Caused by: java.lang.ClassNotFoundException: Class
> > >>> zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI
> > >>> not found
> > >>>      at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> > >>>      at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> > >>>      ... 7 more
> > >>
> > >> --
> > >> This message is automatically generated by JIRA.
> > >> If you think it was sent incorrectly, please contact your JIRA
> > >> administrators.
> > >> For more information on JIRA, see: http://www.atlassian.com/software/jira
> > >>
> > >
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera <http://www.cloudera.com>
> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
> >
>
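An aside on the symptom in John's stack trace: the "class name" in the
ClassNotFoundException is a base64-looking slice of the crunch.inputs.dir
value, which is what you'd expect if the reader of the serialized split
header landed at the wrong offset and interpreted a piece of the payload as
the length-prefixed class-name field. This is speculation about the actual
bug, but the failure mode itself is easy to reproduce in miniature with
plain DataInput/DataOutput (the class name below is hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class SplitMisreadDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical split class name; any ASCII name works here.
        String realClass = "org.apache.crunch.impl.mr.run.CrunchInputSplit";
        String payload =
            "zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI";

        // Serialize a split record as [len][class name][len][split data],
        // the general shape of a length-prefixed header.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(realClass);  // 2-byte length prefix, then the name
        out.writeUTF(payload);    // 2-byte length prefix, then the data

        // Correct read at offset 0 recovers the real class name.
        DataInputStream ok = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(ok.readUTF());

        // Misread: an offset error equal to the header size lands the
        // reader on the payload's length prefix, so the payload itself
        // comes back as the "class name" -- the base64-looking
        // ClassNotFoundException from the stack trace.
        DataInputStream bad = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        bad.skipBytes(2 + realClass.length());  // skip past the real header
        System.out.println("Split class " + bad.readUTF() + " not found");
    }
}
```

Whatever the real defect was, the symptom is consistent with this kind of
offset mismatch rather than with any hard serialization limit.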



-- 
Harsh J
