Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Received-SPF: pass (nike.apache.org: domain of harsh@cloudera.com designates
 209.85.223.170 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CANb5z2J8Y1gkRGYe9hZZv6K8GaCcbP5tqn5bVt=DPJWF=4b60g@mail.gmail.com>
References: <JIRA.12649005.1369268758436@arcas>
 <JIRA.12649005.1369268758436.8007.1369268841039@arcas>
 <77F0498D3B83AB409C64E0CEE52178410445CDF5@mbx025-e1-nj-6.exch025.domain.local>
 <CAH29n6Os-eXRMeyYC4u8aOmmnC8FSo-Xqi7x56W+OxG1S_U5yg@mail.gmail.com>
 <3955725B-C06F-4288-90B0-F7815F2A8DE1@gmail.com>
 <CANb5z2J8Y1gkRGYe9hZZv6K8GaCcbP5tqn5bVt=DPJWF=4b60g@mail.gmail.com>
From: Harsh J <harsh@cloudera.com>
Date: Fri, 24 May 2013 01:27:05 +0530
Message-ID: 
 <CAOcnVr0XLa3OHV+tcNHo_DMMyr-f6rgesxYp3K3pj7wWNwgu5A@mail.gmail.com>
Subject: Re: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
 directory inputs will fail with odd inputsplit exceptions
To: dev@crunch.apache.org
Content-Type: multipart/alternative; boundary=20cf301d425a6be96e04dd681731

--20cf301d425a6be96e04dd681731
Content-Type: text/plain; charset=ISO-8859-1

MR1 does have a limit, mapred.user.jobconf.limit (see
http://hadoop.apache.org/docs/stable/mapred-default.html), which caps the
jobconf at 5 MB, but it is enforced at the JobTracker (JT) level. I am not
aware of any serialization-time limits, and I doubt there are any, since
I've seen Hive use the same code to write enormous files.

It's worth noting that MR2, in keeping with its service-less architecture,
has no such limit on jobconf size, and the property no longer exists there.
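
For anyone who wants to sanity-check a job against the MR1 limit, here is a
rough, self-contained sketch. It is not Hadoop code: the class and method
names are made up for illustration, and it only sums the UTF-8 bytes of keys
and values, so it gives a lower bound on the size of the serialized job.xml
(the XML markup adds more). The 5 MB constant is the documented default of
mapred.user.jobconf.limit; a cluster may override it.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Crunch.
public class JobConfSizeCheck {
    // Assumed default of mapred.user.jobconf.limit in MR1 (5 MB).
    static final long DEFAULT_JOBCONF_LIMIT_BYTES = 5L * 1024 * 1024;

    // Lower bound on the serialized size: UTF-8 bytes of keys and values
    // only, ignoring the XML markup that the real job.xml would add.
    static long approxSizeBytes(Map<String, String> props) {
        long total = 0;
        for (Map.Entry<String, String> e : props.entrySet()) {
            total += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            total += e.getValue().getBytes(StandardCharsets.UTF_8).length;
        }
        return total;
    }

    // True if even the lower-bound estimate is already over the limit.
    static boolean exceedsLimit(Map<String, String> props, long limitBytes) {
        return approxSizeBytes(props) > limitBytes;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Stand-in for a roughly 1.5 MB crunch.inputs.dir value, like the
        // one John reported; the content here is just filler.
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 1_500_000; i++) {
            big.append('x');
        }
        props.put("crunch.inputs.dir", big.toString());

        System.out.println("approx bytes: " + approxSizeBytes(props));
        System.out.println("exceeds 5 MB: "
                + exceedsLimit(props, DEFAULT_JOBCONF_LIMIT_BYTES));
    }
}
```

With a single entry of about 1.5 MB the estimate stays well under 5 MB, which
fits with the JT-level limit not being the cause of John's failure.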


On Fri, May 24, 2013 at 12:13 AM, Josh Wills <josh.wills@gmail.com> wrote:

> On Thu, May 23, 2013 at 11:39 AM, Gabriel Reid <gabriel.reid@gmail.com
> >wrote:
>
> > Yep, definitely looks like an improvement!
> >
> > What was the actual cause of John's issue in the beginning? Is there a
> > physical
> > limit (or bug) in the serialization of Configuration values?
> >
>
> It seems like there must be, although I couldn't figure out where it was
> happening exactly, and Googling around for limits about jobconf
> serialization didn't turn up anything, either.
>
>
> >
> > - Gabriel
> >
> > On 23 May 2013, at 20:26, Josh Wills <jwills@cloudera.com> wrote:
> >
> > > Glorious. That had been on my TODO list for awhile, I'm glad we found a
> > > problem that forced me to fix it. ;-) Will commit to master. We should
> > also
> > > probably consider a point release (0.6.1) with that fix, esp. due to
> the
> > > startup improvements.
> > >
> > > J
> > >
> > >
> > > On Thu, May 23, 2013 at 11:00 AM, John Jensen <
> jensen@richrelevance.com
> > >wrote:
> > >
> > >>
> > >> Thanks, Josh. That worked perfectly!
> > >>
> > >> It has the added benefit of dramatically improving the startup time. I
> > >> assume because we're no longer copying the monstrous jobconfs around.
> > >>
> > >> -- John
> > >>
> > >> ________________________________________
> > >> From: Josh Wills (JIRA) [jira@apache.org]
> > >> Sent: Wednesday, May 22, 2013 5:27 PM
> > >> To: crunch-dev@incubator.apache.org
> > >> Subject: [jira] [Updated] (CRUNCH-209) Jobs with large numbers of
> > >> directory inputs will fail with odd inputsplit exceptions
> > >>
> > >>     [
> > >>
> >
> https://issues.apache.org/jira/browse/CRUNCH-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > ]
> > >>
> > >> Josh Wills updated CRUNCH-209:
> > >> ------------------------------
> > >>
> > >>    Attachment: CRUNCH-209.patch
> > >>
> > >> A hypothetical fix for John to test out.
> > >>
> > >>> Jobs with large numbers of directory inputs will fail with odd
> > >> inputsplit exceptions
> > >>>
> > >>
> >
> ------------------------------------------------------------------------------------
> > >>>
> > >>>                Key: CRUNCH-209
> > >>>                URL: https://issues.apache.org/jira/browse/CRUNCH-209
> > >>>            Project: Crunch
> > >>>         Issue Type: Bug
> > >>>         Components: Core
> > >>>   Affects Versions: 0.5.0, 0.6.0
> > >>>           Reporter: Josh Wills
> > >>>           Assignee: Josh Wills
> > >>>        Attachments: CRUNCH-209.patch
> > >>>
> > >>>
> > >>> From John Jensen on the user mailing list:
> > >>> I have a curious problem when running a crunch job on (avro) files
> in a
> > >> fairly large set of directories (just slightly less than 100).
> > >>> After running some fraction of the mappers they start failing with
> the
> > >> exception below. Things work fine with a smaller number of
> directories.
> > >>> The magic
> > >>
> >
> 'zdHJpbmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI'
> > >> string shows up in the 'crunch.inputs.dir' entry in the job config,
> so I
> > >> assume it has something to do with deserializing that value, but
> reading
> > >> through the code I don't see any obvious way how.
> > >>> Furthermore, the crunch.inputs.dir config entry is just under 1.5M,
> so
> > >> it would not surprise me if I'm running up against a hadoop limit
> > somewhere.
> > >>> Stack trace:
> > >>> java.io.IOException: Split class zdHJp
> > >>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI
> not
> > >> found
> > >>>      at
> > >> org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:342)
> > >>>      at
> org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:614)
> > >>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
> > >>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
> > >>>      at java.security.AccessController.doPrivileged(Native Method)
> > >>>      at javax.security.auth.Subject.doAs(Subject.java:415)
> > >>>      at
> > >>
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> > >>>      at org.apache.hadoop.mapred.Child.main(Child.java:262)
> > >>> Caused by: java.lang.ClassNotFoundException: Class zdHJp
> > >>> bmcifSx7Im5hbWUiOiJ2YWx1ZSIsInR5cGUiOiJzdHJpbmcifV19fSwiZGVmYXVsdCI
> not
> > >> found
> > >>>      at
> > >>
> >
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
> > >>>      at
> > >> org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:340)
> > >>>      ... 7 more
> > >>
> > >> --
> > >> This message is automatically generated by JIRA.
> > >> If you think it was sent incorrectly, please contact your JIRA
> > >> administrators
> > >> For more information on JIRA, see:
> > http://www.atlassian.com/software/jira
> > >>
> > >
> > >
> > >
> > > --
> > > Director of Data Science
> > > Cloudera <http://www.cloudera.com>
> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
> >
>


-- 
Harsh J

--20cf301d425a6be96e04dd681731--