crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: Processing splittable inputs
Date Fri, 26 Feb 2016 23:43:14 GMT
Yeah, I suspect the Source-property approach is the right thing here.
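
For reference, a minimal sketch of the two options Micah describes below, assuming a plain text input read with From.textFile. The class name, paths, and the 64 MB value are placeholders for illustration, not anything taken from Ben's job, and I haven't run this against his data:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Source;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.hadoop.conf.Configuration;

// Hypothetical driver class; the name and arguments are illustrative only.
public class SplitConfigSketch {
    public static void main(String[] args) {
        String inputPath = args[0];   // placeholder input location
        String outputPath = args[1];  // placeholder output location
        String maxSplit = "67108864"; // 64 MB, the value from the original question

        // Option 1: set the property on the Configuration handed to MRPipeline.
        // This applies to every source read by the pipeline.
        Configuration conf = new Configuration();
        // conf.set("mapreduce.input.fileinputformat.split.maxsize", maxSplit);
        Pipeline pipeline = new MRPipeline(SplitConfigSketch.class, conf);

        // Option 2: attach the property to this one source via Source.inputConf,
        // leaving any other sources in the pipeline untouched.
        Source<String> lines = From.textFile(inputPath)
            .inputConf("mapreduce.input.fileinputformat.split.maxsize", maxSplit);

        PCollection<String> input = pipeline.read(lines);
        pipeline.writeTextFile(input, outputPath);
        pipeline.done();
    }
}
```

Either way the property is set before the job is planned, which is presumably why setting it inside DoFn.configure never shows up in the job configuration.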

On Fri, Feb 26, 2016 at 3:37 PM, Micah Whitacre <mkwhit@gmail.com> wrote:

> Where are you trying to specify them?  Inside a DoFn?  Prior to
> constructing the MRPipeline?
>
> I'd suggest trying either:
> 1. Setting those values on the initial Configuration object you pass to the
> MRPipeline
> 2. Setting them as Source specific properties[1] on the source itself.
>
> The latter approach might be better if you are reading a lot of different
> sources into your pipeline and don't want to affect them all.
>
> [1] http://crunch.apache.org/apidocs/0.12.0/org/apache/crunch/Source.html#inputConf(java.lang.String,%20java.lang.String)
>
> On Fri, Feb 26, 2016 at 5:17 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
>
> > The data isn’t compressed.  The parameters aren’t showing up in the job
> > configuration either.
> >
> >
> > > On Feb 25, 2016, at 5:15 PM, Ben Juhn <benjijuhn@gmail.com> wrote:
> > >
> > > Hello there,
> > >
> > > I haven’t been able to get Crunch to split inputs into multiple
> > > mappers.  Currently it’s giving me one mapper per text file, even though
> > > they’re 1GB each.  I’ve tried supplying split.maxsize on the command line
> > > and in the DoFn implementation:
> > >
> > > @Override
> > > public void configure(Configuration conf) {
> > >     conf.set("crunch.combine.file.size", "67108864");
> > >     conf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864");
> > >     conf.set("mapreduce.input.fileinputformat.split.minsize", "67108864");
> > > }
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > > Ben
> > >
> >
> >
>
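
As a sanity check on what to expect once the settings do take effect, here is a rough sketch of the split arithmetic. It assumes the input format follows Hadoop's standard FileInputFormat rule (split size = max(minSize, min(maxSize, blockSize)), with a 1.1 slack factor on the last split) and a 128 MB block size; the block size and the class name are my assumptions, not details from the thread:

```java
// Hypothetical helper that mirrors the standard FileInputFormat split rule.
public class SplitEstimate {
    // Slack factor: the final split may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    // Split size is the block size clamped between minSize and maxSize.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits produced for a single splittable file.
    public static long countSplits(long fileLength, long splitSize) {
        long splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail, at most 1.1 * splitSize bytes
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 1L << 30;   // one 1 GB text file
        long blockSize = 128L << 20;  // assumed HDFS block size
        long size = 67108864L;        // min and max split size from the question
        long splitSize = computeSplitSize(blockSize, size, size);
        System.out.println(splitSize + " bytes per split, "
            + countSplits(fileLength, splitSize) + " splits");
    }
}
```

Under those assumptions each 1 GB file should yield 16 splits, so seeing a single mapper per file is a good sign the properties never reached the job.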
