incubator-crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 16:18:32 GMT
Yep, I would love that.


On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <dave@paraliatech.com> wrote:

> Actually, while we're on the subject of small files and
> CombineFileInputFormat...
>
> I believe Hive has a feature whereby CombineFileInputFormat is used
> internally if it's required to read many small files to make the
> resulting mapreduce jobs more efficient. Would it be worth looking
> into whether Crunch could support this, too?
>
>
> On 13 February 2013 15:27, Dave Beech <dave@paraliatech.com> wrote:
> > thanks!
> >
> > On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> https://gist.github.com/viacoban/4945325
> >>
> >>
> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >>
> >>> A gist would be great - thanks very much
> >>>
> >>> Dave
> >>>
> >>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com>
> >>> wrote:
> >>> > Dave,
> >>> >
> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >>> >
> >>> > --victor
> >>> >
> >>> >
> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >>> >
> >>> >> Hi Victor,
> >>> >> Any chance you could share your implementation of a Source that reads
> >>> >> from multiple paths? I've wanted this for a while but haven't found
> >>> >> time to go ahead and write one myself!
> >>> >> Thanks,
> >>> >> Dave
> >>> >>
> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> > Thanks J
> >>> >> >
> >>> >> > I could not extend the FileSourceImpl since it works with only one
> >>> >> > input path, but I implemented the Source interface directly and it
> >>> >> > appears to do the job, thx for the pointer
> >>> >> >
> >>> >> > -- victor
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >
> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
> >>> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
> >>> >> >> one you're going to be using a lot, or if there is custom configuration
> >>> >> >> information required to use the InputFormat.
> >>> >> >>
> >>> >> >> J
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >>
> >>> >> >> > That's exactly what I have in the code that does not use the Crunch API:
> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
> >>> >> >> > CombineFileInputFormat<K, V> {
> >>> >> >> > ...
> >>> >> >> > }
> >>> >> >> >
> >>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >> >
> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
> >>> >> >> > >
> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >>> >> >> > >
> >>> >> >> > > J
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >> > >
> >>> >> >> > > > Thanks Josh,
> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have
> >>> >> >> > > > several hundred input files?
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >> > > >
> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >>> >> >> > > > >
> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >>> >> >> > > > > PTable<K, V> second = ...;
> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >>> >> >> > > > >
> >>> >> >> > > > > etc.
> >>> >> >> > > > >
> >>> >> >> > > > >
> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >> > > > >
> >>> >> >> > > > > > Is there any support in Crunch for using multiple sequence files
> >>> >> >> > > > > > as a pipeline source? Something similar to the standard MultipleInputs.
> >>> >> >> > > > > >
> >>> >> >> > > > > > Thanks,
> >>> >> >> > > > > > victor
> >>> >> >> > > > > >
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > >
> >>> >> >> >
> >>> >> >>
> >>> >>
> >>>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
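
A note on the small-files point raised in this thread: the idea behind
CombineFileInputFormat is to pack many small files into fewer input splits,
so that each map task gets a reasonable amount of data to read. Below is a
minimal sketch of that packing idea only, in plain Java with no Hadoop or
Crunch dependency. The class name SplitPacker, the method pack, and the size
limit of 100 are hypothetical names and values chosen for illustration.
Hadoop's own implementation also takes block boundaries and node/rack
locality into account, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative only: groups file sizes into "combined splits" of bounded size.
 * This is NOT Hadoop's CombineFileInputFormat algorithm; it is a simplified,
 * order-preserving greedy packing used to show why combining helps.
 */
class SplitPacker {

    /**
     * Packs files, in the order given, into splits whose total size stays at
     * or below maxSplitBytes. A single file larger than the limit is placed
     * in a split of its own (a simplification; it is not broken up here).
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Keep the last, partially filled split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 30L, 50L, 10L, 200L);
        List<List<Long>> splits = pack(sizes, 100L);
        // Prints: 5 files -> 3 splits: [[40, 30], [50, 10], [200]]
        System.out.println(sizes.size() + " files -> " + splits.size() + " splits: " + splits);
    }
}
```

Without combining, each of the five files would typically become at least one
split of its own. With several hundred small inputs, that per-file overhead is
what makes a combining input format worth considering.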
