crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: multiple input files as pipeline source?
Date Tue, 12 Feb 2013 23:08:54 GMT
Ah, gotcha-- good to know.


On Tue, Feb 12, 2013 at 3:07 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:

> Thanks J
>
> I could not extend the FileSourceImpl since it works with only one input path,
> but I implemented the Source interface directly and it appears to do the
> job, thx for the pointer
>
> -- victor
>
>
>
> On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
> > Yep-- check out the formattedFile function in o.a.c.io.From. You can also
> > write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
> > going to be using a lot, or if there is custom configuration information
> > required to use the InputFormat.
> >
> > J
> >
> >
> > On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >
> > > That's exactly what I have in my code that doesn't use the Crunch API:
> > > public class MultiSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
> > > ...
> > > }
> > >
> > > Are you saying there is a way to use my custom input format with Crunch?
> > >
> > >
> > >
> > > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > >
> > > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> > > > can be worthwhile to have a CombineFileInputFormat, à la
> > > >
> > > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> > > >
> > > > J
> > > >
> > > >
> > > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> > > >
> > > > > Thanks Josh,
> > > > > Is there any performance penalty in unions, assuming that I have
> > > > > several hundred input files?
> > > > >
> > > > >
> > > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > > > >
> > > > > > Yeah, of course-- that's how stuff like joins work.
> > > > > >
> > > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> > > > > > PTable<K, V> second = ...;
> > > > > > PTable<K, V> union = first.union(second);
> > > > > >
> > > > > > etc.
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> > > > > >
> > > > > > > Is there any support in Crunch for using multiple sequence files as a
> > > > > > > pipeline source?
> > > > > > > Something similar to the standard MultipleInputs?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > victor
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
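
For several hundred inputs, the pairwise union quoted above can be applied in a loop. What follows is only a minimal sketch of that fold in plain Java, not Crunch code: UnionAll, unionAll, and concat are hypothetical names, and concat on plain lists stands in for a real union call so the example runs without Hadoop or Crunch on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class UnionAll {
    // Folds any number of per-file collections into one by repeated
    // pairwise union, left to right. The union operation is supplied by
    // the caller, so the helper itself knows nothing about Crunch.
    static <T> T unionAll(List<T> parts, BinaryOperator<T> union) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("need at least one input");
        }
        T result = parts.get(0);
        for (T next : parts.subList(1, parts.size())) {
            result = union.apply(result, next);
        }
        return result;
    }

    // Hypothetical stand-in for a table union: concatenates two lists
    // into a new list without modifying either argument.
    static List<String> concat(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Each inner list plays the role of the records read from one file.
        List<List<String>> perFile = Arrays.asList(
                Arrays.asList("a1", "a2"),
                Arrays.asList("b1"),
                Arrays.asList("c1"));
        System.out.println(unionAll(perFile, UnionAll::concat));
        // prints [a1, a2, b1, c1]
    }
}
```

With Crunch, the second argument would instead wrap a call like first.union(second) on the tables read from each file. Whether that or a CombineFileInputFormat-based source is the better fit depends on file sizes, as discussed in the thread.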



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>