crunch-dev mailing list archives

From Victor Iacoban <victor.iaco...@gmail.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 15:22:57 GMT
https://gist.github.com/viacoban/4945325


On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:

> A gist would be great - thanks very much
>
> Dave
>
> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com>
> wrote:
> > Dave,
> >
> > How do you want this, copy pasted code into a gist or a reusable jar?
> >
> > --victor
> >
> >
> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >
> >> Hi Victor,
> >> Any chance you could share your implementation of a Source that reads
> >> from multiple paths? I've wanted this for a while but haven't found
> >> time to go ahead and write one myself!
> >> Thanks,
> >> Dave
> >>
> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com>
> >> wrote:
> >> > Thanks J
> >> >
> >> > I could not extend the FileSourceImpl since it works with only one input
> >> > path, but I implemented the Source interface directly and it appears to do
> >> > the job, thx for the pointer
> >> >
> >> > -- victor
> >> >
> >> >
> >> >
> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >
> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
> >> >> going to be using a lot, or if there is custom configuration information
> >> >> required to use the InputFormat.
> >> >>
> >> >> J
> >> >>
> >> >>
> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >>
> >> >> > That's exactly what I have in the code not using Crunch API:
> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
> >> >> > CombineFileInputFormat<K, V> {
> >> >> > ...
> >> >> > }
> >> >> >
> >> >> > Are you saying there is a way to use my custom input format with Crunch?
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >
> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
> >> >> > >
> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >> >> > >
> >> >> > > J
> >> >> > >
> >> >> > >
> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> > >
> >> >> > > > Thanks Josh,
> >> >> > > > Is there any performance penalty in unions, assuming that I have several
> >> >> > > > hundreds of input files?
> >> >> > > >
> >> >> > > >
> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> > > >
> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >> >> > > > >
> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >> >> > > > > PTable<K, V> second = ...;
> >> >> > > > > PTable<K, V> union = first.union(second);
> >> >> > > > >
> >> >> > > > > etc.
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> > > > >
> >> >> > > > > > Is there any support in crunch to use multiple sequence files as
> >> >> > > > > > pipeline source?
> >> >> > > > > > something similar to standard MultipleInputs
> >> >> > > > > >
> >> >> > > > > > Thanks,
> >> >> > > > > > victor
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
>
