crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 14:59:35 GMT
A gist would be great - thanks very much

Dave

On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> Dave,
>
> How do you want this, copy pasted code into a gist or a reusable jar?
>
> --victor
>
>
> On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>
>> Hi Victor,
>> Any chance you could share your implementation of a Source that reads
>> from multiple paths? I've wanted this for a while but haven't found
>> time to go ahead and write one myself!
>> Thanks,
>> Dave
>>
>> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com>
>> wrote:
>> > Thanks J
>> >
>> > I could not extend the FileSourceImpl since it works with only one input
>> > path,
>> > but I implemented the Source interface directly and it appears to do the
>> > job, thx for the pointer
>> >
>> > -- victor
>> >
>> >
>> >
>> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >
>> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> >> going to be using a lot, or if there is custom configuration information
>> >> required to use the InputFormat.
>> >>
>> >> J
>> >>
>> >>
>> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >>
>> >> > That's exactly what I have in the code not using Crunch API:
>> >> > public class MultiSequenceFileInputFormat<K, V> extends
>> >> > CombineFileInputFormat<K, V> {
>> >> > ...
>> >> > }
>> >> >
>> >> > Are you saying there is a way to use my custom input format with Crunch?
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >
>> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> >> > > can be worthwhile to have a CombineFileInputFormat, ala
>> >> > >
>> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >> > >
>> >> > > J
>> >> > >
>> >> > >
>> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> > >
>> >> > > > Thanks Josh,
>> >> > > > Is there any performance penalty in unions, assuming that I have
>> >> > > > several hundreds of input files?
>> >> > > >
>> >> > > >
>> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> > > >
>> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >> > > > >
>> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >> > > > > PTable<K, V> second = ...;
>> >> > > > > PTable<K, V> union = first.union(second);
>> >> > > > >
>> >> > > > > etc.
>> >> > > > >
>> >> > > > >
>> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> > > > >
>> >> > > > > > Is there any support in crunch to use multiple sequence files as
>> >> > > > > > pipeline source?
>> >> > > > > > something similar to standard MultipleInputs
>> >> > > > > >
>> >> > > > > > Thanks,
>> >> > > > > > victor
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>>
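
The union approach quoted above (first.union(second)) extends to any number of inputs by folding a binary union over the list. The sketch below is deliberately library-free so it runs on its own: UnionFold, unionAll and concat are hypothetical names, and lists stand in for Crunch tables. With Crunch, the assumption is that each path is read into its own PTable and the union argument becomes (a, b) -> a.union(b).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class UnionFold {

    // Folds a binary union over all inputs, left to right.
    // With Crunch, T would be PTable<K, V> and union would be (a, b) -> a.union(b)
    // (an assumption; nothing here depends on Crunch).
    static <T> T unionAll(List<T> inputs, BinaryOperator<T> union) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("need at least one input");
        }
        T result = inputs.get(0);
        for (T next : inputs.subList(1, inputs.size())) {
            result = union.apply(result, next);
        }
        return result;
    }

    // Stand-in union for the demo: concatenation of two lists.
    static List<String> concat(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Three "inputs", each standing in for the records of one file.
        List<List<String>> parts = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"),
            Arrays.asList("d", "e"));
        System.out.println(unionAll(parts, UnionFold::concat)); // prints [a, b, c, d, e]
    }
}
```

As noted in the thread, when the inputs are several hundred tiny files a CombineFileInputFormat may be the better fit than a union of one source per file.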
