crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 15:27:55 GMT
thanks!

On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> https://gist.github.com/viacoban/4945325
>
>
> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>
>> A gist would be great - thanks very much
>>
>> Dave
>>
>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > Dave,
>> >
>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >
>> > --victor
>> >
>> >
>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>> >
>> >> Hi Victor,
>> >> Any chance you could share your implementation of a Source that reads
>> >> from multiple paths? I've wanted this for a while but haven't found
>> >> time to go ahead and write one myself!
>> >> Thanks,
>> >> Dave
>> >>
>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> > Thanks J
>> >> >
>> >> > I could not extend the FileSourceImpl since it works with only one input
>> >> > path, but I implemented the Source interface directly and it appears to do
>> >> > the job, thx for the pointer
>> >> >
>> >> > -- victor
>> >> >
>> >> >
>> >> >
>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >
>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> >> >> going to be using a lot, or if there is custom configuration information
>> >> >> required to use the InputFormat.
>> >> >>
>> >> >> J
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>
>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
>> >> >> > CombineFileInputFormat<K, V> {
>> >> >> > ...
>> >> >> > }
>> >> >> >
>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >> >
>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
>> >> >> > >
>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >> >> > >
>> >> >> > > J
>> >> >> > >
>> >> >> > >
>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >> > >
>> >> >> > > > Thanks Josh,
>> >> >> > > > Is there any performance penalty in unions, assuming that I have
>> >> >> > > > several hundreds of input files?
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >> > > >
>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >> >> > > > >
>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >> >> > > > > PTable<K, V> second = ...;
>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >> >> > > > >
>> >> >> > > > > etc.
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >> > > > >
>> >> >> > > > > > Is there any support in crunch to use multiple sequence files as
>> >> >> > > > > > pipeline source?
>> >> >> > > > > > something similar to standard MultipleInputs
>> >> >> > > > > > something similar to standard MultipleInputs
>> >> >> > > > > >
>> >> >> > > > > > Thanks,
>> >> >> > > > > > victor
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >>
>> >>
>>
