incubator-crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 08:59:18 GMT
Hi Victor,
Any chance you could share your implementation of a Source that reads
from multiple paths? I've wanted this for a while but haven't found
time to go ahead and write one myself!
Thanks,
Dave

On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> Thanks J
>
> I could not extend the FileSourceImpl since it works with only one input path,
> but I implemented the Source interface directly and it appears to do the
> job, thx for the pointer
>
> -- victor
>
>
>
> On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
>> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> going to be using a lot, or if there is custom configuration information
>> required to use the InputFormat.
>>
>> J
>>
>>
>> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>>
>> > That's exactly what I have in the code not using Crunch API:
>> > public class MultiSequenceFileInputFormat<K, V> extends
>> > CombineFileInputFormat<K, V> {
>> > ...
>> > }
>> >
>> > Are you saying there is a way to use my custom input format with Crunch?
>> >
>> >
>> >
>> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >
>> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> > > can be worthwhile to have a CombineFileInputFormat, ala
>> > >
>> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> > >
>> > > J
>> > >
>> > >
>> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > >
>> > > > Thanks Josh,
>> > > > Is there any performance penalty in unions, assuming that I have
>> > > > several hundred input files?
>> > > >
>> > > >
>> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> > > >
>> > > > > Yeah, of course-- that's how stuff like joins work.
>> > > > >
>> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> > > > > PTable<K, V> second = ...;
>> > > > > PTable<K, V> union = first.union(second);
>> > > > >
>> > > > > etc.
>> > > > >
>> > > > >
>> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> > > > >
>> > > > > > Is there any support in Crunch to use multiple sequence files as
>> > > > > > pipeline source?
>> > > > > > something similar to standard MultipleInputs
>> > > > > >
>> > > > > > Thanks,
>> > > > > > victor
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
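
The union approach from Josh's snippet extends from two inputs to any number of them by reading each path separately and folding the results together one union at a time. The sketch below only illustrates that fold. Dataset, MultiPathRead and readAll are hypothetical stand-ins written for this example and are not part of the Crunch API. In a real pipeline the reader function would presumably wrap a pipeline.read(...) call and union would be the method used in the snippet quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a distributed collection; NOT the Crunch API.
final class Dataset<T> {
    private final List<T> records;

    Dataset(List<T> records) {
        this.records = new ArrayList<>(records);
    }

    // Concatenates the two datasets and keeps duplicates.
    Dataset<T> union(Dataset<T> other) {
        List<T> merged = new ArrayList<>(records);
        merged.addAll(other.records);
        return new Dataset<>(merged);
    }

    // Returns a defensive copy of the records.
    List<T> records() {
        return new ArrayList<>(records);
    }
}

// Hypothetical helper showing the "read each path, then union" fold.
final class MultiPathRead {
    static <T> Dataset<T> readAll(List<String> paths,
                                  Function<String, Dataset<T>> reader) {
        if (paths.isEmpty()) {
            throw new IllegalArgumentException("at least one input path is required");
        }
        // Start from the first input and union in the remaining ones in order.
        Dataset<T> all = reader.apply(paths.get(0));
        for (String path : paths.subList(1, paths.size())) {
            all = all.union(reader.apply(path));
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake reader: every "file" yields two records named after its path.
        Function<String, Dataset<String>> reader =
            path -> new Dataset<>(Arrays.asList(path + "#0", path + "#1"));
        Dataset<String> all =
            readAll(Arrays.asList("part-00000", "part-00001", "part-00002"), reader);
        System.out.println(all.records());
    }
}
```

This is the same shape as the two-input example in the thread, just driven by a list of paths. As Josh notes, with many tiny files it can be more worthwhile to combine them at the InputFormat level than to create one input per file.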
