incubator-crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 15:30:22 GMT
Actually, while we're on the subject of small files and
CombineFileInputFormat...

I believe Hive has a feature whereby CombineFileInputFormat is used
internally when a job has to read many small files, which makes the
resulting MapReduce jobs more efficient. Would it be worth looking
into whether Crunch could support this, too?
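
For readers unfamiliar with the idea: a combining input format packs many
small files into fewer, larger splits so that each map task has more work
to do. The sketch below illustrates only that packing step in plain Java.
It is not Hadoop's or Hive's actual implementation (CombineFileInputFormat
also takes node and rack locality into account), and the class name
SmallFilePacker, the pack method, and the byte cap are hypothetical names
chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration only: groups file sizes into "splits" that stay
// under a byte cap. Real combining input formats also consider data locality.
class SmallFilePacker {

    /**
     * Greedily groups file sizes (in bytes), in input order, into splits whose
     * total size does not exceed maxSplitBytes. A single file larger than the
     * cap gets a split of its own; it is not subdivided here.
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split when adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Flush the last, partially filled split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        System.out.println(pack(sizes, 64L));
    }
}
```

With five files of 10, 20, 30, 40 and 50 bytes and a 64-byte cap, the
sketch yields three splits instead of five, which is the effect that
reduces the number of map tasks.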


On 13 February 2013 15:27, Dave Beech <dave@paraliatech.com> wrote:
> thanks!
>
> On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> https://gist.github.com/viacoban/4945325
>>
>>
>> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>>
>>> A gist would be great - thanks very much
>>>
>>> Dave
>>>
>>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com>
>>> wrote:
>>> > Dave,
>>> >
>>> > How do you want this, copy pasted code into a gist or a reusable jar?
>>> >
>>> > --victor
>>> >
>>> >
>>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>>> >
>>> >> Hi Victor,
>>> >> Any chance you could share your implementation of a Source that reads
>>> >> from multiple paths? I've wanted this for a while but haven't found
>>> >> time to go ahead and write one myself!
>>> >> Thanks,
>>> >> Dave
>>> >>
>>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>>> >> > Thanks J
>>> >> >
>>> >> > I could not extend the FileSourceImpl since it works with only one
>>> >> > input path, but I implemented the Source interface directly and it
>>> >> > appears to do the job, thx for the pointer
>>> >> >
>>> >> > -- victor
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>>> >> >
>>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can
>>> >> >> also write a custom extension of o.a.c.io.impl.FileSourceImpl if it's
>>> >> >> one you're going to be using a lot, or if there is custom
>>> >> >> configuration information required to use the InputFormat.
>>> >> >>
>>> >> >> J
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>>> >> >>
>>> >> >> > That's exactly what I have in my code that doesn't use the Crunch API:
>>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
>>> >> >> > CombineFileInputFormat<K, V> {
>>> >> >> > ...
>>> >> >> > }
>>> >> >> >
>>> >> >> > Are you saying there is a way to use my custom input format with
>>> >> >> > Crunch?
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>>> >> >> >
>>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny
>>> >> >> > > ones, it can be worthwhile to have a CombineFileInputFormat, a la
>>> >> >> > >
>>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>>> >> >> > >
>>> >> >> > > J
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>>> >> >> > >
>>> >> >> > > > Thanks Josh,
>>> >> >> > > > Is there any performance penalty in unions, assuming that I have
>>> >> >> > > > several hundred input files?
>>> >> >> > > >
>>> >> >> > > >
>>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>>> >> >> > > >
>>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>>> >> >> > > > >
>>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>>> >> >> > > > > PTable<K, V> second = ...;
>>> >> >> > > > > PTable<K, V> union = first.union(second);
>>> >> >> > > > >
>>> >> >> > > > > etc.
>>> >> >> > > > >
>>> >> >> > > > >
>>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>>> >> >> > > > >
>>> >> >> > > > > > Is there any support in Crunch for using multiple sequence files
>>> >> >> > > > > > as a pipeline source? Something similar to the standard
>>> >> >> > > > > > MultipleInputs?
>>> >> >> > > > > >
>>> >> >> > > > > > Thanks,
>>> >> >> > > > > > victor
>>> >> >> > > > > >
>>> >> >> > > > >
>>> >> >> > > >
>>> >> >> > >
>>> >> >> >
>>> >> >>
>>> >>
>>>
