crunch-dev mailing list archives

From Dave Beech <d...@paraliatech.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 16:49:21 GMT
I haven't tried the code yet, but I think it looks correct.
MultiSequenceFileRecordReader gets created via reflection, so its
constructor needs the (CombineFileSplit split, TaskAttemptContext context,
Integer index) signature.
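
To illustrate why the constructor signature matters, here is a minimal sketch of a reflective constructor lookup. It is not Hadoop's or the gist's actual code: CombineFileSplit and TaskAttemptContext below are hypothetical stand-in classes so the example compiles without Hadoop on the classpath, and ReflectiveReaderDemo, newReader, hasExpectedConstructor and WrongSignatureReader are names invented for this illustration.

```java
import java.lang.reflect.Constructor;

// Hypothetical stand-in for Hadoop's CombineFileSplit (assumption, not the real class).
class CombineFileSplit {
    private final String[] paths;
    CombineFileSplit(String... paths) { this.paths = paths; }
    String getPath(int i) { return paths[i]; }
}

// Hypothetical stand-in for Hadoop's TaskAttemptContext (assumption, not the real class).
class TaskAttemptContext {}

// A reader exposing the three-argument constructor that the reflective lookup expects.
class MultiSequenceFileRecordReader {
    final String path;
    public MultiSequenceFileRecordReader(CombineFileSplit split,
                                         TaskAttemptContext context,
                                         Integer index) {
        // The index selects which file of the combined split this reader handles.
        this.path = split.getPath(index);
    }
}

// A reader with a different constructor signature; the lookup below cannot find it.
class WrongSignatureReader {
    public WrongSignatureReader(CombineFileSplit split, TaskAttemptContext context) {}
}

class ReflectiveReaderDemo {
    // Looks up the (CombineFileSplit, TaskAttemptContext, Integer) constructor and calls it.
    static <T> T newReader(Class<T> readerClass, CombineFileSplit split,
                           TaskAttemptContext context, int index)
            throws ReflectiveOperationException {
        Constructor<T> ctor = readerClass.getDeclaredConstructor(
                CombineFileSplit.class, TaskAttemptContext.class, Integer.class);
        ctor.setAccessible(true);
        return ctor.newInstance(split, context, index);
    }

    // Returns true only if the class declares exactly the expected constructor.
    static boolean hasExpectedConstructor(Class<?> readerClass) {
        try {
            readerClass.getDeclaredConstructor(
                    CombineFileSplit.class, TaskAttemptContext.class, Integer.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        CombineFileSplit split = new CombineFileSplit("part-00000", "part-00001");
        MultiSequenceFileRecordReader reader = newReader(
                MultiSequenceFileRecordReader.class, split, new TaskAttemptContext(), 1);
        System.out.println(reader.path);
        System.out.println(hasExpectedConstructor(WrongSignatureReader.class));
    }
}
```

Note that the lookup is exact: a constructor declared with a primitive int index would not match a lookup for Integer.class, so the boxed type in the signature matters too.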

On 13 February 2013 16:40, Josh Wills <josh.wills@gmail.com> wrote:
> Ha! Quite possibly. Let's JIRA it up.
>
> Victor, I haven't had much coffee yet, but it looks like there is a bug in
> the gist-- the MultiSequenceFileInputFormat refers to a new
> CombineFileRecordReader, which has a different constructor signature from
> the MultiSequenceFileRecordReader in the patch. What did I miss?
>
> J
>
>
> On Wed, Feb 13, 2013 at 8:34 AM, Dave Beech <dave@paraliatech.com> wrote:
>
>> Love it enough to write it for us? ;) I'll stick it in JIRA just in
>> case. Or if not, maybe one day I'll have a free couple of hours and
>> feel like doing it myself!
>>
>> Cheers,
>> Dave
>>
>> On 13 February 2013 16:18, Josh Wills <jwills@cloudera.com> wrote:
>> > Yep, I would love that.
>> >
>> >
>> > On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <dave@paraliatech.com> wrote:
>> >
>> >> Actually, while we're on the subject of small files and
>> >> CombineFileInputFormat...
>> >>
>> >> I believe Hive has a feature whereby CombineFileInputFormat is used
>> >> internally if it's required to read many small files to make the
>> >> resulting mapreduce jobs more efficient. Would it be worth looking
>> >> into whether Crunch could support this, too?
>> >>
>> >>
>> >> On 13 February 2013 15:27, Dave Beech <dave@paraliatech.com> wrote:
>> >> > thanks!
>> >> >
>> >> > On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >> https://gist.github.com/viacoban/4945325
>> >> >>
>> >> >>
>> >> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>> >> >>
>> >> >>> A gist would be great - thanks very much
>> >> >>>
>> >> >>> Dave
>> >> >>>
>> >> >>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> > Dave,
>> >> >>> >
>> >> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
>> >> >>> >
>> >> >>> > --victor
>> >> >>> >
>> >> >>> >
>> >> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
>> >> >>> >
>> >> >>> >> Hi Victor,
>> >> >>> >> Any chance you could share your implementation of a Source that reads
>> >> >>> >> from multiple paths? I've wanted this for a while but haven't found
>> >> >>> >> time to go ahead and write one myself!
>> >> >>> >> Thanks,
>> >> >>> >> Dave
>> >> >>> >>
>> >> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> > Thanks J
>> >> >>> >> >
>> >> >>> >> > I could not extend the FileSourceImpl since it works with only one input
>> >> >>> >> > path, but I implemented the Source interface directly and it appears to do
>> >> >>> >> > the job, thx for the pointer
>> >> >>> >> >
>> >> >>> >> > -- victor
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> >
>> >> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >
>> >> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
>> >> >>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
>> >> >>> >> >> going to be using a lot, or if there is custom configuration information
>> >> >>> >> >> required to use the InputFormat.
>> >> >>> >> >>
>> >> >>> >> >> J
>> >> >>> >> >>
>> >> >>> >> >>
>> >> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >>
>> >> >>> >> >> > That's exactly what I have in the code not using Crunch API:
>> >> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
>> >> >>> >> >> > ...
>> >> >>> >> >> > }
>> >> >>> >> >> >
>> >> >>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> >
>> >> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >> >
>> >> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
>> >> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
>> >> >>> >> >> > >
>> >> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
>> >> >>> >> >> > >
>> >> >>> >> >> > > J
>> >> >>> >> >> > >
>> >> >>> >> >> > >
>> >> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >> > >
>> >> >>> >> >> > > > Thanks Josh,
>> >> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have several
>> >> >>> >> >> > > > hundreds of input files?
>> >> >>> >> >> > > >
>> >> >>> >> >> > > >
>> >> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
>> >> >>> >> >> > > >
>> >> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
>> >> >>> >> >> > > > > PTable<K, V> second = ...;
>> >> >>> >> >> > > > > PTable<K, V> union = first.union(second);
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > etc.
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence files as pipeline
>> >> >>> >> >> > > > > > source?
>> >> >>> >> >> > > > > > something similar to standard MultipleInputs
>> >> >>> >> >> > > > > >
>> >> >>> >> >> > > > > > Thanks,
>> >> >>> >> >> > > > > > victor
>> >> >>> >> >> > > > > >
>> >> >>> >> >> > > > >
>> >> >>> >> >> > > >
>> >> >>> >> >> > >
>> >> >>> >> >> >
>> >> >>> >> >>
>> >> >>> >>
>> >> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Director of Data Science
>> > Cloudera <http://www.cloudera.com>
>> > Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>>
