crunch-dev mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: multiple input files as pipeline source?
Date Wed, 13 Feb 2013 16:52:24 GMT
Ah, I see it, thanks! Now for more coffee.


On Wed, Feb 13, 2013 at 8:49 AM, Dave Beech <dave@paraliatech.com> wrote:

> I haven't tried the code yet but I think it looks correct.
> MultiSequenceFileRecordReader will get created via reflection and
> needs the (CombineFileSplit split, TaskAttemptContext context, Integer
> index) sig as its constructor.
>
> On 13 February 2013 16:40, Josh Wills <josh.wills@gmail.com> wrote:
> > Ha! Quite possibly. Let's JIRA it up.
> >
> > Victor, I haven't had much coffee yet, but it looks like there is a bug in
> > the gist-- the MultiSequenceFileInputFormat refers to a new
> > CombineFileRecordReader, which has a different constructor signature from
> > the MultiSequenceFileRecordReader in the patch. What did I miss?
> >
> > J
> >
> >
> > On Wed, Feb 13, 2013 at 8:34 AM, Dave Beech <dave@paraliatech.com> wrote:
> >
> >> Love it enough to write it for us? ;) I'll stick it in JIRA just in
> >> case. Or if not, maybe one day I'll have a free couple of hours and
> >> feel like doing it myself!
> >>
> >> Cheers,
> >> Dave
> >>
> >> On 13 February 2013 16:18, Josh Wills <jwills@cloudera.com> wrote:
> >> > Yep, I would love that.
> >> >
> >> >
> >> > On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <dave@paraliatech.com> wrote:
> >> >
> >> >> Actually, while we're on the subject of small files and
> >> >> CombineFileInputFormat...
> >> >>
> >> >> I believe Hive has a feature whereby CombineFileInputFormat is used
> >> >> internally if it's required to read many small files to make the
> >> >> resulting mapreduce jobs more efficient. Would it be worth looking
> >> >> into whether Crunch could support this, too?
> >> >>
> >> >>
> >> >> > On 13 February 2013 15:27, Dave Beech <dave@paraliatech.com> wrote:
> >> >> > thanks!
> >> >> >
> >> >> > On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >> https://gist.github.com/viacoban/4945325
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >> >> >>
> >> >> >>> A gist would be great - thanks very much
> >> >> >>>
> >> >> >>> Dave
> >> >> >>>
> >> >> >>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> > Dave,
> >> >> >>> >
> >> >> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >> >> >>> >
> >> >> >>> > --victor
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >> >> >>> >
> >> >> >>> >> Hi Victor,
> >> >> >>> >> Any chance you could share your implementation of a Source that reads
> >> >> >>> >> from multiple paths? I've wanted this for a while but haven't found
> >> >> >>> >> time to go ahead and write one myself!
> >> >> >>> >> Thanks,
> >> >> >>> >> Dave
> >> >> >>> >>
> >> >> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> > Thanks J
> >> >> >>> >> >
> >> >> >>> >> > I could not extend the FileSourceImpl since it works with only one input
> >> >> >>> >> > path, but I implemented the Source interface directly and it appears to do
> >> >> >>> >> > the job, thx for the pointer
> >> >> >>> >> >
> >> >> >>> >> > -- victor
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >
> >> >> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
> >> >> >>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
> >> >> >>> >> >> going to be using a lot, or if there is custom configuration information
> >> >> >>> >> >> required to use the InputFormat.
> >> >> >>> >> >>
> >> >> >>> >> >> J
> >> >> >>> >> >>
> >> >> >>> >> >>
> >> >> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >>
> >> >> >>> >> >> > That's exactly what I have in the code not using Crunch API:
> >> >> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
> >> >> >>> >> >> > ...
> >> >> >>> >> >> > }
> >> >> >>> >> >> >
> >> >> >>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
> >> >> >>> >> >> >
> >> >> >>> >> >> >
> >> >> >>> >> >> >
> >> >> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >> >
> >> >> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> >> >> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, ala
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > J
> >> >> >>> >> >> > >
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >> > >
> >> >> >>> >> >> > > > Thanks Josh,
> >> >> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have several
> >> >> >>> >> >> > > > hundreds of input files?
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >> >> >>> >> >> > > > > PTable<K, V> second = ...;
> >> >> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > etc.
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > > > > Is there any support in crunch to use multiple sequence files as pipeline
> >> >> >>> >> >> > > > > > source?
> >> >> >>> >> >> > > > > > something similar to standard MultipleInputs
> >> >> >>> >> >> > > > > >
> >> >> >>> >> >> > > > > > Thanks,
> >> >> >>> >> >> > > > > > victor
> >> >> >>> >> >> > > > > >
> >> >> >>> >> >> > > > >
> >> >> >>> >> >> > > >
> >> >> >>> >> >> > >
> >> >> >>> >> >> >
> >> >> >>> >> >>
> >> >> >>> >>
> >> >> >>>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Director of Data Science
> >> > Cloudera <http://www.cloudera.com>
> >> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >>
> >>
>
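
Dave's reply near the top of the thread says the per-file record reader is created via reflection and needs a (CombineFileSplit split, TaskAttemptContext context, Integer index) constructor. The sketch below illustrates that kind of reflective lookup using only the JDK, so it runs without Hadoop on the classpath. Split, Context, GoodReader, BadReader, create and canCreate are hypothetical stand-ins invented for this example; they are not Hadoop or Crunch classes, and this is not the code from the gist.

```java
import java.lang.reflect.Constructor;

public class ReflectiveReaderDemo {
    // Hypothetical stand-ins for CombineFileSplit and TaskAttemptContext.
    static class Split {}
    static class Context {}

    // Declares the three-argument constructor that the lookup expects.
    static class GoodReader {
        final int index;

        public GoodReader(Split split, Context context, Integer index) {
            this.index = index;
        }
    }

    // Declares a different signature, so the lookup below cannot find it.
    static class BadReader {
        public BadReader(Split split, Context context) {}
    }

    // Looks up a (Split, Context, Integer) constructor by reflection and invokes it.
    static <T> T create(Class<T> readerClass, Split split, Context context, int index)
            throws ReflectiveOperationException {
        Constructor<T> ctor =
                readerClass.getDeclaredConstructor(Split.class, Context.class, Integer.class);
        ctor.setAccessible(true);
        return ctor.newInstance(split, context, index);
    }

    // Returns true when the class can be instantiated through the reflective lookup.
    static boolean canCreate(Class<?> readerClass) {
        try {
            create(readerClass, new Split(), new Context(), 0);
            return true;
        } catch (ReflectiveOperationException e) {
            // Includes NoSuchMethodException when the signature does not match.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("GoodReader: " + canCreate(GoodReader.class));
        System.out.println("BadReader: " + canCreate(BadReader.class));
    }
}
```

A reader class whose constructor differs from the expected parameter list compiles fine but fails at lookup time, which is why a signature mismatch of the sort Josh suspected would only surface when the job runs.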
