Mailing-List: contact crunch-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: crunch-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of jwills@cloudera.com
 designates 209.85.128.173 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFZSZPtZjnYqLctM+u-bwfyiQaox9OqwPW73FZA9wb-1G6AL_Q@mail.gmail.com>
References: 
 <CAGqYk+UzoOgAG96q5dHwjJK=v5SWLSHcXRVxZkjEKmO1wtPunQ@mail.gmail.com>
 <CANb5z2JLVUpnSf14402Bq_BB9c=4HagnHwG4h=FzjPQgqJO2xw@mail.gmail.com>
 <CAGqYk+UEN6nAYAkfrK0d2tNWu8jqwSCP6071TGUNa0w67z+yzA@mail.gmail.com>
 <CANb5z2K52zmT3v5mDajgoHrPNtC53uAhc6iJe9-eY8d4Z=O_PA@mail.gmail.com>
 <CAGqYk+VPzg+unV+fue2Ono3T-zxtX=iRo9dL2NxbMe7sXZtpmw@mail.gmail.com>
 <CANb5z2Kwyb7iSC_5T11=6t081mmHb2ax_pp8zPMZGpjUJUud8Q@mail.gmail.com>
 <CAGqYk+VPn_TeSsJBSz1FmTGi1Os=BzzWoWVTFk7LARjnTEbu4A@mail.gmail.com>
 <CAFZSZPvspyvhQ5niBk3y=PzrHz8_Ai_TG1JPEv-itC2XGjG9Zw@mail.gmail.com>
 <CAGqYk+UTR=FZtxwOKs7GCNuyMvpCzX2soiN=24eUDTbBjLcuCg@mail.gmail.com>
 <CAFZSZPurQnO2G6SdvtYQUb+SvGPRXziZr3M-XrqKrKJFMSYckQ@mail.gmail.com>
 <CAGqYk+UpzPYstxRY9659F-H-T8S6MkTC+2ZrHNPpWZ==aGnmaQ@mail.gmail.com>
 <CAFZSZPtd1Ky9TFkFEMe-041LTwPPTd6WT1EYB7s08MMYYWyNtg@mail.gmail.com>
 <CAFZSZPtZjnYqLctM+u-bwfyiQaox9OqwPW73FZA9wb-1G6AL_Q@mail.gmail.com>
From: Josh Wills <jwills@cloudera.com>
Date: Wed, 13 Feb 2013 08:18:32 -0800
Message-ID: 
 <CAH29n6MM5QSGzoUE7kvBreG6R8dMVk1xK+35yF_P-ALY=21gGg@mail.gmail.com>
Subject: Re: multiple input files as pipeline source?
To: crunch <crunch-dev@incubator.apache.org>
Content-Type: multipart/alternative; boundary=20cf307c9bba7ab6c804d59d7f70

--20cf307c9bba7ab6c804d59d7f70
Content-Type: text/plain; charset=ISO-8859-1

Yep, I would love that.


On Wed, Feb 13, 2013 at 7:30 AM, Dave Beech <dave@paraliatech.com> wrote:

> Actually, while we're on the subject of small files and
> CombineFileInputFormat...
>
> I believe Hive has a feature whereby CombineFileInputFormat is used
> internally if it's required to read many small files to make the
> resulting mapreduce jobs more efficient. Would it be worth looking
> into whether Crunch could support this, too?
>
>
> On 13 February 2013 15:27, Dave Beech <dave@paraliatech.com> wrote:
> > thanks!
> >
> > On 13 February 2013 15:22, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >> https://gist.github.com/viacoban/4945325
> >>
> >>
> >> On Wed, Feb 13, 2013 at 9:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >>
> >>> A gist would be great - thanks very much
> >>>
> >>> Dave
> >>>
> >>> On 13 February 2013 14:52, Victor Iacoban <victor.iacoban@gmail.com>
> >>> wrote:
> >>> > Dave,
> >>> >
> >>> > How do you want this, copy pasted code into a gist or a reusable jar?
> >>> >
> >>> > --victor
> >>> >
> >>> >
> >>> > On Wed, Feb 13, 2013 at 3:59 AM, Dave Beech <dave@paraliatech.com> wrote:
> >>> >
> >>> >> Hi Victor,
> >>> >> Any chance you could share your implementation of a Source that reads
> >>> >> from multiple paths? I've wanted this for a while but haven't found
> >>> >> time to go ahead and write one myself!
> >>> >> Thanks,
> >>> >> Dave
> >>> >>
> >>> >> On 12 February 2013 23:07, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> > Thanks J
> >>> >> >
> >>> >> > I could not extend the FileSourceImpl since it works with only one input
> >>> >> > path, but I implemented the Source interface directly and it appears to
> >>> >> > do the job, thx for the pointer
> >>> >> >
> >>> >> > -- victor
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > On Tue, Feb 12, 2013 at 5:20 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >
> >>> >> >> Yep-- check out the formattedFile function in o.a.c.io.From. You can also
> >>> >> >> write a custom extension of o.a.c.io.impl.FileSourceImpl if it's one you're
> >>> >> >> going to be using a lot, or if there is custom configuration information
> >>> >> >> required to use the InputFormat.
> >>> >> >>
> >>> >> >> J
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Feb 12, 2013 at 2:13 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >>
> >>> >> >> > That's exactly what I have in my code that doesn't use the Crunch API:
> >>> >> >> > public class MultiSequenceFileInputFormat<K, V> extends
> >>> >> >> > CombineFileInputFormat<K, V> {
> >>> >> >> > ...
> >>> >> >> > }
> >>> >> >> >
> >>> >> >> > Are you saying there is a way to use my custom input format with Crunch?
> >>> >> >> >
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > On Tue, Feb 12, 2013 at 5:06 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >> >
> >>> >> >> > > Depends on the size of the files-- if there are a bunch of tiny ones, it
> >>> >> >> > > can be worthwhile to have a CombineFileInputFormat, à la
> >>> >> >> > >
> >>> >> >> > > http://yaseminavcular.blogspot.com/2011/03/many-small-input-files.html
> >>> >> >> > >
> >>> >> >> > > J
> >>> >> >> > >
> >>> >> >> > >
> >>> >> >> > > On Tue, Feb 12, 2013 at 1:56 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >> > >
> >>> >> >> > > > Thanks Josh,
> >>> >> >> > > > Is there any performance penalty in unions, assuming that I have
> >>> >> >> > > > several hundred input files?
> >>> >> >> > > >
> >>> >> >> > > >
> >>> >> >> > > > On Tue, Feb 12, 2013 at 4:39 PM, Josh Wills <josh.wills@gmail.com> wrote:
> >>> >> >> > > >
> >>> >> >> > > > > Yeah, of course-- that's how stuff like joins work.
> >>> >> >> > > > >
> >>> >> >> > > > > PTable<K, V> first = pipeline.read(new TableSource<K, V>(firstFile));
> >>> >> >> > > > > PTable<K, V> second = ...;
> >>> >> >> > > > > PTable<K, V> union = first.union(second);
> >>> >> >> > > > >
> >>> >> >> > > > > etc.
> >>> >> >> > > > >
> >>> >> >> > > > >
> >>> >> >> > > > > On Tue, Feb 12, 2013 at 1:36 PM, Victor Iacoban <victor.iacoban@gmail.com> wrote:
> >>> >> >> > > > >
> >>> >> >> > > > > > Is there any support in Crunch for using multiple sequence files as a
> >>> >> >> > > > > > pipeline source, something similar to the standard MultipleInputs?
> >>> >> >> > > > > >
> >>> >> >> > > > > > Thanks,
> >>> >> >> > > > > > victor
> >>> >> >> > > > > >
> >>> >> >> > > > >
> >>> >> >> > > >
> >>> >> >> > >
> >>> >> >> >
> >>> >> >>
> >>> >>
> >>>
>
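
A rough sketch of the small-files idea discussed above, in plain Java with no
Hadoop or Crunch dependency. It only illustrates the grouping step: many small
inputs are packed into fewer splits under a size cap, so fewer map tasks would
be launched. This is not Hadoop's CombineFileInputFormat logic and not a Crunch
API; SplitPacker, pack, and maxSplitBytes are hypothetical names made up for
this example, and it ignores data locality entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Hadoop's CombineFileInputFormat
// and not part of the Crunch API.
final class SplitPacker {

    // Groups file sizes (in bytes) into splits, in input order, so that each
    // split's total stays at or under maxSplitBytes. A single file larger
    // than the cap ends up in a split of its own.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        if (maxSplitBytes <= 0) {
            throw new IllegalArgumentException("maxSplitBytes must be positive");
        }
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Flush whatever is left in the last, partially filled split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = List.of(10L, 20L, 30L, 40L, 100L, 5L);
        // Prints [[10, 20, 30], [40], [100], [5]]
        System.out.println(pack(sizes, 64L));
    }
}
```

With a cap of 64 bytes, the six example files end up in four splits instead of
six; with several hundred tiny files the reduction would be much larger.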


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

--20cf307c9bba7ab6c804d59d7f70--