Subject: Re: hadoop input/output format advanced control
From: Imran Rashid
Date: Wed, 25 Mar 2015 15:42:32 -0500
To: Nick Pentreath
Cc: dev@spark.apache.org

Hi Nick,

I don't remember the exact details of these scenarios, but I think the user
wanted a lot more control over how the files got grouped into partitions --
they wanted to group the files together by some arbitrary function. I didn't
think that was possible with CombineFileInputFormat, but maybe there is a way?
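For the simpler case of just packing many small files into fewer partitions,
a rough sketch of the copy-the-conf approach might look like the following
(untested; it assumes a Hadoop 2.x classpath and a SparkContext named sc, and
the path and split-size value are only placeholders):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Copy the context-wide Hadoop conf so the override applies only to this RDD.
    val conf = new Configuration(sc.hadoopConfiguration)
    // Upper bound on how many bytes CombineTextInputFormat packs into one
    // combined split; 128 MB is just an illustrative value.
    conf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728")

    // Many small files get grouped into a single partition per combined split.
    val grouped = sc.newAPIHadoopFile(
      "/some/path",
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)

That handles size-based grouping, but as far as I can tell it still doesn't
let you group files by an arbitrary function, which is what the user was after.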
thanks

On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath wrote:

> Imran, on your point about reading multiple files together in a partition,
> is it not simpler to use the approach of copying the Hadoop conf and setting
> per-RDD settings for min split to control the input size per partition,
> together with something like CombineFileInputFormat?
>
> On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid wrote:
>
> > I think this would be a great addition, and I totally agree that you need
> > to be able to set these at a finer context than just the SparkContext.
> >
> > Just to play devil's advocate, though -- the alternative is for you to
> > just subclass HadoopRDD yourself, or make a totally new RDD, and then you
> > could expose whatever you need. Why is this solution better? IMO the
> > criteria are:
> > (a) common operations
> > (b) error-prone / difficult to implement
> > (c) non-obvious, but important for performance
> >
> > I think this case fits (a) & (c), so I think it's still worthwhile. But
> > it's also worth asking whether or not it's too difficult for a user to
> > extend HadoopRDD right now. There have been several cases in the past week
> > where we've suggested that a user should read from HDFS themselves (e.g.,
> > to read multiple files together in one partition) -- with*out* reusing the
> > code in HadoopRDD, though they would lose things like the metric tracking
> > & preferred locations you get from HadoopRDD. Does HadoopRDD need some
> > refactoring to make that easier to do? Or do we just need a good example?
> >
> > Imran
> >
> > (sorry for hijacking your thread, Koert)
> >
> > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers wrote:
> >
> > > See email below. Reynold suggested I send it to dev instead of user.
> > >
> > > ---------- Forwarded message ----------
> > > From: Koert Kuipers
> > > Date: Mon, Mar 23, 2015 at 4:36 PM
> > > Subject: hadoop input/output format advanced control
> > > To: "user@spark.apache.org"
> > >
> > > Currently it's pretty hard to control the Hadoop Input/Output formats
> > > used in Spark. The convention seems to be to add extra parameters to all
> > > methods and then somewhere deep inside the code (for example in
> > > PairRDDFunctions.saveAsHadoopFile) all these parameters get translated
> > > into settings on the Hadoop Configuration object.
> > >
> > > For example, for compression I see "codec: Option[Class[_ <:
> > > CompressionCodec]] = None" added to a bunch of methods.
> > >
> > > How scalable is this solution, really?
> > >
> > > For example, I need to read from a Hadoop dataset and I don't want the
> > > input (part) files to get split up. The way to do this is to set
> > > "mapred.min.split.size". Now I don't want to set this at the level of
> > > the SparkContext (which can be done), since I don't want it to apply to
> > > input formats in general -- I want it to apply to just this one specific
> > > input dataset I need to read. Which leaves me with no options currently.
> > > I could go add yet another input parameter to all the methods
> > > (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile,
> > > etc.), but that seems ineffective.
> > >
> > > Why can we not expose a Map[String, String] or some other generic way to
> > > manipulate settings for Hadoop input/output formats? It would require
> > > adding one more parameter to all methods that deal with Hadoop
> > > input/output formats, but after that it's done. One parameter to rule
> > > them all....
> > >
> > > Then I could do:
> > > val x = sc.textFile("/some/path", formatSettings =
> > >   Map("mapred.min.split.size" -> "12345"))
> > >
> > > or
> > > rdd.saveAsTextFile("/some/path", formatSettings =
> > >   Map("mapred.output.compress" -> "true",
> > >     "mapred.output.compression.codec" -> "somecodec"))
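For comparison, a rough sketch of what such a formatSettings map could amount
to today, using the existing conf-taking read methods (untested; the helper
name withFormatSettings is made up purely for illustration, and it assumes a
SparkContext named sc, e.g. in spark-shell):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext

    // Hypothetical helper: overlay per-dataset settings on a *copy* of the
    // context-wide Hadoop conf, so they never leak into other reads or writes.
    def withFormatSettings(sc: SparkContext,
                           settings: Map[String, String]): Configuration = {
      val conf = new Configuration(sc.hadoopConfiguration)
      settings.foreach { case (k, v) => conf.set(k, v) }
      conf
    }

    // Per-dataset minimum split size; the key is the new-API spelling of
    // "mapred.min.split.size", and 12345 is just the value from the example above.
    val conf = withFormatSettings(sc,
      Map("mapreduce.input.fileinputformat.split.minsize" -> "12345"))
    val raw = sc.newAPIHadoopFile(
      "/some/path",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)
    val x = raw.map(_._2.toString)  // same element type as sc.textFile

On the write side, PairRDDFunctions.saveAsNewAPIHadoopFile likewise accepts a
Configuration, so the compression settings in the example above could be
scoped to a single save in the same way.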