Date: Fri, 20 Nov 2015 15:41:13 -0700 (MST)
From: nezih
To: user@spark.apache.org
Message-ID:
<1448059273903-25440.post@n3.nabble.com>
Subject: question about combining small input splits

Hey everyone,

I have a Hive table that consists of a lot of small parquet files, and I am creating a data frame out of it to do some processing. Since I have a large number of splits/files, my job creates a lot of tasks, which I don't want. Basically, what I want is the same functionality that Hive provides: combining these small input splits into larger ones by specifying a max split size setting. Is this currently possible with Spark?

While exploring whether I can use coalesce, I hit another issue. With coalesce I can only control the number of output files, not their sizes. And since the total input dataset size can vary significantly in my case, I cannot just use a fixed partition count, as the size of each output file could get very large. I looked for a way to get the total input size from an RDD, so that I could come up with a heuristic for setting the partition count, but I couldn't find one.

Any help is appreciated.

Thanks,
Nezih

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/question-about-combining-small-input-splits-tp25440.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
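The heuristic described in the message — one partition per fixed chunk of input bytes, derived from the total input size — can be sketched as follows. This is a minimal illustration, not the poster's actual code: the helper name and the 128 MB target size are assumptions, and the commented PySpark usage is an untested sketch.

```python
import math

def partition_count(total_bytes, target_split_bytes=128 * 1024 * 1024):
    """Pick one partition per ~target_split_bytes of input, never fewer than 1.

    The 128 MB default is an assumed target, chosen to mirror a typical
    HDFS block size; any max-split-size setting could stand in for it.
    """
    return max(1, math.ceil(total_bytes / target_split_bytes))

# If the total input size were obtainable, it could drive coalesce, e.g.
# (hypothetical PySpark usage; assumes an HDFS-backed table and handles
# named `fs`, `path`, and `sqlContext`):
#   total = fs.getContentSummary(path).getLength()
#   df = sqlContext.table("my_table").coalesce(partition_count(total))

print(partition_count(10 * 1024 ** 3))  # 10 GiB of input -> 80 partitions
```

The `max(1, ...)` guard matters for the varying-input-size case raised above: a near-empty table still yields one partition rather than zero.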