spark-user mailing list archives

From David Rosenstrauch <>
Subject Spark misconfigured? Small input split sizes in shark query
Date Tue, 15 Jul 2014 21:58:39 GMT
Got a spark/shark cluster up and running recently, and have been kicking 
the tires on it.  However, I've been wrestling with an issue that I'm 
not quite sure how to solve.  (Or, at least, not quite sure about the 
correct way to solve it.)

I ran a simple Hive query (select count ...) against a dataset of .tsv 
files stored in S3, and then ran the same query on shark for comparison. 
But the shark query took 3x as long.

After a bit of digging, I was able to find out what was happening: 
apparently with the hive query each map task was reading an input split 
consisting of 2 entire files from the dataset (approximately 180MB 
each), while with shark each task was reading an input split consisting 
of a 64MB chunk from one of the files.  This explained the slowdown: 
since the shark query had to open each S3 file 3 separate times (and 
run 3x as many tasks), it made sense that it took much longer.
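If I understand Hadoop's FileInputFormat correctly, the split size it 
computes is max(minSize, min(maxSize, blockSize)), and each file is then 
carved into roughly fileSize / splitSize chunks.  A quick sketch of that 
arithmetic (the function names here are just illustrative, not actual 
Hadoop API):

```python
def compute_split_size(block_size, min_size, max_size):
    """Hadoop's formula: splitSize = max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split_size):
    """A file is chunked into (roughly) ceil(fileSize / splitSize) splits."""
    return -(-file_size // split_size)  # ceiling division

MB = 1024 * 1024

# What I see from shark: minsize=1 and a 64MB S3 block size,
# so a ~180MB file becomes three ~64MB splits.
split = compute_split_size(block_size=64 * MB, min_size=1, max_size=2**63 - 1)
print(num_splits(180 * MB, split))  # 3 splits per file

# With my workaround (minsize bumped to 512MB) each file is one split.
split = compute_split_size(block_size=64 * MB, min_size=512 * MB, max_size=2**63 - 1)
print(num_splits(180 * MB, split))  # 1 split per file
```

So the 64MB chunking would follow directly from whatever block size is 
being reported for the S3 files, given a minsize of 1.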

After much experimentation I was finally able to work around this issue 
by overriding the value of mapreduce.input.fileinputformat.split.minsize 
in my hive-site.xml file.  (Bumping it up to 512MB.)  However, I'm 
feeling like this isn't really the "right" way to solve the issue:

a) That param defaults to 1.  It doesn't seem right that I should 
need to override it - let alone set it to a value as large as 512MB.

b) We only seem to experience this issue on an existing Hadoop cluster 
that we've deployed spark/shark onto.  When we run the same query on a 
new cluster launched via the spark ec2 scripts, the number of splits 
seems to get calculated correctly - without the need for overriding that 
param.  This leads me to believe we may just have something misconfigured 
on our existing cluster.

c) This seems like an error-prone way to overcome the issue.  512MB is 
an arbitrary value, and should I happen to run a query against files 
that are larger than 512MB, I'll run into the chunking issue again.
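For reference, the override I described above looks roughly like this in 
my hive-site.xml (536870912 bytes = 512MB):

```xml
<!-- Workaround: force input splits of at least 512MB so each ~180MB
     S3 file is read as a single split instead of 64MB chunks. -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>536870912</value>
</property>
```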

So my gut tells me there's a better way to solve this issue - i.e., 
somehow configuring spark so that the input splits it generates won't 
chunk the input files.  Anyone know how to accomplish this / what I 
might have misconfigured?


