spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Loughran <ste...@hortonworks.com>
Subject Re: Leveraging S3 select
Date Wed, 13 Dec 2017 11:59:52 GMT


On 8 Dec 2017, at 17:05, Andrew Duffy <aduffy@palantir.com<mailto:aduffy@palantir.com>>
wrote:

Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done a decent amount
of digging, but all I've found is a reference in a slide deck

Is that one of mine?

We haven't done any benchmarking with/without random IO for a while, as we've taken that as
a given and worrying about the other aspects of the problem: speeding up directory listings
& getFileStatus calls (used a lot in the serialized partitioning phase), and making direct
commits of work to S3 both correct and performant.

I'm trying to sort out some benchmarking there, which involves: cherry picking the new changes
to an internal release, building that, having someone who understands benchmarking set up
a cluster and run the tests, which involves their time and the cost of the clusters. I say
clusters as it'll inevitably involve playing with different VM options and some EMR clusters
alongside(*).

One bit of fun there becomes the fact that different instances of the same cluster specs may
give different numbers; it depends on actual CPUs allocated, network, neighbours. When we
do publish some numbers, we do it from a single cluster instance, rather than doing "best
per-test outcome on multiple clusters". Good to check if others do the same.

Otherwise: test with your own code & the Hadoop 2.8.1+ JARs; see what numbers you get.
If you are using Parquet or ORC, I would not consider using the sequential IO code. At the
same time, if you are working with CSV, Avro, gzip, you don't want to use it, because what
would be a single file GET with some forward skips of read & discard of data is now a
slow sequence of GETs with latency between each one.
HADOOP-14965<https://issues.apache.org/jira/browse/HADOOP-14965> (not yet committed)
changes the default policy of an input stream to "switch to random IO mode on the first backwards
seek", so you don't need to decide upfront that to use. There's the potential cost of the
first HTTPS abort on that initial backwards seek, but after, random IO all the way. the Wasb
client has been doing this for a while and everyone is happy, not least because its one less
tuning option to document & test, and eliminates a whole class of support calls "client
is fast to read .csv but not .orc files".

-Steve


(*) I have a hadoop branch-2.9 fork with the new committer stuff in if someone wants to compare
numbers there. Bear in mind that the current RDD.write(s3a://something) command, when it uses
the Hadoop FileOutputFormats and hence the FileOutputCommitter is not just observably a slow
O(data) kind of operation, it is *not correct*, so the performance is just a detail. It's
the one you notice, but not the issue to fear. Fixed by HADOOP-13786 & a bit of glue to
keep Spark happy

Mime
View raw message