spark-reviews mailing list archives

From: GitBox <...@apache.org>
Subject: [GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
Date: Fri, 27 Sep 2019 11:38:42 GMT
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963
 
 
   > it seems like the sweet spot is somewhere between 20-30 threads (for my environment anyway: 2015 MacBook Pro, i7 with 8 cores).
   
   Interesting. You may get different numbers running in EC2; it's always best to benchmark perf there. Remote dev amplifies some performance issues (the cost of reopening an HTTP connection, general latency) while hiding others (how easily Spark jobs can overload S3 shards and so get throttled, cause delays, trigger speculative task execution, more throttling, etc.).
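   
   To make the thing being benchmarked concrete: the pattern here is just fanning the blocking glob calls out over a bounded pool. A minimal standalone sketch in Scala (my own illustration; `parallelGlob` is a hypothetical helper, not the PR's actual code):
   
   ```scala
   import java.util.concurrent.Executors
   import scala.concurrent.duration.Duration
   import scala.concurrent.{Await, ExecutionContext, Future}
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileStatus, Path}
   
   // Glob each path on a fixed-size pool so the blocking S3 round trips
   // overlap instead of running one after another.
   def parallelGlob(paths: Seq[Path], conf: Configuration, numThreads: Int): Seq[FileStatus] = {
     val pool = Executors.newFixedThreadPool(numThreads)
     implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
     try {
       val futures = paths.map { path =>
         Future {
           val fs = path.getFileSystem(conf)          // blocking metadata call
           Option(fs.globStatus(path)).toSeq.flatten  // globStatus may return null
         }
       }
       Await.result(Future.sequence(futures), Duration.Inf).flatten
     } finally {
       pool.shutdown()
     }
   }
   ```
   
   Each extra thread only helps while there's a free HTTP connection for it to use, which is why the thread count and the connection limit need tuning together.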
   
   Try changing "fs.s3a.connection.maximum" from the default of 48 to something bigger. That's the limit on the HTTP pool size. It's kept small to stop a single s3a instance from overloading the system, but you may want to raise it here. There's also "fs.s3a.max.total.tasks", which controls the thread pool used for background writes of blocks of large files; in Hadoop trunk it also covers parallel delete/rename operations, plus work inside the AWS SDK itself.
   
   * "fs.s3a.connection.maximum" should be > than "fs.s3a.max.total.tasks"
   * "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections around for longer
(avoids that https overhead)
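   
   Putting those together, a starting point might look like this (the values are illustrative, not tuned recommendations):
   
   ```
   spark.hadoop.fs.s3a.connection.maximum 96
   spark.hadoop.fs.s3a.max.total.tasks 64
   spark.hadoop.fs.s3a.threads.keepalivetime 300
   ```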
   
   Try with some bigger numbers and see if you get the same results. Your scanning threads may just be blocking on the HTTP connection pool.
   
   For bonus fun, force random IO for ORC/Parquet perf; with remote reads, also set the minimum read block to 256K or bigger:
   
   ```
   spark.hadoop.fs.s3a.readahead.range 256K
   spark.hadoop.fs.s3a.input.fadvise random
   ```
   
   Note: Java 8's default SSL encryption underperforms. We've been doing work there, but it's too early to think about backporting it. I'm planning a refresh of the s3a connector for Hadoop 3.2.2 which should include it (https://github.com/apache/hadoop/pull/970).
   For now: look at [stack overflow](https://stackoverflow.com/questions/25992131/slow-aes-gcm-encryption-and-decryption-with-java-8u20)
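   
   (If that work ships the way the PR currently reads, switching the TLS implementation should come down to one option, something like the line below; the property name is my reading of the PR and would only exist in Hadoop builds that include the change:)
   
   ```
   spark.hadoop.fs.s3a.ssl.channel.mode openssl
   ```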
