spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] cozos edited a comment on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
Date Tue, 24 Sep 2019 04:11:10 GMT
cozos edited a comment on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem
calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-534380034
 
 
   I ran additional measurements with different thread counts for `ThreadUtils` on
the S3 Landsat data, and the sweet spot seems to be somewhere between 20 and 30 threads (in
my environment, anyway).
   
   **30 glob paths** - 30 glob paths with a final result of 1206 files
   **single glob path** - 1 glob path with a final result of 1206 files
   **raw paths** - 1206 raw paths without any globs
   
   see here: https://github.com/apache/spark/pull/25899#issuecomment-534069194
   
   **original code**
   30 glob paths _15.6 seconds_
   single glob path _11.3 seconds_
   raw paths _59 seconds_
   
   **8 threads**
   30 glob paths _1.48 seconds_
   single glob path _11 seconds_
   raw paths _7.73 seconds_
   
   **20 threads**
   30 glob paths _1.47 seconds_
   single glob path _15.45 seconds_
   raw paths _4.16 seconds_
   
   **30 threads**
   30 glob paths _0.92 seconds_
   single glob path _11.74 seconds_
   raw paths _4.12 seconds_
   
   **40 threads**
   30 glob paths _0.93 seconds_
   single glob path _13.48 seconds_
   raw paths _4.08 seconds_
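
To make the comparison easier to read, here is a minimal sketch that turns the timings above into speedups relative to the original code. The numbers are copied from the measurements in this comment; the `timings` layout and the `speedups` helper are illustrative names, not part of the PR, and the sketch assumes the first row of every block refers to the same 30-glob-path workload.

```python
# Timings in seconds, copied from the measurements in this comment.
# "original" is the unparallelized code; integer keys are thread counts.
timings = {
    "30 glob paths": {"original": 15.6, 8: 1.48, 20: 1.47, 30: 0.92, 40: 0.93},
    "single glob path": {"original": 11.3, 8: 11.0, 20: 15.45, 30: 11.74, 40: 13.48},
    "raw paths": {"original": 59.0, 8: 7.73, 20: 4.16, 30: 4.12, 40: 4.08},
}


def speedups(by_threads):
    """Return {thread_count: baseline_time / parallel_time}, rounded to 2 places."""
    baseline = by_threads["original"]
    return {
        threads: round(baseline / seconds, 2)
        for threads, seconds in by_threads.items()
        if threads != "original"
    }


# Print one line per workload, e.g. "raw paths: 8 threads: ...x, 20 threads: ...x"
for workload, by_threads in timings.items():
    row = ", ".join(f"{t} threads: {s}x" for t, s in speedups(by_threads).items())
    print(f"{workload}: {row}")
```

A ratio above 1 means the threaded run was faster than the original code, and a ratio below 1 means it was slower. These are ratios of single measurements from one environment, so small differences between thread counts may be noise.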
   

