hudi-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [incubator-hudi] tverdokhlebd opened a new issue #1443: [SUPPORT] Spark-Hudi consumes too much space in a temp folder while upsert
Date Wed, 25 Mar 2020 08:54:32 GMT
URL: https://github.com/apache/incubator-hudi/issues/1443
 
 
   We have a Vertica database with around 25 million rows (~9 GB) and an S3 bucket as the target storage. I am trying to move the data from Vertica to S3 with Hudi 0.5.1.
   
   The first run (the insert stage) completes quickly and successfully, and the moved data takes around 3 GB on S3.
   
   The second run (the upsert stage) fails with the error "No space left on device".
   
   I attached external storage (200 GB), pointed the Spark temp folder (spark.local.dir) at the mounted disk, and restarted the job. Unfortunately, Spark consumes too much of the external space as well, and the "No space left on device" error appears again. The upsert stage also takes far too long to process the data.
   
   I have tried tuning the configuration (it is currently at the default parameters), but it does not help.
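   
   One suspicion (not verified): with hoodie.upsert.shuffle.parallelism=1, the whole 25-million-row upsert goes through a single shuffle partition, so all of the intermediate spill files pile up in one task's directory under spark.local.dir. A minimal sketch of the tuning I am experimenting with, assuming that is the cause (writeTuned and the value 8 are illustrative, not part of the actual job):
   
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.{DataFrame, SaveMode}
   
   // Sketch of a tuning attempt (assumption, not a confirmed fix): raise the
   // Hudi shuffle parallelism so the upsert's intermediate data is spread
   // over several smaller shuffle partitions instead of a single huge one.
   def writeTuned(df: DataFrame, tableName: String, outputPath: String): Unit =
     df.write
       .format("org.apache.hudi")
       .option(HoodieWriteConfig.TABLE_NAME, tableName)
       // 8 mirrors the partitionsCount used for the JDBC read shown below
       .option("hoodie.upsert.shuffle.parallelism", "8")
       .option("hoodie.insert.shuffle.parallelism", "8")
       .mode(SaveMode.Append)
       .save(outputPath)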
   
   Config:
   
   docker run --rm \
     -v /var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi:/var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi \
     -v /mnt/ml_data:/mnt/ml_data \
     bde2020/spark-master:2.4.5-hadoop2.7 \
     bash ./spark/bin/spark-submit \
     --master 'local[2]' \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf spark.local.dir=/mnt/ml_data \
     --conf spark.ui.enabled=false \
     --conf spark.task.maxFailures=1 \
     --conf spark.driver.maxResultSize=2g \
     --conf spark.driver.memory=4g \
     --conf spark.rdd.compress=true \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.kryoserializer.buffer.max=512m \
     --conf spark.shuffle.service.enabled=true \
     --conf spark.sql.hive.convertMetastoreParquet=false \
     --conf spark.task.cpus=1 \
     --conf spark.hadoop.fs.defaultFS=s3a://mtu-ml-bucket/ml_hudi \
     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
     --conf spark.hadoop.fs.s3a.access.key=**** \
     --conf spark.hadoop.fs.s3a.secret.key=**** \
     --conf spark.executorEnv.date=2020-03-15 \
     --conf spark.executorEnv.numDays=1 \
     --conf spark.executorEnv.jdbcDriver=com.vertica.jdbc.Driver \
     --conf 'spark.executorEnv.jdbcUrl=jdbc:vertica://mtubi-vertica-prod.gradium.info:5433' \
     --conf spark.executorEnv.jdbcUser=******** \
     --conf spark.executorEnv.jdbcPassword=******** \
     --conf spark.executorEnv.schemaName=mtu_owner \
     --conf spark.executorEnv.tableName=ext_ml_data \
     --conf spark.executorEnv.dateColumnName=hit_date \
     --conf spark.executorEnv.partitionColumnName=hit_timestamp \
     --conf spark.executorEnv.partitionsCount=8 \
     --conf spark.executorEnv.outputPath=s3a://mtu-ml-bucket/ml_hudi \
     --conf spark.executorEnv.hudiParallelism=1 \
     --conf spark.executorEnv.hudiBulkInsertParallelism=1 \
     --class mtu.spark.analytics.MLDataToS3Job /var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi/ml-vertica-to-s3.jar
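   
   The job reads its parameters from the spark.executorEnv.* entries above. A minimal sketch of how that presumably looks (the real MLDataToS3Job source is not included here, so this is illustrative only):
   
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical sketch of how the job parameters above might be read;
   // SparkConf.get works for any key, including spark.executorEnv.* entries.
   val spark = SparkSession.builder().getOrCreate()
   val conf  = spark.sparkContext.getConf
   
   val jdbcUrl         = conf.get("spark.executorEnv.jdbcUrl")
   val tableName       = conf.get("spark.executorEnv.tableName")
   val partitionsCount = conf.get("spark.executorEnv.partitionsCount")
   val hudiParallelism = conf.get("spark.executorEnv.hudiParallelism")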
   
   Code:
   
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.HoodieWriteConfig
   import org.apache.spark.sql.SaveMode
   
   // Push the partitioning expression down to Vertica: the synthetic
   // "partition" column lets Spark split the JDBC read into
   // partitionsCount parallel queries.
   val dbTable = s"(select *, mod($partitionColumnName, $partitionsCount) as partition from $schemaName.$tableName where $dateColumnName='$date') as t"
   
   spark
     .read
     .jdbc(
       url = jdbcUrl,
       table = dbTable,
       columnName = "partition",
       lowerBound = 0,
       upperBound = partitionsCount.toInt,
       numPartitions = partitionsCount.toInt,
       connectionProperties = buildConnectionProperties()
     )
     .write
     .format("org.apache.hudi")
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, hudiRecordKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, hudiPrecombineKey)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, hudiPartitionPathKey)
     .option("hoodie.bulkinsert.shuffle.parallelism", hudiBulkInsertParallelism)
     .option("hoodie.insert.shuffle.parallelism", hudiParallelism)
     .option("hoodie.upsert.shuffle.parallelism", hudiParallelism)
     // SaveMode.Append on an existing Hudi table takes the upsert path
     .mode(SaveMode.Append)
     .save(outputPath)
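   
   buildConnectionProperties() is not shown above; a minimal hypothetical version, assuming it just packages the JDBC driver and credentials (written here with explicit parameters for clarity, unlike the zero-argument call above):
   
   import java.util.Properties
   
   // Hypothetical stand-in for the buildConnectionProperties() helper; the
   // real implementation is not shown in this issue. The arguments are
   // assumed to be the values read from the spark.executorEnv.* settings.
   // Spark's JDBC reader recognizes the "driver", "user" and "password" keys.
   def buildConnectionProperties(jdbcDriver: String,
                                 jdbcUser: String,
                                 jdbcPassword: String): Properties = {
     val props = new Properties()
     props.setProperty("driver", jdbcDriver)   // e.g. com.vertica.jdbc.Driver
     props.setProperty("user", jdbcUser)
     props.setProperty("password", jdbcPassword)
     props
   }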
