spark-dev mailing list archives

From zhangxiongfei <zhangxiongfei0...@163.com>
Subject Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?
Date Fri, 17 Apr 2015 09:51:20 GMT
Hi,
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on Tachyon; each file
is about 222 MB. Then I read them with the code below:
val tfs =sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick");
Next, I saved this DataFrame to HDFS with the code below. It also generates 36 Parquet files,
but each file is about 265 MB:
tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon");
My question is: why do the files on HDFS have a different size from those on Tachyon, even though
they come from the same original data?
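One variable I have not ruled out is the write-side compression codec: saveAsParquetFile uses whatever spark.sql.parquet.compression.codec is set to at write time, which may differ from the codec used when the original Tachyon files were produced. A minimal sketch of pinning it explicitly before the save, using the same sqlContext and paths as above (whether this explains the size gap is unconfirmed):

```scala
// Sketch: pin the Parquet codec so the HDFS write is compressed the same
// way as the original files. "gzip" here matches the codec of the files
// on Tachyon described above; this is a configuration guess, not a
// confirmed fix for the size difference.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")
tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon")
```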


Thanks
Zhang Xiongfei
