spark-dev mailing list archives

From zhangxiongfei <zhangxiongfei0...@163.com>
Subject Re:Re: Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size
Date Wed, 15 Apr 2015 02:22:02 GMT

JIRA opened: https://issues.apache.org/jira/browse/SPARK-6921

At 2015-04-15 00:57:24, "Cheng Lian" <lian.cs.zju@gmail.com> wrote:

> Would you mind opening a JIRA for this?
>
> I think your suspicion makes sense. Will have a look at this tomorrow.
> Thanks for reporting!
>
> Cheng
>
> On 4/13/15 7:13 PM, zhangxiongfei wrote:
>> Hi experts,
>> I ran the code below in the Spark shell to access Parquet files in Tachyon.
>>
>> 1. First, created a DataFrame by loading a bunch of Parquet files in Tachyon:
>>
>>   val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
>>
>> 2. Second, set "fs.local.block.size" to 256 MB to make sure the block size of the output files in Tachyon is 256 MB:
>>
>>   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>>
>> 3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
>>
>>   ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
>>
>> After the code ran successfully, the output Parquet files were stored in Tachyon, but these files have different block sizes. Below is the information for the files under "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
>>
>> File Name             Size       Block Size  In-Memory  Pin  Creation Time
>> _SUCCESS              0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
>> _common_metadata      1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
>> _metadata             22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
>> part-r-00001.parquet  177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
>> part-r-00002.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
>> part-r-00003.parquet  177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
>> part-r-00004.parquet  177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
>> part-r-00005.parquet  177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
>> part-r-00006.parquet  177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648
>>
>> It seems that the saveAsParquetFile API does not distribute/broadcast the Hadoop configuration to the executors the way other APIs such as saveAsTextFile do. The configuration "fs.local.block.size" only takes effect on the driver.
>> If I set that configuration before loading the Parquet files, the problem goes away.
>> Could anyone help me verify this problem?
>>
>> Thanks,
>> Zhang Xiongfei
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
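The suspected mechanism behind the report can be sketched in plain Scala, without Spark. This is an illustrative stand-in, not Spark's actual internals: the mutable map plays the role of `sc.hadoopConfiguration`, and the immutable snapshot plays the role of the configuration captured when `parquetFile` first loads the data. A snapshot taken at load time never sees later mutations, which would explain why setting `fs.local.block.size` before the load works while setting it afterwards does not.

```scala
import scala.collection.mutable

object ConfigSnapshotSketch {
  def main(args: Array[String]): Unit = {
    // Driver-side configuration; 32 MB is the default block size
    // observed in the output files of the report.
    val driverConf = mutable.Map("fs.local.block.size" -> 32L * 1024 * 1024)

    // "Loading" the DataFrame snapshots the configuration,
    // analogous to shipping it to the executors at that point.
    val snapshotAtLoad = driverConf.toMap

    // Setting the block size AFTER the load mutates only the driver's copy.
    driverConf("fs.local.block.size") = 268435456L // 256 MB

    // Executors write with the stale snapshot: still 32 MB.
    println(snapshotAtLoad("fs.local.block.size")) // 33554432 (32 MB)
    println(driverConf("fs.local.block.size"))     // 268435456 (256 MB)
  }
}
```

Under this reading, the workaround in the message (calling `sc.hadoopConfiguration.setLong(...)` before the `sqlContext.parquetFile(...)` call) works simply because the desired value is already present when the snapshot is taken.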