spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Parquet compression codecs not applied
Date Fri, 06 Feb 2015 01:13:03 GMT
Hi Ayoub,

The doc page isn’t wrong, but it is indeed confusing. 
|spark.sql.parquet.compression.codec| is used when you’re writing a Parquet 
file with something like |data.saveAsParquetFile(...)|. However, you are 
using Hive DDL in the example code. All Hive DDLs and commands like 
|SET| are delegated directly to Hive, which unfortunately ignores Spark 
configurations. Even so, the doc page should be updated.
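
To make the two code paths concrete, here is a minimal sketch for the Spark 
shell (Spark 1.2-era API). It assumes the |raw_foo| and |foo| tables from the 
example code already exist; the output path |hdfs://path/data/foo_spark| is 
made up for illustration, and |parquet.compression| is the Hive-side property 
Michael suggested earlier in the thread.

```scala
import org.apache.spark.sql.hive.HiveContext

// `sc` is the SparkContext that the Spark shell provides.
val hiveContext = new HiveContext(sc)

// Path 1: Spark SQL's own Parquet writer. The Spark setting is honored here.
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
val data = hiveContext.sql("SELECT * FROM raw_foo")
data.saveAsParquetFile("hdfs://path/data/foo_spark") // hypothetical output path

// Path 2: Hive DDL/DML. The statement is delegated to Hive, so the codec has
// to be set through the Hive-side property rather than the Spark setting.
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.sql("SET parquet.compression=GZIP")
hiveContext.sql("""insert into table foo partition(year, month, day)
  select *,
    year(from_unixtime(ts)) as year,
    month(from_unixtime(ts)) as month,
    day(from_unixtime(ts)) as day
  from raw_foo""")
```

Comparing the size of the files written by each path with and without the 
codec setting is a quick way to check which setting took effect.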

Best,
Cheng

On 1/10/15 5:49 AM, Ayoub Benali wrote:

> it worked thanks.
>
> this doc page 
> <https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends 
> using "spark.sql.parquet.compression.codec" to set the compression 
> codec, and I thought this setting would be forwarded to the hive 
> context given that HiveContext extends SQLContext, but it was not.
>
> I am wondering if this behavior is normal, if not I could open an 
> issue with a potential fix so that 
> "spark.sql.parquet.compression.codec" would be translated to 
> "parquet.compression" in the hive context ?
>
> Or the documentation should be updated to mention that the compression 
> codec is set differently with HiveContext.
>
> Ayoub.
>
>
>
> 2015-01-09 17:51 GMT+01:00 Michael Armbrust <michael@databricks.com>:
>
>     This is a little confusing, but that code path is actually going
>     through hive.  So the spark sql configuration does not help.
>
>     Perhaps, try:
>     set parquet.compression=GZIP;
>
>     On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <benali.ayoub.info@gmail.com> wrote:
>
>         Hello,
>
>         I tried to save a table created via the hive context as a
>         parquet file, but whatever compression codec (uncompressed,
>         snappy, gzip or lzo) I set via setConf like:
>
>         setConf("spark.sql.parquet.compression.codec", "gzip")
>
>         the size of the generated files is always the same, so it
>         seems like the spark context ignores the compression codec
>         that I set.
>
>         Here is a code sample applied via the spark shell:
>
>         import org.apache.spark.sql.hive.HiveContext
>         val hiveContext = new HiveContext(sc)
>
>         hiveContext.sql("SET hive.exec.dynamic.partition = true")
>         hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
>         // required to make data compatible with impala
>         hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
>         hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
>
>         hiveContext.sql("""create external table if not exists foo (bar STRING, ts INT)
>           Partitioned by (year INT, month INT, day INT)
>           STORED AS PARQUET
>           Location 'hdfs://path/data/foo'""")
>
>         hiveContext.sql("""insert into table foo partition(year, month, day)
>           select *,
>             year(from_unixtime(ts)) as year,
>             month(from_unixtime(ts)) as month,
>             day(from_unixtime(ts)) as day
>           from raw_foo""")
>
>         I tried that with spark 1.2 and a 1.3 snapshot against hive 0.13,
>         and I also tried it with Impala on the same cluster, which
>         applied the compression codecs correctly.
>
>         Does anyone know what could be the problem ?
>
>         Thanks,
>         Ayoub.