spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Spark SQL, Hive & Parquet data types
Date Fri, 20 Feb 2015 13:44:51 GMT
For the second question, we do plan to support Hive 0.14, possibly in 
Spark 1.4.0.

For the first question:

 1. In Spark 1.2.0, the Parquet support code doesn’t support the
    timestamp type, so you can’t read or write timestamps natively.
 2. In Spark 1.3.0, timestamp support was added. Spark SQL also uses
    its own Parquet support for both the read path and the write path
    when dealing with Parquet tables declared in the Hive metastore, as
    long as you’re not writing to a partitioned table. So yes, you can.

The Parquet version bundled with Spark 1.3.0 is 1.6.0rc3, which supports
the timestamp type natively. However, the Parquet versions bundled with
Hive 0.13.1 and Hive 0.14.0 are 1.3.2 and 1.5.0 respectively, and neither
of them supports the timestamp type. Hive 0.14.0 “supports” reading and
writing timestamps from/to Parquet by converting them from/to Parquet
binaries. Similarly, Impala converts timestamps into Parquet int96. This
can be
annoying for Spark SQL, because we must interpret Parquet files in 
different ways according to the original writer of the file. As Parquet 
matures, recent Parquet versions support more and more standard data 
types. Mappings from complex nested types to Parquet types are also
being standardized; see
<https://github.com/apache/incubator-parquet-mr/pull/83>.

On 2/20/15 6:50 AM, The Watcher wrote:

> Still trying to get my head around Spark SQL & Hive.
>
> 1) Let's assume I *only* use Spark SQL to create and insert data into HIVE
> tables, declared in a Hive meta-store.
>
> Does it matter at all if Hive supports the data types I need with Parquet,
> or is all that matters what Catalyst & spark's parquet relation support ?
>
> Case in point : timestamps & Parquet
> * Parquet now supports them as per
> https://github.com/Parquet/parquet-mr/issues/218
> * Hive only supports them in 0.14
> So would I be able to read/write timestamps natively in Spark 1.2 ? Spark
> 1.3 ?
>
> I have found this thread
> http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-td15414.html
> which seems to indicate that the data types supported by Hive would matter
> to Spark SQL.
> If so, why is that ? Doesn't the read path go through Spark SQL to read the
> parquet file ?
>
> 2) Is there planned support for Hive 0.14 ?
>
> Thanks
>
​
