spark-user mailing list archives

From Carlos Pereira <cpere...@groupon.com>
Subject Re: CREATE TABLE ignores database when using PARQUET option
Date Fri, 08 May 2015 22:11:06 GMT
Thanks, Michael, for the quick reply. I was looking forward to the automatic
schema inference (I think that's what you mean by 'schema merging'?), and I
believe STORED AS would still require me to define the table columns,
right?
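
For reference, this is roughly what I understand the STORED AS form to look
like, with the column list declared by hand (the columns below are only
placeholders, and LOCATION would need to point at the directory holding the
Parquet data):

CREATE EXTERNAL TABLE mytable_parquet (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
LOCATION '/user/foo/data.parquet';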

Anyway, I am glad to hear you guys are already working to fix this in a
future release.

Thanks,
Carlos

On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust <michael@databricks.com>
wrote:

> This is an unfortunate limitation of the data source API, which does not
> support multiple databases.  For Parquet in particular (if you aren't using
> schema merging), you can create a Hive table using STORED AS PARQUET
> today.  I hope to fix this limitation in Spark 1.5.
>
> On Fri, May 8, 2015 at 2:41 PM, Carlos Pereira <cpereira@groupon.com>
> wrote:
>
>> Hi, I would like to create a Hive table on top of an existing Parquet file,
>> as described here:
>>
>> https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html
>>
>> Due to network restrictions, I need to store the metadata definition in a
>> path other than '/user/hive/warehouse', so I first create a new database
>> in my own HDFS dir:
>>
>> CREATE DATABASE foo_db LOCATION '/user/foo';
>> USE foo_db;
>>
>> And then I run the following query:
>>
>> CREATE TABLE mytable_parquet
>> USING parquet
>> OPTIONS (path "/user/foo/data.parquet")
>>
>> The problem is that Spark SQL is not using the database set in the shell
>> context, but the default database instead:
>>
>> ----------------------------
>>  > CREATE TABLE mytable_parquet USING parquet OPTIONS (path
>> "/user/foo/data.parquet");
>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: get_table : *db=foo_db*
>> tbl=mytable_parquet
>>
>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo     ip=unknown-ip-addr
>> cmd=get_table : db=foo_db tbl=mytable_parquet
>>
>> 15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: create_table:
>> Table(tableName:mytable_parquet, *dbName:default,* owner:foo,
>> createTime:1431117741, lastAccessTime:0, retention:0,
>> sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>,
>> comment:from deserializer)], location:null,
>> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
>> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
>>
>> serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
>> parameters:{serialization.format=1, path=/user/foo/data.parquet}),
>> bucketCols:[], sortCols:[], parameters:{},
>> skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
>> skewedColValueLocationMaps:{})), partitionKeys:[],
>> parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
>> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>> 15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo     ip=unknown-ip-addr
>> cmd=create_table: Table(tableName:mytable_parquet, dbName:default,
>> owner:foo, createTime:1431117741, lastAccessTime:0, retention:0,
>> sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>,
>> comment:from deserializer)], location:null,
>> inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
>> outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
>> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
>>
>> serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
>> parameters:{serialization.format=1, path=/user/foo/data.parquet}),
>> bucketCols:[], sortCols:[], parameters:{},
>> skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
>> skewedColValueLocationMaps:{})), partitionKeys:[],
>> parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
>> viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
>>
>> 15/05/08 20:42:21 ERROR hive.log: Got exception:
>> org.apache.hadoop.security.AccessControlException Permission denied:
>> user=foo, access=WRITE,
>> inode="/user/hive/warehouse":hive:grp_gdoop_hdfs:drwxr-xr-x
>> ----------------------------
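>>
>> (For completeness, this is how I double-check where a table ends up from the
>> same shell; here the create itself fails with the permission error above, so
>> it shows up in neither database:)
>>
>> USE default;
>> SHOW TABLES;
>> USE foo_db;
>> SHOW TABLES;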
>>
>>
>> The permission error above happens because my Linux user does not have
>> write access to the default warehouse path. I can work around the issue by
>> using CREATE TEMPORARY TABLE, so that no metadata is written to disk.
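>>
>> The temporary form I mean is roughly this; the table only lives in the
>> current SQLContext and disappears when the shell exits:
>>
>> CREATE TEMPORARY TABLE mytable_parquet
>> USING parquet
>> OPTIONS (path "/user/foo/data.parquet");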
>>
>> I would like to know if I am doing anything wrong here, and whether there is
>> any additional property I can use to force the database/metastore dir I need
>> to write to.
>>
>> Thanks,
>> Carlos
>>
>>
>>
>>
>
