Subject: Re: CREATE TABLE ignores database when using PARQUET option
From: Carlos Pereira <cpereira@groupon.com>
To: Michael Armbrust <michael@databricks.com>
Cc: user@spark.apache.org
Date: Fri, 8 May 2015 15:11:06 -0700

Thanks Michael for the quick reply. I was looking forward to the automatic schema inference (I think that's what you mean by 'schema merging'?), and I think the STORED AS route would still require me to define the table columns, right?
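
For concreteness, a rough sketch of what the STORED AS PARQUET route would look like; the columns here (id, name) are made up, which is exactly the catch: the schema has to be spelled out by hand.

CREATE EXTERNAL TABLE foo_db.mytable_parquet_hive (
  id BIGINT,     -- hypothetical column
  name STRING    -- hypothetical column
)
STORED AS PARQUET
LOCATION '/user/foo/data.parquet';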

Anyway, I am glad to hear you guys are already working to fix this in future releases.

Thanks,
Carlos

On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust <michael@databricks.com> wrote:
This is an unfortunate limitation of the data source API, which does not support multiple databases. For Parquet in particular (if you aren't using schema merging), you can create a Hive table using STORED AS PARQUET today. I hope to fix this limitation in Spark 1.5.

On Fri, May 8, 2015 at 2:41 PM, Carlos Pereira <cpereira@groupon.com> wrote:
Hi, I would like to create a Hive table on top of an existing Parquet file, as
described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

Due to network restrictions, I need to store the metadata definition in a
different path than '/user/hive/warehouse', so I first create a new database on
my own HDFS dir:

CREATE DATABASE foo_db LOCATION '/user/foo';
USE foo_db;

And then I run the following query:

CREATE TABLE mytable_parquet
USING parquet
OPTIONS (path "/user/foo/data.parquet")

The problem is that Spark SQL is not using the database defined in the shell context, but the default database instead:

----------------------------
> CREATE TABLE mytable_parquet USING parquet OPTIONS (path
"/user/foo/data.parquet");
15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: get_table : *db=foo_db* tbl=mytable_parquet

15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=get_table : db=foo_db tbl=mytable_parquet

15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: create_table:
Table(tableName:mytable_parquet, *dbName:default,* owner:foo,
createTime:1431117741, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{serialization.format=1, path=/user/foo/data.parquet}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr
cmd=create_table: Table(tableName:mytable_parquet, dbName:default,
owner:foo, createTime:1431117741, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{serialization.format=1, path=/user/foo/data.parquet}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

15/05/08 20:42:21 ERROR hive.log: Got exception:
org.apache.hadoop.security.AccessControlException Permission denied:
user=foo, access=WRITE,
inode="/user/hive/warehouse":hive:grp_gdoop_hdfs:drwxr-xr-x
----------------------------


The permission error above happens because my Linux user does not have write
access to the default metastore path. I can work around this issue if I use
CREATE TEMPORARY TABLE and have no metadata written to disk.
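
For reference, the temporary-table variant mentioned above is the same data source statement with a TEMPORARY keyword; it only registers the table for the current session and writes nothing to the metastore:

CREATE TEMPORARY TABLE mytable_parquet
USING parquet
OPTIONS (path "/user/foo/data.parquet");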

I would like to know if I am doing anything wrong here and if there is any
additional property I can use to force the database/metastore_dir I need to
write to.
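
As a point of reference, the '/user/hive/warehouse' path in the error comes from the Hive property hive.metastore.warehouse.dir. A minimal sketch of overriding it for the session is below (the target path is only an example); with a remote metastore service this client-side setting may be ignored, and it would not change the fact that the table is registered in the default database:

SET hive.metastore.warehouse.dir=/user/foo/warehouse;  -- example path, not from the thread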

Thanks,
Carlos




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


