Subject: Re: CREATE TABLE ignores database when using PARQUET option
From: Carlos Pereira <cpereira@groupon.com>
To: Michael Armbrust <michael@databricks.com>
Cc: user@spark.apache.org
Date: Fri, 8 May 2015 15:11:06 -0700

Thanks Michael for the quick reply. I was looking forward to the automatic schema inference (I think that's what you mean by 'schema merging'?), and I think the STORED AS route would still require me to define the table columns, right?
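
For concreteness, a rough sketch of what the STORED AS PARQUET route would look like; the columns here (id, name) are made up, which is exactly the catch: the schema has to be spelled out by hand.

CREATE EXTERNAL TABLE foo_db.mytable_parquet_hive (
  id BIGINT,     -- hypothetical column
  name STRING    -- hypothetical column
)
STORED AS PARQUET
LOCATION '/user/foo/data.parquet';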

Anyway, I am glad to hear you guys are already working to fix this in future releases.

Thanks,
Carlos

On Fri, May 8, 2015 at 2:43 PM, Michael Armbrust <michael@databricks.com> wrote:
This is an unfortunate limitation of the data source API, which does not support multiple databases. For Parquet in particular (if you aren't using schema merging), you can create a Hive table using STORED AS PARQUET today. I hope to fix this limitation in Spark 1.5.

On Fri, May 8, 2015 at 2:41 PM, Carlos Pereira <cpereira@groupon.com> wrote:
Hi, I would like to create a Hive table on top of an existing Parquet file, as
described here:
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

Due to network restrictions, I need to store the metadata definition in a
different path than '/user/hive/warehouse', so I first create a new database on
my own HDFS dir:

CREATE DATABASE foo_db LOCATION '/user/foo';
USE foo_db;

And then I run the following query:

CREATE TABLE mytable_parquet
USING parquet
OPTIONS (path "/user/foo/data.parquet")

The problem is that Spark SQL is not using the database defined in the shell context, but the default database instead:

----------------------------
> CREATE TABLE mytable_parquet USING parquet OPTIONS (path
"/user/foo/data.parquet");
15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: get_table : *db=foo_db* tbl=mytable_parquet

15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr cmd=get_table : db=foo_db tbl=mytable_parquet

15/05/08 20:42:21 INFO metastore.HiveMetaStore: 0: create_table:
Table(tableName:mytable_parquet, *dbName:default,* owner:foo,
createTime:1431117741, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{serialization.format=1, path=/user/foo/data.parquet}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
15/05/08 20:42:21 INFO HiveMetaStore.audit: ugi=foo ip=unknown-ip-addr
cmd=create_table: Table(tableName:mytable_parquet, dbName:default,
owner:foo, createTime:1431117741, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:col, type:array<string>, comment:from deserializer)], location:null,
inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,
parameters:{serialization.format=1, path=/user/foo/data.parquet}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{})), partitionKeys:[],
parameters:{EXTERNAL=TRUE, spark.sql.sources.provider=parquet},
viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

15/05/08 20:42:21 ERROR hive.log: Got exception:
org.apache.hadoop.security.AccessControlException Permission denied:
user=foo, access=WRITE,
inode="/user/hive/warehouse":hive:grp_gdoop_hdfs:drwxr-xr-x
----------------------------


The permission error above happens because my Linux user does not have write
access to the default metastore path. I can work around this issue if I use
CREATE TEMPORARY TABLE and have no metadata written to disk.
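
For reference, the temporary-table variant mentioned above is the same data source statement with a TEMPORARY keyword; it only registers the table for the current session and writes nothing to the metastore:

CREATE TEMPORARY TABLE mytable_parquet
USING parquet
OPTIONS (path "/user/foo/data.parquet");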

I would like to know if I am doing anything wrong here and if there is any
additional property I can use to force the database/metastore_dir I need to
write to.
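
As a point of reference, the '/user/hive/warehouse' path in the error comes from the Hive property hive.metastore.warehouse.dir. A minimal sketch of overriding it for the session is below (the target path is only an example); with a remote metastore service this client-side setting may be ignored, and it would not change the fact that the table is registered in the default database:

SET hive.metastore.warehouse.dir=/user/foo/warehouse;  -- example path, not from the thread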

Thanks,
Carlos




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CREATE-TABLE-ignores-database-when-using-PARQUET-option-tp22824.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


