hive-user mailing list archives

From Sandeep Giri <sand...@cloudxlab.com>
Subject Re: Why does the user need write permission on the location of external hive table?
Date Sat, 18 Jun 2016 18:38:17 GMT
Hi Igor,
Yes, the same error occurs on unzipped files too.

Hi Mich,
Turning off the permissions is not possible on a real cluster.

Regards,
Sandeep Giri

On Tue, Jun 7, 2016 at 5:57 AM Markovitz, Dudu <dmarkovitz@paypal.com>
wrote:

> I’ll rephrase –
>
> I would strongly recommend not working with zipped files and MR.
>
> Dudu
>
> *From:* Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
> *Sent:* Tuesday, June 07, 2016 2:59 AM
>
>
> *To:* user <user@hive.apache.org>
> *Subject:* Re: Why does the user need write permission on the location of
> external hive table?
>
>
>
> Well, your mileage may vary, as usual.
>
>
>
>
>
> I have not encountered any issues with it. Why do you want to use
> map-reduce anyway? Try the Spark engine. It uses a DAG plus in-memory
> capabilities and is faster than map-reduce by an order of magnitude.
> Map-reduce is deprecated, and you can configure Hive to use Spark from
> 0.14 onwards.
>
>
>
> hive> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may not be available in the future
> versions. Consider using a different execution engine (i.e. spark, tez) or
> using Hive 1.X releases.
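>
> To switch to the Spark engine instead, a minimal sketch (assuming Hive on
> Spark is already configured on the cluster, as the session below suggests):
>
> hive> set hive.execution.engine=spark;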
>
>
>
> hive> select count(1) from dummy;
> Query ID = hduser_20160607005023_eff56232-1890-4cd3-8999-a0ff87fedfb7
> Total jobs = 1
> Launching Job 1 out of 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Spark Job = a30151eb-3bab-400f-8ee7-74e4a2578dd8
>
> Query Hive on Spark job[1] stages:
> 2
> 3
>
> Status: Running (Hive on Spark job[1])
> Job Progress Format
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
> 2016-06-07 00:52:37,678 Stage-2_0: 0(+2)/22     Stage-3_0: 0/1
> 2016-06-07 00:52:38,683 Stage-2_0: 2(+2)/22     Stage-3_0: 0/1
> 2016-06-07 00:52:39,689 Stage-2_0: 6(+2)/22     Stage-3_0: 0/1
> 2016-06-07 00:52:40,696 Stage-2_0: 8(+2)/22     Stage-3_0: 0/1
> 2016-06-07 00:52:41,700 Stage-2_0: 11(+0)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:44,716 Stage-2_0: 11(+2)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:45,721 Stage-2_0: 13(+2)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:46,726 Stage-2_0: 15(+2)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:47,731 Stage-2_0: 17(+2)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:48,738 Stage-2_0: 21(+1)/22    Stage-3_0: 0/1
> 2016-06-07 00:52:49,743 Stage-2_0: 22/22 Finished       Stage-3_0: 1/1
> Finished
> Status: Finished successfully in 13.07 seconds
> OK
> 100000000
> Time taken: 13.293 seconds, Fetched: 1 row(s)
>
>
>
> As seen from Spark GUI
>
>
>
>
>
> [image: Inline images 1]
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 7 June 2016 at 00:35, Markovitz, Dudu <dmarkovitz@paypal.com> wrote:
>
> Hi guys
>
>
>
> I would strongly recommend not working with zipped files.
>
>
>
> “Hadoop will not be able to split your file into chunks/blocks and run
> multiple maps in parallel”
>
> https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
>
>
>
> Dudu
>
>
>
> *From:* Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
> *Sent:* Tuesday, June 07, 2016 2:23 AM
> *To:* user <user@hive.apache.org>
> *Subject:* Re: Why does the user need write permission on the location of
> external hive table?
>
>
>
> Hi Igor,
>
>
>
> Hive can read from zipped files. If you are getting a lot of external
> files, it makes sense to zip them and store them in a staging HDFS directory.
>
>
>
> 1) Download, say, these CSV files into your local file system and use bzip2
> to compress them as part of ETL:
>
>
>
>  ls -l
> total 68
> -rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
> -rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
> -rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
> -rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
> -rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
> -rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2
>
>
>
> Then put these files in a staging directory on HDFS using a shell script:
>
>
>
>
>
> # compress each data file (skipping the driver script itself) and copy it to HDFS
> for FILE in `ls *.* | grep -v .ksh`
> do
>   echo "Bzipping ${FILE}"
>   /usr/bin/bzip2 "${FILE}"
>   hdfs dfs -copyFromLocal "${FILE}.bz2" "${TARGETDIR}"
> done
>
>
>
> OK now the files are saved in ${TARGETDIR}
>
>
>
> Now create the external table pointing at this staging directory. *No need
> to tell Hive that these files are compressed*. It knows how to handle them;
> they are stored as text files.
>
>
>
>
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>  INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet nw_10124772'
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/accounts/nw/10124772'
> TBLPROPERTIES ("skip.header.line.count"="1");
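>
> A quick sanity check that Hive reads the bzip2 files transparently (a sketch,
> using the staging table and columns defined above):
>
> hive> SELECT COUNT(1) FROM stg_t2;
> hive> SELECT invoicenumber, total FROM stg_t2 LIMIT 5;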
>
>
>
> Now create the Hive table internally. Note that I want this data to be
> compressed, so tell Hive to compress the table with ZLIB or SNAPPY:
>
>
>
>
>
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>  INVOICENUMBER          INT
> ,PAYMENTDATE            date
> ,NET                    DECIMAL(20,2)
> ,VAT                    DECIMAL(20,2)
> ,TOTAL                  DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet nw_10124772'
> CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ("orc.compress"="ZLIB");
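>
> To confirm the storage format and the orc.compress property after creation
> (a quick check, nothing more):
>
> hive> DESCRIBE FORMATTED t2;
> hive> SHOW CREATE TABLE t2;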
>
>
>
> Put data in the target table, doing the conversion and ignoring empty rows:
>
>
>
> INSERT INTO TABLE t2
> SELECT
>           INVOICENUMBER
>         , TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM stg_t2
> WHERE
> --      INVOICENUMBER > 0 AND
>         CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- exclude empty rows
> ;
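>
> A brief sanity check after the load (a sketch): the row count should roughly
> match the staging table minus empty rows, and the date conversion should give
> sensible bounds.
>
> hive> SELECT COUNT(1) FROM t2;
> hive> SELECT MIN(paymentdate), MAX(paymentdate) FROM t2;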
>
>
> So, pretty straightforward.
>
>
>
> Now to your question
>
>
>
> "it will affect performance, correct?"
>
>
>
>
>
> Compression is a well-established technique; it has been around in databases
> for a long time. Almost all RDBMSs (Oracle, Sybase, etc.) can compress data
> and backups through an option. Compression is more CPU-intensive than storing
> data uncompressed, but the database handles the conversion from compressed to
> uncompressed data whenever you read it. So yes, there is a performance price
> to pay, albeit a small one: the extra CPU used to uncompress the data and
> present it. That is a small price for reducing the storage cost of the data.
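>
> In the same spirit, intermediate and final job output can also be compressed
> from the Hive session; a minimal sketch, assuming the Snappy codec is
> available on the cluster (standard Hive/MapReduce property names):
>
> hive> set hive.exec.compress.intermediate=true;
> hive> set hive.exec.compress.output=true;
> hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;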
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 6 June 2016 at 23:18, Igor Kravzov <igork.inexso@gmail.com> wrote:
>
> Mich, will Hive automatically detect and unzip zipped files? Or is there a
> special option in the table configuration?
>
> It will affect performance, correct?
>
>
>
> On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
> Hi Sandeep.
>
>
>
> I tend to use Hive external tables as staging tables, but I still require
> write access to HDFS.
>
>
>
> Zip files work OK as well. For example, our CSV files are zipped using
> bzip2 to save space.
>
>
>
> However, you may try a temporary workaround by disabling permissions in
> $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
>
>
>
> <property>
>     <name>dfs.permissions</name>
>     <value>false</value>
> </property>
>
>
>
> There are other ways as well.
>
>
>
> Check this
>
>
>
> http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs
>
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 6 June 2016 at 21:00, Igor Kravzov <igork.inexso@gmail.com> wrote:
>
> I see the files have the extension .gz. Are these zipped?
>
> Did you try with unzipped files?
>
> Maybe in order to read the data Hive needs to unzip the files but does not
> have write permission?
>
> Just a wild guess...
>
>
>
> On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri <sandeep@cloudxlab.com>
> wrote:
>
> Yes, when I run hadoop fs it gives results correctly.
>
>
>
> hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/
> Found 30 items
> -rw-r--r--   3 hdfs hdfs       6148 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store
> -rw-r--r--   3 hdfs hdfs     803323 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz
> -rw-r--r--   3 hdfs hdfs     284355 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz
> ....
>
>
>
>
>
>
>
> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
> Is this location correct and valid?
>
>
>
> LOCATION '/data/SentimentFiles/*SentimentFiles*/upload/data/tweets_raw/'
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
>
> On 31 May 2016 at 08:50, Sandeep Giri <sandeep@cloudxlab.com> wrote:
>
> Hi Hive Team,
>
>
>
> As per my understanding, in Hive, you can create two kinds of tables:
> Managed and External.
>
>
>
> In the case of a managed table, you own the data, and hence when you drop
> the table the data is deleted.
>
>
>
> In the case of an external table, you don't have ownership of the data, and
> hence when you delete such a table the underlying data is not deleted; only
> the metadata is deleted.
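>
> To illustrate the distinction, a minimal sketch (the table names and location
> here are made up for the example):
>
> -- managed: DROP TABLE t_managed also deletes the files under the warehouse directory
> CREATE TABLE t_managed (id INT);
> -- external: DROP TABLE t_ext removes only the metastore entry; the files stay
> CREATE EXTERNAL TABLE t_ext (id INT) LOCATION '/data/example/dir';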
>
>
>
> Now, recently I have observed that you cannot create an external table
> over a location on which you don't have write (modification) permissions in
> HDFS. I completely fail to understand this.
>
>
>
> Use case: It is quite common that the data you are churning is huge and
> read-only. So, to churn such data via Hive, will you have to copy this huge
> data to a location on which you have write permissions?
>
>
>
> Please help.
>
>
>
> My data is located in an HDFS folder
> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/) on which I
> only have read-only permission, and I am trying to execute the following
> command:
>
>
>
> CREATE EXTERNAL TABLE tweets_raw (
>         id BIGINT,
>         created_at STRING,
>         source STRING,
>         favorited BOOLEAN,
>         retweet_count INT,
>         retweeted_status STRUCT<
>         text:STRING,
>         users:STRUCT<screen_name:STRING,name:STRING>>,
>         entities STRUCT<
>         urls:ARRAY<STRUCT<expanded_url:STRING>>,
>         user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
>         hashtags:ARRAY<STRUCT<text:STRING>>>,
>         text STRING,
>         user1 STRUCT<
>         screen_name:STRING,
>         name:STRING,
>         friends_count:INT,
>         followers_count:INT,
>         statuses_count:INT,
>         verified:BOOLEAN,
>         utc_offset:STRING, -- was INT but nulls are strings
>         time_zone:STRING>,
>         in_reply_to_screen_name STRING,
>         year int,
>         month int,
>         day int,
>         hour int
>         )
>         ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>         WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
>         LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
>         ;
>
>
>
> It throws the following error:
>
>
>
> FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:java.security.AccessControlException: Permission denied: user=sandeep, access=WRITE, inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>         at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1443)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
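>
> For what it is worth, the same WRITE check can be reproduced from the Hive CLI
> with the built-in dfs command; a sketch using the path above (the second
> command should fail with the same Permission denied error when write access is
> missing; _write_test is just a hypothetical scratch file name):
>
> hive> dfs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw;
> hive> dfs -touchz /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/_write_test;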
>
>
>
>
>
>
>
> --
>
> Regards,
>
> Sandeep Giri,
>
> +1-(347) 781-4573 (US)
>
> +91-953-899-8962 (IN)
>
> www.CloudxLab.com  (A Hadoop cluster for practicing)
>
>
>
>
>
>
>
>
>
> --
>
> Regards,
>
> Sandeep Giri,
>
> +1-(347) 781-4573 (US)
>
> +91-953-899-8962 (IN)
>
> www.CloudxLab.com
>
>
>
