hive-user mailing list archives

From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: Why does the user need write permission on the location of external hive table?
Date Tue, 07 Jun 2016 00:27:11 GMT
I’ll rephrase:
I would strongly recommend not working with zipped files and MR.
Dudu
From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: Tuesday, June 07, 2016 2:59 AM
To: user <user@hive.apache.org>
Subject: Re: Why does the user need write permission on the location of external hive table?

Well, your mileage may vary, as usual.


I have not encountered any issues with it. Why do you want to use map-reduce anyway? Try the Spark
engine. It uses a DAG plus in-memory capabilities and is faster than map-reduce by an order of
magnitude. Map-reduce is deprecated, and you can configure Hive to use Spark from 0.14 onwards.
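
For example, a minimal sketch of switching engines for the current session (assuming Hive on Spark
is already configured on the cluster; dummy is the sample table used below):

-- Run subsequent queries with the Spark engine instead of map-reduce
set hive.execution.engine=spark;
select count(1) from dummy;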

hive> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider
using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive> select count(1) from dummy;
Query ID = hduser_20160607005023_eff56232-1890-4cd3-8999-a0ff87fedfb7
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Spark Job = a30151eb-3bab-400f-8ee7-74e4a2578dd8
Query Hive on Spark job[1] stages:
2
3
Status: Running (Hive on Spark job[1])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-06-07 00:52:37,678 Stage-2_0: 0(+2)/22     Stage-3_0: 0/1
2016-06-07 00:52:38,683 Stage-2_0: 2(+2)/22     Stage-3_0: 0/1
2016-06-07 00:52:39,689 Stage-2_0: 6(+2)/22     Stage-3_0: 0/1
2016-06-07 00:52:40,696 Stage-2_0: 8(+2)/22     Stage-3_0: 0/1
2016-06-07 00:52:41,700 Stage-2_0: 11(+0)/22    Stage-3_0: 0/1
2016-06-07 00:52:44,716 Stage-2_0: 11(+2)/22    Stage-3_0: 0/1
2016-06-07 00:52:45,721 Stage-2_0: 13(+2)/22    Stage-3_0: 0/1
2016-06-07 00:52:46,726 Stage-2_0: 15(+2)/22    Stage-3_0: 0/1
2016-06-07 00:52:47,731 Stage-2_0: 17(+2)/22    Stage-3_0: 0/1
2016-06-07 00:52:48,738 Stage-2_0: 21(+1)/22    Stage-3_0: 0/1
2016-06-07 00:52:49,743 Stage-2_0: 22/22 Finished       Stage-3_0: 1/1 Finished
Status: Finished successfully in 13.07 seconds
OK
100000000
Time taken: 13.293 seconds, Fetched: 1 row(s)

As seen from the Spark GUI:


[Inline image: Spark GUI view of the job]





Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 00:35, Markovitz, Dudu <dmarkovitz@paypal.com> wrote:
Hi guys

I would strongly recommend not working with zipped files.

“Hadoop will not be able to split your file into chunks/blocks and run multiple maps in
parallel”
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage

Dudu

From: Mich Talebzadeh [mailto:mich.talebzadeh@gmail.com]
Sent: Tuesday, June 07, 2016 2:23 AM
To: user <user@hive.apache.org>
Subject: Re: Why does the user need write permission on the location of external hive table?

Hi Igor,

Hive can read from zipped files. If you are getting a lot of external files, it makes sense
to zip them and store them in a staging HDFS directory.

1) Download, say, these CSV files into your local file system and use bzip2 to compress them as part
of the ETL:

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on HDFS using a shell script:


# TARGETDIR is the HDFS staging directory (the LOCATION used for the external table below)
for FILE in `ls *.*|grep -v .ksh`      # skip the driving .ksh script itself
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}               # compress the local file in place, producing ${FILE}.bz2
  hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK, now the files are saved in ${TARGETDIR}.

Now create the external table pointing at this staging directory. There is no need to tell Hive that
these files are compressed; it knows how to handle them. They are stored as text files:


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1");
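
A quick sanity check (illustrative only) that Hive reads straight from the compressed files:

-- Hive decompresses the .bz2 text files transparently on read
SELECT COUNT(1) FROM stg_t2;
SELECT * FROM stg_t2 LIMIT 5;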

Now create the internal (managed) Hive table. Note that I want this data to be compressed, so you
tell it to compress the table with ZLIB or SNAPPY:


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER          INT
,PAYMENTDATE            date
,NET                    DECIMAL(20,2)
,VAT                    DECIMAL(20,2)
,TOTAL                  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" );

Put the data in the target table, do the conversions, and ignore empty rows:

INSERT INTO TABLE t2
SELECT
          INVOICENUMBER
        , TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
        , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--        INVOICENUMBER > 0 AND
        CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- Exclude empty rows
;

So it is pretty straightforward.

Now to your question

"it will affect performance, correct?"


Compression is a well-established technique. It has been around in databases for a long time; almost
all RDBMSs (Oracle, Sybase, etc.) can compress data and backups through an option. Compression is more
CPU-intensive than no compression, but the database handles converting the data from compressed to
uncompressed transparently when you read it. So yes, there is a performance price to pay, albeit a
small one: more CPU to uncompress the data and present it. However, that is a small price to pay for
reducing the storage cost of the data.
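
As a hedged illustration of that trade-off inside Hive itself, these standard session settings spend a
little extra CPU to shrink intermediate and final output (whether they pay off depends on your workload):

-- Compress data shuffled between job stages
set hive.exec.compress.intermediate=true;
-- Compress the final job output and pick a codec
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;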

HTH













Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov <igork.inexso@gmail.com> wrote:
Mich, will Hive automatically detect and unzip zipped files? Or is there a special option in the
table configuration?
It will affect performance, correct?

On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
Hi Sandeep.

I tend to use Hive external tables as staging tables, but I still require write access
to HDFS.

Zip files work OK as well. For example, our CSV files are zipped using bzip2 to save space.

However, as a temporary solution you may disable permission checking in $HADOOP_HOME/etc/hadoop/hdfs-site.xml
(on newer Hadoop releases this property is named dfs.permissions.enabled):

<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>

There are other ways as well.

Check this

http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs

HTH







Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 21:00, Igor Kravzov <igork.inexso@gmail.com> wrote:
I see the files have the extension .gz. Are these zipped?
Did you try with unzipped files?
Maybe in order to read the data Hive needs to unzip the files but does not have write permission?
Just a wild guess...

On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri <sandeep@cloudxlab.com> wrote:
Yes, when I run hadoop fs it gives results correctly.

hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/
Found 30 items
-rw-r--r--   3 hdfs hdfs       6148 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store
-rw-r--r--   3 hdfs hdfs     803323 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz
-rw-r--r--   3 hdfs hdfs     284355 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz
....



On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
Is this location correct and valid?

LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 31 May 2016 at 08:50, Sandeep Giri <sandeep@cloudxlab.com> wrote:
Hi Hive Team,

As per my understanding, in Hive, you can create two kinds of tables: Managed and External.

In the case of a managed table, you own the data, and hence when you drop the table the data is deleted.

In the case of an external table, you don't have ownership of the data, and hence when you delete
such a table the underlying data is not deleted; only the metadata is deleted.
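
A minimal sketch of that difference (the table names and the LOCATION path here are only illustrative):

-- Managed table: DROP TABLE removes both the metadata and the data files
CREATE TABLE t_managed (id BIGINT);

-- External table: DROP TABLE removes only the metadata; the files at LOCATION stay
CREATE EXTERNAL TABLE t_external (id BIGINT)
LOCATION '/some/existing/hdfs/dir';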

Now, recently I have observed that you cannot create an external table over a location on
which you don't have write (modification) permissions in HDFS. I completely fail to understand
this.

Use case: It is quite common that the data you are churning is huge and read-only. So, to
churn such data via Hive, will you have to copy this huge data to a location on which you
have write permissions?

Please help.

My data is located in an HDFS folder (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)
on which I have only read-only permission, and I am trying to execute the following command:

CREATE EXTERNAL TABLE tweets_raw (
        id BIGINT,
        created_at STRING,
        source STRING,
        favorited BOOLEAN,
        retweet_count INT,
        retweeted_status STRUCT<
        text:STRING,
        users:STRUCT<screen_name:STRING,name:STRING>>,
        entities STRUCT<
        urls:ARRAY<STRUCT<expanded_url:STRING>>,
        user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
        hashtags:ARRAY<STRUCT<text:STRING>>>,
        text STRING,
        user1 STRUCT<
        screen_name:STRING,
        name:STRING,
        friends_count:INT,
        followers_count:INT,
        statuses_count:INT,
        verified:BOOLEAN,
        utc_offset:STRING, -- was INT but nulls are strings
        time_zone:STRING>,
        in_reply_to_screen_name STRING,
        year int,
        month int,
        day int,
        hour int
        )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
        LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
        ;

It throws the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException:
Permission denied: user=sandeep, access=WRITE, inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1443)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)



--
Regards,
Sandeep Giri,
+1-(347) 781-4573 (US)
+91-953-899-8962 (IN)
www.CloudxLab.com (A Hadoop cluster for practicing)





--
Regards,
Sandeep Giri,
+1-(347) 781-4573 (US)
+91-953-899-8962 (IN)
www.CloudxLab.com




