hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ravi Shetye <ravi.she...@vizury.com>
Subject Re: Hive on EMR on S3 : Beginner
Date Sat, 25 Aug 2012 04:58:48 GMT
Thanks Richin and Pedro,
So a final clarification

/Another way of doing apart from dynamic partition is if you can create 
your directories like below either manually or the ETL process you might 
be doing to get the table data it     is pretty easy./

/|*	s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=123/date=2012-01-01/log.gz*|/

/|*	s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=456/date=2012-01-02/log.gz*|/

/|*	s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=123/date=2012-01-03/log.gz*|/


1)Since I have used|*PARTITIONED BY (adv_id STRING,date STRING) *|Hive 
system will read the bucket name |*adv_id=123 *and understand that the 
data within this bucket can be accessed by a pseudo column adv_id?|
||2) |It would be wrong if I use ||*PARTITIONED BY (date 
STRING*||*,adv_id STRING*||*) *and keep the same bucket structure?|
|||3)Also it wont work if I store data in**||*s3:// 
<s3://logs/joined_analyze_files_hive/%27>ravi/logs/123/2012-01-01/log.gz ?*|

On 08/24/2012 07:44 PM, richin.jain@nokia.com wrote:
>
> Hi Ravi,
>
> Another way of doing apart from dynamic partition is if you can create 
> your directories like below either manually or the ETL process you 
> might be doing to get the table data it is pretty easy.
>
> |*s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=123/date=2012-01-01/log.gz*|
> |*s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=456/date=2012-01-02/log.gz*|
> |*s3://  <s3://logs/joined_analyze_files_hive/%27>ravi/logs/adv_id=123/date=2012-01-03/log.gz*|
>   
>
> and so on..
>
> and then when you define your table
>
> |*CREATE EXTERNAL TABLE results*|
> |*         (field1 STRING,*|
> |*         field2 STRING)*|
> |*         PARTITIONED BY (adv_id STRING,date STRING)*|
> |*         ROW FORMAT DELIMITED*|
> |*         FIELDS TERMINATED BY '\t'*|
> |*         LOCATION 's3://ravi/logs/';  <s3://logs/joined_analyze_files_hive/%27>*|
> |*  *|
> |*Once the table is created do a*|
> |*  *|
> |*ALTER TABLE results RECOVER PARTITIONS; -- This will load all the data from the partitions
in the results table*|
> |*  *|
> |*SELECT * FROM results; -- should return you all the rows.*|
> |*  *|
> |*For queries by partition you could do something like*|
> |*  *|
> |*SELECT * from results where adv-id=123 and date='2012-01-01'; -- this query will just
read that specific directory/partition*|
> |*   *|
>
> Here is an link to AWS forum, describing this in more detail - 
> https://forums.aws.amazon.com/thread.jspa?threadID=92862
>
> Thanks,
>
> Richin
>
> *From:*ext Ravi Shetye [mailto:ravi.shetye@vizury.com]
> *Sent:* Friday, August 24, 2012 9:09 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive on EMR on S3 : Beginner
>
> |  |
> |thanks for the reply|
> |  |
> |Let concentrate on the second case**|
> |*  *|
> |*CREATE EXTERNAL TABLE results (cookie STRING,*|
> |*   d2 STRING,*|
> |*   url STRING,*|
> |*   d4 STRING,*|
> |*   d5 STRING,*|
> |*   d6 STRING,*|
> |*   adv_id_dummy STRING,*|
> |*   timestp STRING,*|
> |*   ip STRING,*|
> |*   userAgent STRING,*|
> |*   stage STRING,*|
> |*   d12 STRING,*|
> |*   d13 STRING)*|
> |*PARTITION BY (adv_id,date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'*|
> |*LOCATION 's3://somewhere-outside-the-logs-tree';  <s3://logs/joined_analyze_files_hive/%27>*|
> |*  *|
> |*You can then*|
> |*  *|
> |*INSERT OVERWRITE TABLE results PARTITION (adv_id, date)*|
> |*  *|
> |*Is this all I need to do to load the data?*|
> |how will the system know what data will go into what partition?|
> |As I understand the partition columns should be psedo columns and not part of the actual
data.|
> |  |
> |Also if I have to load just 2 of the files say|
> |s3://logs/ad1date1.log.gz  and|
> |s3://logs/ad2date4.log.gz   how do I specify it.|
>
>
> On 08/24/2012 06:16 PM, Pedro Figueiredo wrote:
>
>     Hi,
>
>     On 24 Aug 2012, at 13:26, Ravi Shetye wrote:
>
>
>
>     I have the data in s3 bucket in the following manner
>
>     |s3://logs/ad1date1.log.gz|
>
>     |s3://logs/ad1date2.log.gz|
>
>     |s3://logs/ad1date3.log.gz|
>
>     |s3://logs/ad1date4.log.gz|
>
>     |s3://logs/ad2date1.log.gz|
>
>     |s3://logs/ad2date2.log.gz|
>
>     |s3://logs/ad2date3.log.gz|
>
>     |s3://logs/ad2date4.log.gz|
>
>     If you do
>
>
>
>     |CREATE EXTERNAL TABLE analyze_files_tab (cookie STRING,|
>
>     |d2 STRING,|
>
>     |url STRING,|
>
>     |d4 STRING,|
>
>     |d5 STRING,|
>
>     |d6 STRING,|
>
>     |adv_id_dummy STRING,|
>
>     |timestp STRING,|
>
>     |ip STRING,|
>
>     |userAgent STRING,|
>
>     |stage STRING,|
>
>     |d12 STRING,|
>
>     |d13 STRING)|
>
>     LOCATION 's3n://logs/' <s3n://logs/%27>;
>
>     you'll have all of it in a table. If you then want the results
>     partitioned, you can do
>
>     |CREATE EXTERNAL TABLE results (cookie STRING,|
>
>     |   d2 STRING,|
>
>     |   url STRING,|
>
>     |   d4 STRING,|
>
>     |   d5 STRING,|
>
>     |   d6 STRING,|
>
>     |   adv_id_dummy STRING,|
>
>     |   timestp STRING,|
>
>     |   ip STRING,|
>
>     |   userAgent STRING,|
>
>     |   stage STRING,|
>
>     |   d12 STRING,|
>
>     |   d13 STRING)|
>
>     |PARTITION BY (adv_id,date) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'|
>
>     |LOCATION 's3://somewhere-outside-the-logs-tree';  <s3://logs/joined_analyze_files_hive/%27>|
>
>     |  |
>
>     |You can then|
>
>     |  |
>
>     |INSERT OVERWRITE TABLE results PARTITION (adv_id, date)|
>
>     |<your query>|
>
>     |  |
>
>     |Note that to use dynamic partitions you have to first run|
>
>     |SET hive.exec.dynamic.partition.mode=nonstrict;|
>
>     |SET hive.exec.dynamic.partition=true;|
>
>     Cheers,
>
>     Pedro
>
>     Pedro Figueiredo
>     Skype: pfig.89clouds
>     http://89clouds.com/ - Big Data Consulting
>
>
>
>


Mime
View raw message