hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Grover <mgro...@oanda.com>
Subject Re: Create Partitioned Table w/ Partition= Substring of Raw Data
Date Thu, 22 Mar 2012 13:33:14 GMT
Hi Dan,
What Gabi is right.

To solve your problem, you could have a non-partitioned table on the raw data and run a Hive
query that reads this raw data and inserts it into a partitioned table. Dynamic partitioning
could come in handy in that case. Look at https://cwiki.apache.org/Hive/tutorial.html#Tutorial-DynamicpartitionInsert
for details.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "Gabi D" <gabid33@gmail.com>
To: user@hive.apache.org
Sent: Thursday, March 22, 2012 4:16:58 AM
Subject: Re: Create Partitioned Table w/ Partition= Substring of Raw Data


Dan, 
the partition value does not look at your raw data, you assign a value to the partition when
you put the data in. 
So what you need to do is this: 

Create table mytable (Time string, OtherData string) 
Partition by (danDate string); (never a good idea to give fields a name that's a reserved
word, such as 'date') 

and when you put the data in using load for example, you need to specify the actual value
you want to assign to 'danDate'.Meaning you should know what are the date values that are
inside your file (and choose only one). 
(e.g., 
load data local inpath '<file location>' into table mytable partition( danDate='01-01-2000'
); 
) 

Note that danDate is not a field from within your files, but it is actually used by hive to
create a subdirectory under your table's hdfs location which will be named: 
'danDate=01-01-2000' (if the value you gave it is 01-01-2000) 
hive always shows it as the last field in your 'describe <table>' commands though,again,
it is not a regular field. 

In this respect, it's better to use a date format that will be comparable, such as 'yyyy-mm-dd'
so you will be able to run selects such as: 
select count(*) form mytable where danDate >='2012-01-01' and danDate <'2012-02-01'


and hive will be able to run this using partition pruning (which means only read files in
the partition directories needed to satisfy your query). 

I hope I didn't go over stuff you already know and that this helps... 



On Wed, Mar 21, 2012 at 5:07 PM, Dan Y < dan.m.yelle@gmail.com > wrote: 


Hi All, 


My raw data looks like this: 
DateTime,OtherData 
01-01-2000-01:00:00,blablabla1 
01-01-2000-04:00:00,blablabla2 
01-02-2000-02:00:00,blablabla3 


I would like to partition on the datepart of DateTime. What does not work, unfortunately,
is this: 

Create table mytable (DateTime string, OtherData string) 
Partition by ( substr(DateTime,1,10) string); 


I wish my raw data instead looked like: 

Date ,Time ,OtherData 

01-01-2000 ,01:00:00 ,blablabla1 
01-01-2000 ,04:00:00 ,blablabla2 
01-02-2000 ,02:00:00 ,blablabla3 


...with Time a distinct field. Then I could use: 
Create table mytable (Time string, OtherData string) 
Partition by (Date string); 


Any ideas for the best way to load my raw data into a hive table partitioned by the datepart
of DateTime? The files are gynormous, so manipulating the raw data outside of Hive is not
feasible for this problem. I would like to avoid using Select in the solution as well since
my hive table will refer to zipped data (and the Select therefore would come with a big runtime
cost). 


Thanks!! 
Dan 

Mime
View raw message