hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gabi D <gabi...@gmail.com>
Subject Re: Create Partitioned Table w/ Partition= Substring of Raw Data
Date Thu, 22 Mar 2012 08:16:58 GMT
Dan,
the partition value does not look at your raw data, you assign a value to
the partition when you put the data in.
So what you need to do is this:
   Create table mytable (Time string, OtherData string)
   Partition by (danDate string);   (never a good idea to give fields a
name that's a reserved word, such as 'date')

and when you put the data in using load for example, you need to specify
the actual value you want to assign to 'danDate'.Meaning you should know
what are the date values that are inside your file (and choose only one).
(e.g.,
load data local inpath '<file location>' into table mytable partition(
danDate='01-01-2000' );
)

Note that danDate is not a field from within your files, but it is actually
used by hive to create a subdirectory under your table's hdfs location
which will be named:
'danDate=01-01-2000'  (if the value you gave it is 01-01-2000)
hive always shows it as the last field in your 'describe <table>' commands
though,again, it is not a regular field.

In this respect, it's better to use a date format that will be comparable,
such as 'yyyy-mm-dd' so you will be able to run selects such as:
select count(*) form mytable where danDate >='2012-01-01' and danDate
<'2012-02-01'

and hive will be able to run this using partition pruning (which means only
read files in the partition directories needed to satisfy your query).

I hope I didn't go over stuff you already know and that this helps...


On Wed, Mar 21, 2012 at 5:07 PM, Dan Y <dan.m.yelle@gmail.com> wrote:

> Hi All,
>
> My raw data looks like this:
>    DateTime,OtherData
>    01-01-2000-01:00:00,blablabla1
>    01-01-2000-04:00:00,blablabla2
>    01-02-2000-02:00:00,blablabla3
>
> I would like to partition on the datepart of DateTime.  What does *not *work,
> unfortunately, is this:
>    Create table mytable (DateTime string, OtherData string)
>    Partition by (*substr(DateTime,1,10)* string);
>
> I *wish* my raw data instead looked like:
>    Date*,Time*,OtherData
>    01-01-2000*,01:00:00*,blablabla1
>    01-01-2000*,04:00:00*,blablabla2
>    01-02-2000*,02:00:00*,blablabla3
>
> ...with Time a distinct field.  Then I could use:
>    Create table mytable (Time string, OtherData string)
>    Partition by (Date string);
>
> Any ideas for the best way to load my raw data into a hive table
> partitioned by the datepart of DateTime?  The files are gynormous, so
> manipulating the raw data outside of Hive is not feasible for this problem.
>  I would like to avoid using Select in the solution as well since my hive
> table will refer to zipped data (and the Select therefore would come with a
> big runtime cost).
>
> Thanks!!
> Dan
>

Mime
View raw message