hive-user mailing list archives

From Time Less <timelessn...@gmail.com>
Subject Re: One Schema Per Partition? (Multiple schemas per table?)
Date Mon, 29 Aug 2011 22:26:23 GMT
Hello, Ashutosh,

I did nothing like that... :)

It seems the problem here is I didn't RTFM. Perchance, could you say where
you figured this out? I am going from the Hive DDL page on Confluence[1],
and although it mentions partitions and it mentions the "replace columns"
you've mentioned here, it doesn't mention them together as far as I can
see. I would like to document this for future generations. Is that the
proper page where I'd document this?

I would probably explicitly create a section titled "Different Schemas per
Partition" and basically give the syntax you give (from the quoted message
below, assuming it works once I test it).
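
For reference, the quoted steps a) through d) might look roughly like this
once written out as HiveQL. This is an untested sketch, not a confirmed
recipe: the partition column name dt (used instead of date to sidestep any
clash with the DATE type keyword in other Hive versions) and the two
partition values are placeholders made up for illustration.

```sql
-- Untested sketch; tbl, dt, and the partition values are illustrative only.

-- a) Create the table with the original column layout.
CREATE TABLE tbl (id INT, name STRING, level INT)
PARTITIONED BY (dt STRING);

-- b) Add partitions whose files use the original layout.
ALTER TABLE tbl ADD PARTITION (dt = '2010-01-01');

-- c) Replace the table's columns with the new layout.
ALTER TABLE tbl REPLACE COLUMNS (id INT, level INT, name_id INT);

-- d) Add partitions whose files use the new layout.
ALTER TABLE tbl ADD PARTITION (dt = '2011-01-01');

-- Query across both old and new partitions.
SELECT * FROM tbl;
```

Whether the old partitions come back sensibly from that final select is
exactly the thing that still needs testing.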


[1]
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements

On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <hashutosh@apache.org> wrote:

> Hey Tim,
>
> Hive does support different schemas for different partitions. If your data
> comes out garbled, that seems to be a bug then. In your case, does the
> following sequence of steps resemble what you did?
>
> a) create table tbl (id int, name string, level int) partitioned by
> (date string);
> b) -- add partitions
> c) alter table tbl replace columns (id int, level int, name_id int);
> d) -- add more partitions.
>
> If you do select * from tbl, then this should work. You need not rewrite
> any of your data. Can you provide more info about what output you were
> expecting and what you got? Are there any error logs?
>
> Ashutosh
>
>
> On Mon, Aug 22, 2011 at 14:34, Time Less <timelessness@gmail.com> wrote:
>
>> I found a set of slides from Facebook online about Hive that claims you
>> can have a schema per partition in the table. This is exciting to us,
>> because we have a table like so:
>>
>> id     int
>> name   string
>> level  int
>> date   string
>>
>> And it's broken up into partitions by date. However, on a particular date
>> last year, the table dramatically changed its schema to:
>>
>> id       int
>> level    int
>> date     string
>> name_id  int
>>
>> So now if I do "select * from table" in hive, the data is completely
>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>> considering re-writing the datafiles so they're the same before/after that
>> date, but if Hive supports having two entirely different schemas depending
>> on the partition, that'd be really convenient, since these datafiles are
>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>> how the datafile looked back then...).
>>
>> This page:
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>> doesn't seem to have an appropriate example, so I'm left wondering.
>>
>> Has anyone done anything like this?
>>
>> --
>> Tim Ellis
>> Data Architect, Riot Games
>>
>>
>


-- 
Tim
