hive-user mailing list archives

From Ashutosh Chauhan <hashut...@apache.org>
Subject Re: One Schema Per Partition? (Multiple schemas per table?)
Date Tue, 30 Aug 2011 05:46:46 GMT
Hi Tim,

I figured that out from reading both the code and the manual. I don't think
it's explicitly documented anywhere, so it would be great if you documented
this. That page looks like the right place for this piece of information to
live. Thanks for the help in making Hive better.

Ashutosh
On Mon, Aug 29, 2011 at 15:26, Time Less <timelessness@gmail.com> wrote:

> Hello, Ashutosh,
>
> I did nothing like that... :)
>
> It seems the problem here is I didn't RTFM. Perchance, could you say where
> you figured this out? I am going from the Hive DDL page on confluence[1],
> and although it mentions partitions and it mentions the "replace columns"
> you've mentioned here, it doesn't mention them together that I see. I would
> like to document this for future generations. Is that the proper page where
> I'd document this?
>
> I would probably explicitly create a section titled "Different Schemas per
> Partition" and basically give the syntax you gave (from the quoted message,
> assuming it works after I test it).
>
>
> [1]
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>
>
> On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan <hashutosh@apache.org> wrote:
>
>> Hey Tim,
>>
>> Hive does support different schemas for different partitions. If your
>> data comes out garbled, that seems to be a bug. In your case, does the
>> following sequence of steps resemble what you did?
>>
>> a) create table tbl (id int, name string, level int) partitioned by
>> (date string);
>> b) -- add partitions
>> c) alter table tbl replace columns (id int, level int, name_id int);
>> d) -- add more partitions.
>>
>> If you do select * from tbl, then this should work. You need not
>> rewrite any of your data. Can you provide more info about what output you
>> were expecting and what you got? Are there any error logs?
>>
>> Ashutosh
>>
>>
>> On Mon, Aug 22, 2011 at 14:34, Time Less <timelessness@gmail.com> wrote:
>>
>>> I found a set of slides from Facebook online about Hive that claims you
>>> can have a schema per partition in the table. This is exciting to us
>>> because we have a table like so:
>>>
>>> id     int
>>> name   string
>>> level  int
>>> date   string
>>>
>>> And it's broken up into partitions by date. However, on a particular date
>>> last year, the table dramatically changed its schema to:
>>>
>>> id       int
>>> level    int
>>> date     string
>>> name_id  int
>>>
>>> So now if I do "select * from table" in hive, the data is completely
>>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>>> considering re-writing the datafiles so they're the same before/after that
>>> date, but if Hive supports having two entirely different schemas depending
>>> on the partition, that'd be really convenient, since these datafiles are
>>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>>> how the datafile looked back then...).
>>>
>>> This page:
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>>> doesn't seem to have an appropriate example, so I'm left wondering.
>>>
>>> Has anyone done anything like this?
>>>
>>> --
>>> Tim Ellis
>>> Data Architect, Riot Games
>>>
>>>
>>
>
>
> --
> Tim
>
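
For reference, steps a) to d) quoted above can be written out as a HiveQL sketch. This is an illustration under stated assumptions, not a tested recipe from the thread: the table name tbl and the column names come from the messages, while the partition column dt (used in place of "date", which is also a type name in later Hive versions) and the partition values are hypothetical.

```sql
-- Assumptions: "dt" and the partition values below are hypothetical;
-- the table and column names are taken from the thread.

-- a) Create the table with the original column layout.
CREATE TABLE tbl (id INT, name STRING, level INT)
PARTITIONED BY (dt STRING);

-- b) Add a partition for data written in the original layout.
ALTER TABLE tbl ADD PARTITION (dt = '2010-01-01');

-- c) Replace the table's column list with the new layout.
ALTER TABLE tbl REPLACE COLUMNS (id INT, level INT, name_id INT);

-- d) Add a partition for data written in the new layout.
ALTER TABLE tbl ADD PARTITION (dt = '2011-01-01');

-- Inspect the column list the metastore records for each partition.
DESCRIBE EXTENDED tbl PARTITION (dt = '2010-01-01');
DESCRIBE EXTENDED tbl PARTITION (dt = '2011-01-01');
```

Comparing the two DESCRIBE outputs is one way to check whether the older partition kept its original column list before relying on select * across both.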
