hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Petter von Dolwitz (Hem)" <petter.von.dolw...@gmail.com>
Subject Re: Pointing multiple external tables to the same location
Date Sun, 22 Dec 2013 12:02:00 GMT
Hi Navis,

thank you for sorting this out! I have tried getting around this by using
views towards a single master table instead in combination with UDFs
 instead . Seems to work so far.

/Petter


2013/12/18 Navis류승우 <navis.ryu@nexr.com>

> Hive uses path to table(or partition) mapping internally (you can see that
> in MapredWork, etc.), which might caused first table overwritten by other.
>
> I didn't tried symlink on hdfs, which could be a solution.
>
>
>
> 2013/12/12 Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com>
>
> Hi,
>>
>> I have declared several external tables pointing to the same location.
>> The things that tells these tables apart (apart from their names) is that
>> they have unique properties. These properties help me choose the correct
>> rows from the underlying file. I use a single storage handler (accompanied
>> by a single InputFormat and a single Serde) . The first columns in all
>> tables are the same but the last (a struct) is unique and
>> is constructed from the Serde (with help of the serde properties). A
>> simplified version of the tables look like so:
>>
>> CREATE EXTERNAL TABLE Table1 (
>>   column1 STRING,
>>   column2 STRING)
>>   STORED BY 'MyStorageHandler'
>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>   LOCATION 'mylocation'
>>   TBLPROPERTIES('recordreader.filter'='table1_filter');
>>
>> CREATE EXTERNAL TABLE Table2 (
>>   column1 STRING,
>>   column2 STRING)
>>   STORED BY 'MyStorageHandler'
>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>   LOCATION 'mylocation'
>>   TBLPROPERTIES('recordreader.filter'='table2_filter');
>>
>>
>> All works well for simple select queries towards the two tables. The
>> following query gives very strange results though:
>>
>> SELECT * FROM (
>>   SELECT column1,'Table1' FROM Table1 WHERE column2 = 'myValue'
>>   union all
>>   SELECT column1,'Table2' FROM Table2 WHERE column2 = 'myValue'
>>   ) my_union
>> ORDER BY my_union.column1
>>
>>
>> It seems like one job task is created per file stored in the table
>> location. This task gets the table properties from the second table and in
>> the SerDe-step later on it seems like the records gets mixed up.
>>
>> I would have expected that hive would need to iterated the source files
>> two times using two different tasks (with the correct table properties
>> passed) in order to get this to work.
>>
>> Anyone here that can shed some light on this scenario?
>>
>> Thanks,
>> Petter
>>
>>
>>
>>
>>
>>
>>
>

Mime
View raw message