hive-user mailing list archives

From Navis류승우 <navis....@nexr.com>
Subject Re: Pointing multiple external tables to the same location
Date Tue, 14 Jan 2014 02:10:25 GMT
In my test it worked (and it should).

CREATE EXTERNAL TABLE MasterTable (
  column1 STRING, column2 STRING)
  LOCATION 'hdfs://localhost:9000/home/navis/my_location';

CREATE VIEW IF NOT EXISTS View1 (column1, column2) AS SELECT column1,
column2 FROM MasterTable WHERE column1<30;

CREATE VIEW IF NOT EXISTS View2 (column1, column2) AS SELECT column1,
column2 FROM MasterTable WHERE column1<100;

SELECT View1.* FROM View1 JOIN View2 ON (View1.column1 = View2.column1);
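The plan below was produced by prefixing that join with EXPLAIN:

```sql
-- Show the execution plan; a single scan of the shared location should
-- appear, with one Filter operator per view predicate.
EXPLAIN
SELECT View1.* FROM View1 JOIN View2 ON (View1.column1 = View2.column1);
```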

Below is the result of EXPLAIN: it performs a single full scan of the master
table, handled by two TableScan (TS) operators, each followed by a Filter
(FIL) operator with the expected predicate.

        view1:view1:mastertable
          TableScan
            alias: mastertable
            Filter Operator
              predicate:
                  expr: (column1 < 30)
                  type: boolean

        view2:view2:mastertable
          TableScan
            alias: mastertable
            Filter Operator
              predicate:
                  expr: (column1 < 100)
                  type: boolean

      Truncated Path -> Alias:
        hdfs://localhost:9000/home/navis/my_location
[view1:view1:mastertable, view2:view2:mastertable]

Can I ask which version of Hive you are using?


2014/1/9 Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com>

> Hi Navis (and others),
>
> it seems my solution with views does not work after all. That is, it
> works fine as long as I do not use filter pushdown. My setup is something
> like the one below:
>
> CREATE EXTERNAL TABLE MasterTable (
>   column1 STRING,
>   column2 STRING,
>   column3 STRING,
>   column4 STRING)
>   PARTITIONED BY (partition INT)
>   ROW FORMAT SERDE 'MySerde'
>   STORED AS INPUTFORMAT 'MyInputFormat' OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>   LOCATION 'my_location';
>
> CREATE VIEW IF NOT EXISTS View1
>   (column1, column2, column3, column4, partition)
> PARTITIONED ON (partition)
> AS SELECT column1, column2, column3, column4, partition
> FROM MasterTable
> WHERE column1='value1' AND column2='value2';
>
> CREATE VIEW IF NOT EXISTS View2
>   (column1, column2, column3, column4, partition)
> PARTITIONED ON (partition)
> AS SELECT column1, column2, column3, column4, partition
> FROM MasterTable
> WHERE column1='value3' AND column2='value4';
>
>
> The following query works fine without filter pushdown:
> SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);
>
> Now if I enable filter pushdown (setting hive.optimize.index.filter=true)
> and apply the filter in my record reader, I do not get the correct result; I
> do not get any records back at all. It seems that only the second filter
> (column1='value3' AND column2='value4') is pushed to my record reader. The
> underlying file is only traversed once. I would have expected either that an
> OR expression was pushed down ((column1='value3' AND column2='value4') OR
> (column1='value1' AND column2='value2')) or that the underlying file was
> scanned twice, once with each separate expression.
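> A sketch of the reproduction, using only the property and query already
> mentioned above; the EXPLAIN variant shows which filter expressions end up
> under each TableScan in the plan:
>
> ```sql
> -- Enable pushdown of filter predicates toward the storage layer
> SET hive.optimize.index.filter=true;
> -- Inspect which predicates the planner attaches to each scan
> EXPLAIN
> SELECT View1.* FROM View1 JOIN View2 ON (View1.column3 = View2.column3);
> ```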
>
> Do you have any thoughts on this?
>
> Thanks,
> Petter
>
>
>
>
>
> 2013/12/22 Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com>
>
> Hi Navis,
>>
>> thank you for sorting this out! I have tried to get around this by using
>> views over a single master table in combination with UDFs. It seems to
>> work so far.
>>
>> /Petter
>>
>>
>> 2013/12/18 Navis류승우 <navis.ryu@nexr.com>
>>
>>> Hive uses a path-to-table (or partition) mapping internally (you can see
>>> it in MapredWork, etc.), which might cause the first table to be
>>> overwritten by the other.
>>>
>>> I haven't tried symlinks on HDFS, which could be a solution.
>>>
>>>
>>>
>>> 2013/12/12 Petter von Dolwitz (Hem) <petter.von.dolwitz@gmail.com>
>>>
>>> Hi,
>>>>
>>>> I have declared several external tables pointing to the same location.
>>>> The thing that tells these tables apart (apart from their names) is that
>>>> they have unique properties. These properties help me choose the correct
>>>> rows from the underlying file. I use a single storage handler (accompanied
>>>> by a single InputFormat and a single SerDe). The first columns in all
>>>> tables are the same, but the last (a struct) is unique and is constructed
>>>> by the SerDe (with help of the SerDe properties). A simplified version of
>>>> the tables looks like this:
>>>>
>>>> CREATE EXTERNAL TABLE Table1 (
>>>>   column1 STRING,
>>>>   column2 STRING)
>>>>   STORED BY 'MyStorageHandler'
>>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass1')
>>>>   LOCATION 'mylocation'
>>>>   TBLPROPERTIES('recordreader.filter'='table1_filter');
>>>>
>>>> CREATE EXTERNAL TABLE Table2 (
>>>>   column1 STRING,
>>>>   column2 STRING)
>>>>   STORED BY 'MyStorageHandler'
>>>>   WITH SERDEPROPERTIES ('ser.class'='MyStructSerializationClass2')
>>>>   LOCATION 'mylocation'
>>>>   TBLPROPERTIES('recordreader.filter'='table2_filter');
>>>>
>>>>
>>>> All works well for simple SELECT queries against the two tables. The
>>>> following query gives very strange results, though:
>>>>
>>>> SELECT * FROM (
>>>>   SELECT column1,'Table1' FROM Table1 WHERE column2 = 'myValue'
>>>>   union all
>>>>   SELECT column1,'Table2' FROM Table2 WHERE column2 = 'myValue'
>>>>   ) my_union
>>>> ORDER BY my_union.column1;
>>>>
>>>>
>>>> It seems that one job task is created per file stored in the table
>>>> location. This task gets the table properties from the second table, and
>>>> in the SerDe step later on the records get mixed up.
>>>>
>>>> I would have expected Hive to iterate over the source files twice, using
>>>> two different tasks (each with the correct table properties passed), in
>>>> order for this to work.
>>>>
>>>> Anyone here that can shed some light on this scenario?
>>>>
>>>> Thanks,
>>>> Petter
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
