hive-user mailing list archives

From Alan Gates <alanfga...@gmail.com>
Subject Re: ACID ORC file reader issue with uncompacted data
Date Thu, 30 Apr 2015 16:40:44 GMT
Are you using OrcInputFormat.getReader to get a reader?  If so, it 
should take care of these anomalies for you and mask your need to worry 
about delta versus base files.

Alan.

> Elliot West <mailto:teabot@gmail.com>
> April 29, 2015 at 9:40
> Hi,
>
> I'm implementing a tap to read Hive ORC ACID data into Cascading jobs 
> and I've hit a couple of issues for a particular scenario. The case I 
> have is when data has been written into a transactional table and a 
> compaction has not yet occurred. This can be recreated like so:
>
>     CREATE TABLE test_table ( id int, message string )
>       PARTITIONED BY ( continent string, country string )
>       CLUSTERED BY (id) INTO 1 BUCKETS
>       STORED AS ORC
>       TBLPROPERTIES ('transactional' = 'true');
>
>     INSERT INTO TABLE test_table
>     PARTITION (continent = 'Asia', country = 'India')
>     VALUES (1, 'x'), (2, 'y'), (3, 'z');
>
>
> This results in a dataset that contains only a delta file:
>
>     warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000
>
>
> I'm assuming that this scenario is valid - a user might insert new 
> data into a table and want to read it back at a time prior to the 
> first compaction. I can select the data back from this table in Hive 
> with no problem. However, for a number of reasons I'm finding it 
> rather tricky to do so programmatically. At this point I should 
> mention that reading base files or base+deltas is trouble free. The 
> issues I've encountered are as follows:
>
>  1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
>     ReaderOptions) fails if the directory specified by the path
>     ('warehouse/test_table/continent=Asia/country=India' in this case)
>     contains only a delta. Specifically it attempts to access
>     'delta_0000060_0000060' as if it were a file and therefore fails.
>     It appears to function correctly if the directory also contains a
>     base. We use this method to extract the typeInfo from the ORC file
>     and map it onto the user's declared fields.
>  2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
>     inconsistent in that it returns the path of the base if present,
>     otherwise the parent. This presents issues within Cascading (and I
>     assume other frameworks) that expect the paths returned by splits
>     to be at the same depth and for them to contain some kind of
>     'part' file leaf. In my example the path returned is
>     'warehouse/test_table/continent=Asia/country=India', if I had also
>     had a base I'd have seen
>     'warehouse/test_table/continent=Asia/country=India/base_0000006'.
>  3. The footers of the delta files do not contain the true field names
>     of the table. In my example I see '_col0:int,_col1:string' where
>     I'd expect 'id:int,message:string'. A base file, if present,
>     correctly declares the field names. We chose to access values by
>     field name rather than position so that users of our reader do not
>     need to declare the full schema to read partial data. However, the
>     placeholder names in the delta footers break this approach.
>
> I have (horrifically :) worked around issues 1 and 2 in my own code 
> and have some ideas to circumvent 3 but I wanted to get a feeling as 
> to whether I'm going against the tide and if my life might be easier 
> if I approached this another way.
>
> Thanks - Elliot.
>
>
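
One possible way around issue 3 is to translate the placeholder names found in a delta footer back to the table's real column names by position. The following is only a minimal sketch in plain Java, not Hive API code: the class `DeltaFieldNames` and the method `resolve` are hypothetical names, and it assumes that the declared column names are available from elsewhere (for example the table definition) and that `_colN` refers to the N-th declared column, counting from zero.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: maps placeholder field names such
// as _col0 onto the table's declared column names by position.
class DeltaFieldNames {

    /**
     * Returns the field names to use for a file whose footer may contain
     * placeholders. Assumption: _colN corresponds to the N-th declared
     * column (zero-based). Names that are not placeholders pass through
     * unchanged, so a base file with real names is left as it is.
     */
    static List<String> resolve(List<String> fileNames, List<String> tableNames) {
        List<String> resolved = new ArrayList<>();
        for (String name : fileNames) {
            if (name.matches("_col\\d{1,9}")) {
                int position = Integer.parseInt(name.substring("_col".length()));
                if (position >= tableNames.size()) {
                    throw new IllegalArgumentException(
                        "No declared column at position " + position);
                }
                resolved.add(tableNames.get(position));
            } else {
                resolved.add(name);
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Field names as reported for the delta file in the example above.
        List<String> fromDelta = Arrays.asList("_col0", "_col1");
        // Column names taken from the CREATE TABLE statement.
        List<String> declared = Arrays.asList("id", "message");
        System.out.println(resolve(fromDelta, declared)); // prints [id, message]
    }
}
```

With a mapping like this, values could still be looked up by field name whether the split was read from a base file or from a delta alone. It does nothing for issues 1 and 2, which concern paths rather than names.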
