hive-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject ACID ORC file reader issue with uncompacted data
Date Wed, 29 Apr 2015 16:40:53 GMT
Hi,

I'm implementing a tap to read Hive ORC ACID data into Cascading jobs and
I've hit a couple of issues for a particular scenario. The case I have is
when data has been written into a transactional table and a compaction has
not yet occurred. This can be recreated like so:

CREATE TABLE test_table ( id int, message string )
  PARTITIONED BY ( continent string, country string )
  CLUSTERED BY (id) INTO 1 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE test_table
PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');


This results in a dataset that contains only a delta file:

warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000
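
For illustration, here is a minimal sketch of how the delta directory name in the path above could be picked apart. It assumes the name follows the pattern delta_<min>_<max> with zero-padded numbers (which I take to be a transaction ID range). The class DeltaDirName is hypothetical and not part of Hive.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeltaDirName {
    // Assumed layout: delta_<min>_<max>, zero-padded, as in the path above.
    private static final Pattern DELTA = Pattern.compile("delta_(\\d{1,18})_(\\d{1,18})");

    /** Returns {min, max} for a delta directory name, or null if the name does not match. */
    public static long[] parse(String dirName) {
        Matcher m = DELTA.matcher(dirName);
        if (!m.matches()) {
            return null;
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        long[] range = parse("delta_0000060_0000060");
        System.out.println(range[0] + ".." + range[1]); // prints 60..60
    }
}
```

A name such as 'base_0000006' does not match the pattern, so parse returns null for it.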


I'm assuming that this scenario is valid - a user might insert new data
into a table and want to read it back at a time prior to the first
compaction. I can select the data back from this table in Hive with no
problem. However, for a number of reasons I'm finding it rather tricky to
do so programmatically. At this point I should mention that reading base
files or base+deltas is trouble free. The issues I've encountered are as
follows:

   1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
   ReaderOptions) fails if the directory specified by the path ('
   warehouse/test_table/continent=Asia/country=India' in this case)
   contains only a delta. Specifically it attempts to access
   'delta_0000060_0000060' as if it were a file and therefore fails. It
   appears to function correctly if the directory also contains a base. We use
   this method to extract the typeInfo from the ORC file and build a mapping
   between it and the user's declared fields.
   2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
   inconsistent in that it returns the path of the base if present, otherwise
   the parent. This presents issues within Cascading (and I assume other
   frameworks) that expect the paths returned by splits to be at the same
   depth and for them to contain some kind of 'part' file leaf. In my example
   the path returned is 'warehouse/test_table/continent=Asia/country=India';
   had I also had a base, I'd have seen '
   warehouse/test_table/continent=Asia/country=India/base_0000006'.
   3. The footers of the delta files do not contain the true field names of
   the table. In my example I see '_col0:int,_col1:string' where I'd expect
   'id:int,message:string'. A base file, if present, correctly declares the
   field names. We chose to access values by field name rather than position
   so that users of our reader do not need to declare the full schema to read
   partial data; however, the positional names in the delta footers break
   this approach.
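
One possible way around issue 3, sketched under assumptions: if the table's declared column names can be obtained elsewhere (for example from the metastore), footer names of the form '_colN' could be resolved by position. The class PositionalColumnNames and its method are hypothetical, use only the JDK, and are not necessarily how a real reader should handle this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionalColumnNames {
    /**
     * Maps each footer name of the form _col<N> to the declared name at position N.
     * Names that do not match, or whose index is out of range, map to themselves.
     */
    public static Map<String, String> toDeclared(List<String> footerNames,
                                                 List<String> declaredNames) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String name : footerNames) {
            String resolved = name;
            if (name.matches("_col\\d{1,9}")) {
                int index = Integer.parseInt(name.substring("_col".length()));
                if (index < declaredNames.size()) {
                    resolved = declaredNames.get(index);
                }
            }
            mapping.put(name, resolved);
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> m = toDeclared(
                Arrays.asList("_col0", "_col1"),   // names as seen in the delta footer
                Arrays.asList("id", "message"));   // assumed to come from the table definition
        System.out.println(m); // prints {_col0=id, _col1=message}
    }
}
```

Names that already look like real field names (as in a base file) pass through unchanged, so the same lookup could be applied to both bases and deltas.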

I have (horrifically :) worked around issues 1 and 2 in my own code and
have some ideas to circumvent 3 but I wanted to get a feeling as to whether
I'm going against the tide and if my life might be easier if I approached
this another way.

Thanks - Elliot.
