From Eugene Koifman <>
Subject Re: ACID ORC file reader issue with uncompacted data
Date Wed, 29 Apr 2015 17:37:35 GMT
This is not an answer to your question, but FYI.  The work in
will change how the delta files are named which may affect your work.
Once that work is complete, the deltas will be named delta_xxx_yyy_zz, so you may have delta_002_002_1, delta_002_002_2, and so on.
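
For a reader that has to cope with both the current and the new naming, a directory-name parser might look roughly like the sketch below. It assumes the three fields are the minimum transaction id, the maximum transaction id and a statement id, which is an interpretation rather than something stated above. `DeltaName` and `parse_delta_name` are hypothetical names, not Hive APIs.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: delta_<min txn>_<max txn>[_<statement id>].
# The third field is optional so that directories written under the
# older two-field naming still parse.
_DELTA_RE = re.compile(r"^delta_(\d+)_(\d+)(?:_(\d+))?$")


class DeltaName(NamedTuple):
    min_txn: int
    max_txn: int
    statement_id: Optional[int]  # None for the older two-field form


def parse_delta_name(dir_name: str) -> Optional[DeltaName]:
    """Return the parsed fields, or None if dir_name is not a delta directory."""
    match = _DELTA_RE.match(dir_name)
    if match is None:
        return None
    min_txn, max_txn, stmt = match.groups()
    return DeltaName(int(min_txn), int(max_txn), None if stmt is None else int(stmt))
```

With this, 'delta_0000060_0000060' and 'delta_002_002_1' both parse, while a name such as 'base_0000006' is rejected.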

Reading data before 1st compaction is definitely a valid use case.

From: Elliot West
Reply-To: "<>" <<>>
Date: Wednesday, April 29, 2015 at 9:40 AM
To: "<>" <<>>
Subject: ACID ORC file reader issue with uncompacted data


I'm implementing a tap to read Hive ORC ACID data into Cascading jobs and I've hit a couple
of issues for a particular scenario. The case I have is when data has been written into a
transactional table and a compaction has not yet occurred. This can be recreated like so:

CREATE TABLE test_table ( id int, message string )
  PARTITIONED BY ( continent string, country string )
  TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE test_table
PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');

This results in a dataset that contains only a delta file:


I'm assuming that this scenario is valid - a user might insert new data into a table and want
to read it back at a time prior to the first compaction. I can select the data back from this
table in Hive with no problem. However, for a number of reasons I'm finding it rather tricky
to do so programmatically. At this point I should mention that reading base files or base+deltas
is trouble free. The issues I've encountered are as follows:

  1., ReaderOptions) fails if
the directory specified by the path ('warehouse/test_table/continent=Asia/country=India' in
this case) contains only a delta. Specifically it attempts to access 'delta_0000060_0000060'
as if it were a file and therefore fails. It appears to function correctly if the directory
also contains a base. We use this method to extract the typeInfo from the ORC file and build
a mapping to the user's declared fields.
  2. is seemingly inconsistent in that
it returns the path of the base if present, otherwise the parent. This presents issues within
Cascading (and I assume other frameworks) that expect the paths returned by splits to be at
the same depth and to contain some kind of 'part' file leaf. In my example the path
returned is 'warehouse/test_table/continent=Asia/country=India'; if I had also had a base
I'd have seen 'warehouse/test_table/continent=Asia/country=India/base_0000006'.
  3. The footers of the delta files do not contain the true field names of the table. In
my example I see '_col0:int,_col1:string' where I'd expect 'id:int,message:string'. A base
file, if present, correctly declares the field names. We chose to access values by field name
rather than position so that users of our reader do not need to declare the full schema to
read partial data; however, this behaviour trips that approach up.
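
To make issue 3 concrete, one possible way round it is to translate the positional placeholders back into real names using a schema obtained from somewhere other than the file footer. The sketch below assumes the table's column names are available in declaration order (for example from the metastore, which is an assumption here) and that placeholders always have the form '_colN'. `resolve_field_names` is a hypothetical helper, not part of Hive or ORC.

```python
import re
from typing import List

# Placeholder names as seen in the delta footer, e.g. '_col0', '_col1'.
_POSITIONAL = re.compile(r"^_col(\d+)$")


def resolve_field_names(footer_names: List[str], table_names: List[str]) -> List[str]:
    """Map '_colN' placeholders to the table's column names by position.

    Names that are not placeholders (as in a base file) are kept unchanged.
    """
    resolved = []
    for name in footer_names:
        match = _POSITIONAL.match(name)
        if match is None:
            resolved.append(name)
            continue
        index = int(match.group(1))
        if index >= len(table_names):
            raise ValueError(f"{name} has no counterpart in the table schema")
        resolved.append(table_names[index])
    return resolved
```

For the example table, the footer names '_col0' and '_col1' would resolve to 'id' and 'message', after which values can again be looked up by name.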

I have (horrifically :) worked around issues 1 and 2 in my own code and have some ideas to
circumvent 3 but I wanted to get a feeling as to whether I'm going against the tide and if
my life might be easier if I approached this another way.
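
Purely as an illustration of the kind of normalisation issue 2 calls for, and not the workaround mentioned above, split paths could be reduced to the partition directory so that they all sit at the same depth. The sketch assumes that base and delta directories can be recognised by the 'base_' and 'delta_' prefixes alone; `partition_dir` is a hypothetical helper.

```python
def partition_dir(split_path: str) -> str:
    """Strip a trailing base_* or delta_* component from a split path.

    Paths that already point at the partition directory are returned
    unchanged (apart from any trailing slash).
    """
    trimmed = split_path.rstrip("/")
    parent, _, leaf = trimmed.rpartition("/")
    # Assumption: only ACID base/delta directories use these prefixes.
    if parent and leaf.startswith(("base_", "delta_")):
        return parent
    return trimmed
```

This only makes the depth consistent. It does not provide the 'part' file leaf that some frameworks expect, so that would still need separate handling.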

Thanks - Elliot.

