hive-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: ACID ORC file reader issue with uncompacted data
Date Mon, 18 May 2015 10:08:09 GMT
Thanks for the reply Alan.

I see your point regarding the multiple delta directories and why it would
not make sense to include one of them as a leaf. However, it seems that
with this scheme one cannot abstractly work with such paths. One must have
knowledge of the underlying format to understand why some paths have 'part'
leaves and others do not. Conversely it could be argued that Cascading is
making an incorrect assumption about the structure of paths (i.e. that there
will always be a 'part' leaf).
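
To make that assumption concrete, here is a minimal sketch of leaf-based key
extraction. It is not Cascading's actual code: the class name, the method name
and the use of plain '/'-separated strings are all assumptions made for
illustration. It drops the table prefix and then blindly treats the last
element as a leaf.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT Cascading's real implementation: derives partition
// keys from a split path by assuming the last element is always a leaf.
final class NaivePartitionKeys {

    // Both paths are '/'-separated; splitPath is expected to start with tablePath.
    static String derive(String tablePath, String splitPath) {
        List<String> table = Arrays.asList(tablePath.split("/"));
        List<String> split = Arrays.asList(splitPath.split("/"));
        // Drop the table prefix at the front and the assumed leaf at the end.
        List<String> keys = split.subList(table.size(), split.size() - 1);
        return String.join("/", keys);
    }

    public static void main(String[] args) {
        String table = "warehouse/test_table";
        String partition = table + "/continent=Asia/country=India";
        System.out.println(derive(table, partition + "/part-000001"));
        System.out.println(derive(table, partition + "/base_0000006"));
        // Delta only: the path stops at the partition directory, so the last
        // partition key is mistaken for the leaf and dropped.
        System.out.println(derive(table, partition));
    }
}
```

With a 'part' or 'base' leaf both partition keys survive; for the delta-only
path only 'continent=Asia' remains.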

Regarding your suggestion, unfortunately I think this is out of my control
as I have to work with the way in which Cascading operates. We configure
Cascading to use OrcInputFormat and internally it calls
OrcInputFormat.getSplits(). This returns OrcSplits and Cascading then calls
getPath on whichever splits are returned. Ideally this mechanism has to
work for any InputFormat and any InputSplit type that returns a path.

Currently I have a patch to massage the paths before they are used but this
places ORC/ACID specific code in what should be a very general purpose
Cascading class:
https://github.com/HotelsDotCom/cascading-hive/blob/acid/src/main/java/cascading/tap/hive/HivePartitionTap.java#L105

Conceptually it seems odd that the path has a different number of
elements in the event that the dataset is ACID, contains one or more
operations, and has not yet been compacted. It feels like an internal
implementation detail that is leaked into the publicly used path. In this
scenario would it not instead be possible for the path to contain an empty
base folder or perhaps a faux place-holder part leaf:

warehouse/test_table/continent=Asia/country=India/base_0000000 // contains nothing, ignored by ORC
warehouse/test_table/continent=Asia/country=India/deltas_only  // not actually a folder on disk, ignored by ORC


In this way any framework using elements within the split path can happily
be oblivious to the leaf structure and specifically ORC+ACID. They can rely
on split paths for a given data set always containing a fixed number of
elements.
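
As a rough illustration of the effect, the sketch below emulates the
placeholder leaf on the consumer side. Everything in it is hypothetical: the
class and method names are invented, the placeholder name is borrowed from the
'deltas_only' example, and it assumes the number of partition columns is
already known (for instance from the meta store). It is not the patch linked
earlier, and it is not something Hive provides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: emulates the proposed placeholder leaf so that every
// split path for a data set has the same number of elements.
final class PlaceholderLeaf {

    // Assumed placeholder name, taken from the 'deltas_only' suggestion.
    static final String PLACEHOLDER = "deltas_only";

    // Appends the placeholder when the path stops at the partition directory,
    // i.e. it has exactly tableDepth + partitionColumns elements.
    static String normalize(String splitPath, int tableDepth, int partitionColumns) {
        int elements = splitPath.split("/").length;
        if (elements == tableDepth + partitionColumns) {
            return splitPath + "/" + PLACEHOLDER;
        }
        return splitPath;
    }

    // Leaf-oblivious extraction: skip the table elements, drop the last element.
    static String partitionKeys(String splitPath, int tableDepth) {
        List<String> parts = Arrays.asList(splitPath.split("/"));
        return String.join("/", parts.subList(tableDepth, parts.size() - 1));
    }

    public static void main(String[] args) {
        String deltaOnly = "warehouse/test_table/continent=Asia/country=India";
        String normalized = normalize(deltaOnly, 2, 2);
        System.out.println(normalized);
        System.out.println(partitionKeys(normalized, 2));
    }
}
```

Once the path is normalized, the extraction step no longer needs to know
whether the data came from a base, a delta, or a plain 'part' file.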

Cheers - Elliot.

On 14 May 2015 at 18:27, Alan Gates <alanfgates@gmail.com> wrote:

> Ok, I think I understand now.  I also get why OrcSplit.getPath returns
> just up to the partition keys and not the delta directories.  In most cases
> there will be more than one delta directory, so which one would it pick?
>
> It seems you already know the file type you are working on before you call
> this (since you're calling OrcSplit.getPath rather than
> FileSplit.getPath).  The best way forward might be to make a utility method
> in Hive that takes the file type and the result of getPath and then returns
> you the partition keys.  That way you're not left putting ORC specific code
> in Cascading.
>
> Alan.
>
>   Elliot West <teabot@gmail.com>
>  May 1, 2015 at 3:04
> Yes and no :-) We're initially using OrcFile.createReader to create a
> Reader so that we can obtain the schema (StructTypeInfo) from the file. I
> don't believe this is possible with OrcInputFormat.getReader(?):
>
> Reader orcReader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
>
> ObjectInspector inspector = orcReader.getObjectInspector();
> StructTypeInfo typeInfo = (StructTypeInfo)
>     TypeInfoUtils.getTypeInfoFromObjectInspector(inspector);
>
>
> In the case of transactional datasets we've worked around this by
> generating the StructTypeInfo from schema data retrieved from the meta
> store as we need to interact with the meta store anyway to correctly read the
> data. Even if OrcFile.createReader were to transparently read delta only
> datasets, it wouldn't get us much further currently as the delta files lack
> the correct column names and the Reader would thus return an unusable
> StructTypeInfo.
>
> The org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() issue is
> currently our biggest pain point as it requires us to place ORC+ACID
> specific code in what should be a general framework. To illustrate the
> problem further, somewhere in cascading there is some code that extracts
> partition keys from split paths. It extracts keys by chopping off the
> 'part' leaf and removing the preceding parent:
>
> *Text etc:*
> OrcSplit.getPath() returns:
> 'warehouse/test_table/continent=Asia/country=India/part-000001'
> Partition keys derived as: 'continent=Asia/country=India' (CORRECT)
>
> *Orc base+delta:*
> OrcSplit.getPath() returns:
> 'warehouse/test_table/continent=Asia/country=India/base_0000006'
> Partition keys derived as: 'continent=Asia/country=India' (CORRECT)
>
> *Orc delta only etc:*
> OrcSplit.getPath() returns:
> warehouse/test_table/continent=Asia/country=India
> Partition keys derived as: 'continent=Asia' (INCORRECT)
>
> Cheers - Elliot.
>
>
>
>
>
> On 30 April 2015 at 17:40, Alan Gates <alanfgates@gmail.com> wrote:
>
>> Are you using OrcInputFormat.getReader to get a reader?  If so, it should
>> take care of these anomalies for you and mask your need to worry about
>> delta versus base files.
>>
>> Alan.
>>
>>   Elliot West <teabot@gmail.com>
>>  April 29, 2015 at 9:40
>> Hi,
>>
>> I'm implementing a tap to read Hive ORC ACID data into Cascading jobs and
>> I've hit a couple of issues for a particular scenario. The case I have is
>> when data has been written into a transactional table and a compaction has
>> not yet occurred. This can be recreated like so:
>>
>> CREATE TABLE test_table ( id int, message string )
>>   PARTITIONED BY ( continent string, country string )
>>   CLUSTERED BY (id) INTO 1 BUCKETS
>>   STORED AS ORC
>>   TBLPROPERTIES ('transactional' = 'true');
>>
>> INSERT INTO TABLE test_table
>> PARTITION (continent = 'Asia', country = 'India')
>> VALUES (1, 'x'), (2, 'y'), (3, 'z');
>>
>>
>> This results in a dataset that contains only a delta file:
>>
>>
>> warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000
>>
>>
>> I'm assuming that this scenario is valid - a user might insert new data
>> into a table and want to read it back at a time prior to the first
>> compaction. I can select the data back from this table in Hive with no
>> problem. However, for a number of reasons I'm finding it rather tricky to
>> do so programmatically. At this point I should mention that reading base
>> files or base+deltas is trouble free. The issues I've encountered are as
>> follows:
>>
>>    1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
>>    ReaderOptions) fails if the directory specified by the path ('
>>    warehouse/test_table/continent=Asia/country=India' in this case)
>>    contains only a delta. Specifically it attempts to access
>>    'delta_0000060_0000060' as if it were a file and therefore fails. It
>>    appears to function correctly if the directory also contains a base. We use
>>    this method to extract the typeInfo from the ORCFile and build a mapping
>>    between the user's declared fields.
>>    2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
>>    inconsistent in that it returns the path of the base if present, otherwise
>>    the parent. This presents issues within cascading (and I assume other
>>    frameworks) that expect the paths returned by splits to be at the same
>>    depth and for them to contain some kind of 'part' file leaf. In my example
>>    the path returned is '
>>    warehouse/test_table/continent=Asia/country=India', if I had also had
>>    a base I'd have seen '
>>    warehouse/test_table/continent=Asia/country=India/base_0000006'.
>>    3. The footers of the delta files do not contain the true field names
>>    of the table. In my example I see '_col0:int,_col1:string' where I'd
>>    expect 'id:int,message:string'. A base file, if present, correctly
>>    declares the field names. We chose to access values by field name rather
>>    than position so that users of our reader do not need to declare the full
>>    schema to read partial data, however this behaviour trips this up.
>>
>> I have (horrifically :) worked around issues 1 and 2 in my own code and
>> have some ideas to circumvent 3 but I wanted to get a feeling as to whether
>> I'm going against the tide and if my life might be easier if I approached
>> this another way.
>>
>> Thanks - Elliot.
>>
>>
>>
