hive-user mailing list archives

From Elliot West <tea...@gmail.com>
Subject Re: Interpretation of transactional table base file format
Date Mon, 30 Mar 2015 11:03:45 GMT
Ok, so both the source and Javadoc for
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat answer most of these
questions.

Apologies for the spam.

Thanks - Elliot.

On 30 March 2015 at 11:52, Elliot West <teabot@gmail.com> wrote:

> I've been looking at the structure of the ORC files that back transactional
> tables in Hive. After a compaction, I was surprised to find that the base
> file structure is identical to the delta structure:
>
>   struct<
>     operation:int,
>     originalTransaction:bigint,
>     bucket:int,
>     rowId:bigint,
>     currentTransaction:bigint,
>     row:struct<
>       // row fields
>     >
>   >
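
As a point of reference before the questions below, here is a minimal sketch of
that six-field layout. The column positions and operation codes follow the
convention used by OrcRecordUpdater (0 = insert, 1 = update, 2 = delete), but
the constant names are illustrative and worth checking against your Hive
version:

    // Illustrative constants for the ACID event schema shown above; the
    // positions and operation codes mirror OrcRecordUpdater's convention.
    public final class AcidEventSchema {
      // Field positions within the outer struct
      public static final int OPERATION            = 0;
      public static final int ORIGINAL_TRANSACTION = 1;
      public static final int BUCKET               = 2;
      public static final int ROW_ID               = 3;
      public static final int CURRENT_TRANSACTION  = 4;
      public static final int ROW                  = 5;

      // Values of the operation field
      public static final int INSERT_OPERATION = 0;
      public static final int UPDATE_OPERATION = 1;
      public static final int DELETE_OPERATION = 2;

      private AcidEventSchema() {}
    }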
>
> This raises a few questions:
>
>    - How should I interpret the operation and originalTransaction values
>    in these compacted rows?
>    - Are the values in the operation and originalTransaction fields
>    required for the application of later deltas?
>    - Does this structure in any way inhibit the ability to perform partial
>    reads of the row data (i.e. specific columns)?
>    - How does this structure relate to the RecordIdentifier class, which
>    contains only a subset of the metadata fields, and to
>    AcidInputFormat.Options.recordIdColumn(), which seems to imply a metadata
>    column sitting alongside the row columns rather than the nested structure
>    we see in practice? (A sketch of this mapping follows the list.)
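
On the last point, a rough sketch of how the nested metadata relates to the
flat RecordIdentifier view. The three-argument constructor and the idea that
RecordIdentifier carries only (originalTransaction, bucket, rowId) are
assumptions based on the Hive 1.x API, and the helper class itself is
hypothetical:

    import org.apache.hadoop.hive.ql.io.RecordIdentifier;

    // Hypothetical helper: RecordIdentifier addresses a row with just three of
    // the metadata fields; operation and currentTransaction are not part of it.
    public final class RecordIds {
      public static RecordIdentifier fromAcidEvent(long originalTransaction,
                                                   int bucket,
                                                   long rowId) {
        // Assumed constructor: RecordIdentifier(long, int, long)
        return new RecordIdentifier(originalTransaction, bucket, rowId);
      }

      private RecordIds() {}
    }

Read that way, recordIdColumn() would simply expose this identifier as a column
next to the row fields at read time rather than the nested struct stored on
disk, though that is inference rather than something confirmed here.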
>
> I suppose that I might find the answers to some of these myself by simply
> reading in the data with the appropriate input format, which leads me to my
> final question: is there already an input format available that will
> seamlessly and transparently apply any deltas on read (for consuming the
> data in an M/R job, for example)?
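
For what it's worth, a minimal old-mapred-API driver along those lines. The
input and output paths are placeholders, OrcInputFormat is assumed to emit
NullWritable keys with OrcStruct values, and whether deltas are merged
transparently for a plain M/R job should be confirmed against the
OrcInputFormat Javadoc mentioned in the reply above:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
    import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class ReadAcidTable {

      // Dumps each row as text; a real job would work with the OrcStruct fields.
      public static class DumpMapper extends MapReduceBase
          implements Mapper<NullWritable, OrcStruct, NullWritable, Text> {
        public void map(NullWritable key, OrcStruct value,
                        OutputCollector<NullWritable, Text> out, Reporter reporter)
            throws IOException {
          out.collect(NullWritable.get(), new Text(value.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(ReadAcidTable.class);
        job.setJobName("read-acid-table");

        job.setInputFormat(OrcInputFormat.class);  // old mapred API
        FileInputFormat.setInputPaths(job, new Path("/warehouse/db/acid_table"));  // placeholder

        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0);

        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/acid_table_dump"));  // placeholder

        JobClient.runJob(job);
      }
    }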
>
> Apologies for so many questions.
>
> Thanks - Elliot.
>
