orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: [DISCUSS] ORC 2.0
Date Fri, 04 Aug 2017 17:40:56 GMT
+1 to all of the ideas


If we are cool with incompatible changes…
 * Allow dictionary for VARBINARY
 * Disallow old encodings in new files (e.g., no v1)
 * Fix DATE encoding epoch
 * Rearrange the stripe so the index is next to the footer, so a single IOP can get all data
 * Change metastore properties so there is a logical mapping from column names to physical
   column identifiers, so columns can be renamed
 * New timestamp encoding with a fixed size per file, similar to decimal
 * For compression like zstd, we may want to ship a compression dictionary for a stream
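
To make the column-rename idea concrete, here is a minimal sketch, assuming the mapping is just a table from logical names to stable physical ids. ColumnMapping and its methods are hypothetical names for illustration, not an existing ORC or metastore API.

```python
class ColumnMapping:
    """Hypothetical logical-name -> physical-column-id table (illustration only)."""

    def __init__(self, names):
        # Physical ids are assigned once, in file order, and never change.
        self._id_by_name = {name: i for i, name in enumerate(names)}

    def rename(self, old, new):
        # A rename only rewrites the mapping; the file data is untouched.
        if new in self._id_by_name:
            raise ValueError(f"column {new!r} already exists")
        self._id_by_name[new] = self._id_by_name.pop(old)

    def physical_id(self, name):
        return self._id_by_name[name]


mapping = ColumnMapping(["id", "name"])
mapping.rename("name", "full_name")
```

Readers would look up columns by physical id, so files written before the rename stay readable without a rewrite.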

Stuff we could do today
 * A flag that says if CHAR or VARCHAR contain any multibyte characters (isAsciiOnly)
 * Max character count for CHAR or VARCHAR (so we don’t need to check length for schema
   changes)
 * Max length for VARBINARY (easier to estimate memory usage)
 * Truncated MIN/MAX for VARBINARY/CHAR/VARCHAR
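
A rough sketch of how these statistics could be computed, assuming byte-wise ordering of UTF-8 values. The function names and dictionary keys are made up for illustration and are not part of the ORC format, and a real implementation for CHAR/VARCHAR would need to truncate on a character boundary rather than a byte boundary.

```python
def truncated_min(value: bytes, limit: int) -> bytes:
    # A prefix always sorts <= the full value, so it is a valid lower bound.
    return value[:limit]


def truncated_max(value: bytes, limit: int):
    # Returns an upper bound of at most `limit` bytes, or None if none exists.
    if len(value) <= limit:
        return value
    prefix = bytearray(value[:limit])
    # Drop trailing 0xFF bytes, since they cannot be incremented.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return None
    # Bumping the last byte makes the prefix sort above the full value.
    prefix[-1] += 1
    return bytes(prefix)


def string_stats(values, limit):
    # Hypothetical per-column statistics for a non-empty list of strings.
    encoded = [v.encode("utf-8") for v in values]
    return {
        "min": truncated_min(min(encoded), limit),
        "max": truncated_max(max(encoded), limit),
        "is_ascii_only": all(v.isascii() for v in values),
        "max_char_count": max(len(v) for v in values),
        "max_byte_length": max(len(e) for e in encoded),
    }


stats = string_stats(["apple", "banana", "cherry pie"], 4)
```

The truncated bounds are no longer exact values from the column, so they can only be used for pruning, not for answering MIN/MAX queries directly.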

For the new encodings, we should pick encodings that play well with vectorization, which is
coming in Java 10 (Java 9 also has vastly improved auto-vectorization).

-dain

> On Aug 4, 2017, at 9:29 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> All,
>  We've started the process of updating the encodings for ORC. These
> changes are going to extend the format in ways that aren't forward
> compatible (e.g., the ORC 1.4 readers won't be able to read the new format).
> 
> The changes that I've heard about are:
> * Decimal encoding - this will likely be separated into two categories
>   + precision <= 18
>   + precision > 18
>  In both cases the precision and scale will be fixed for the entire file
> rather than per value.
> * a new Float/Double encoding
> * a new RLE encoding
> 
> Are there other encodings that we should consider adding?
> 
> We haven't made forward incompatible changes in a while. Currently the ORC
> Writer can write either:
> * Hive 0.11 ORC files
> * Hive 0.12 ORC files
> 
> So I'd like to propose that we add a new ORC 2.0 file version and all of
> these changes need to be so tagged.
> 
> The new ORC writers will maintain the ability to write the old versions of
> the files (Hive 0.11 ORC and Hive 0.12 ORC) as well as the ORC 2.0 files.
> The new reader will automatically read all three versions.
> 
> Thoughts?
> 
>  Owen
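
On the decimal split in the quoted message: one likely reason for the precision 18 boundary is that any unscaled value with up to 18 digits fits in a signed 64-bit integer, while 19 digits can overflow it. The sketch below checks that and shows a fixed-scale conversion; fits_in_int64 and to_unscaled are hypothetical helpers, not the proposed encoding itself.

```python
from decimal import Decimal

INT64_MAX = 2**63 - 1


def fits_in_int64(precision: int) -> bool:
    # The largest unscaled value with `precision` digits is 10**precision - 1.
    return 10**precision - 1 <= INT64_MAX


def to_unscaled(value: Decimal, scale: int) -> int:
    # With a file-wide scale, each value is stored as value * 10**scale.
    scaled = value.scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    return int(scaled)


unscaled = to_unscaled(Decimal("12.34"), 2)
```

With the scale fixed per file, only the integer needs to be stored for each value.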

