orc-dev mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Documentations issues
Date Fri, 16 Jun 2017 19:19:31 GMT
Recently I have been working on a custom writer for Presto, and along the way I kept notes on
sections of the documentation that might have problems. Some of these may have already been

## Compression
see https://orc.apache.org/docs/compression.html

I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].  Also, it is not clear
if compressed length is 2 bytes, or .
Each header is 3 bytes long with (compressedLength * 2 + isOriginal) stored as a little endian
value.   For example, the header for a chunk that compressed to 100,000 bytes would be [0x40,
0x0d, 0x03]. The header for 5 bytes that did not compress would be [0x0b, 0x00, 0x00]. 
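
To make the quoted formula concrete, here is a minimal sketch that evaluates (compressedLength * 2 + isOriginal) and writes it as three little-endian bytes. It assumes isOriginal is 0 for a compressed chunk and 1 for a chunk stored uncompressed; the class and method names are hypothetical and not part of any ORC API.

```java
// Sketch of the 3-byte chunk header described in the quoted text.
// ChunkHeaderSketch, encode and hex are illustrative names, not ORC APIs.
final class ChunkHeaderSketch {
    // Encodes (length * 2 + isOriginal) as 3 little-endian bytes.
    static byte[] encode(int length, boolean isOriginal) {
        if (length < 0 || length >= (1 << 23)) {
            throw new IllegalArgumentException("length must fit in 23 bits: " + length);
        }
        int value = length * 2 + (isOriginal ? 1 : 0);
        return new byte[] {
            (byte) (value & 0xff),          // least significant byte first
            (byte) ((value >>> 8) & 0xff),
            (byte) ((value >>> 16) & 0xff)
        };
    }

    // Formats bytes like the documentation does, e.g. [0x40, 0x0d, 0x03].
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(String.format("0x%02x", bytes[i] & 0xff));
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(100_000, false))); // compressed chunk
        System.out.println(hex(encode(100_000, true)));  // same length, stored as original
        System.out.println(hex(encode(5, true)));        // 5 bytes that did not compress
    }
}
```

Under that assumption the formula gives [0x40, 0x0d, 0x03] for 100,000 compressed bytes and [0x0b, 0x00, 0x00] for 5 original bytes; a first byte of 0x41 only appears when the isOriginal bit is set for the 100,000-byte chunk.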

This section is not clear:
The default compression chunk size is 256K, but writers can choose their own value less than
Should that be 223K?  If so, that seems strange since I would assume any value smaller
than 256K is legit.
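
One possible reading, offered only as an assumption: if the upper bound comes from the 3-byte chunk header quoted in the Compression section, then one of its 24 bits is the isOriginal flag and 23 bits remain for the length. The sketch below just works out that arithmetic; the class name and constants are hypothetical.

```java
// Arithmetic sketch: the largest length a 3-byte header can carry when
// one bit is reserved for the isOriginal flag. Names are illustrative.
final class ChunkLengthLimitSketch {
    static final int HEADER_BITS = 3 * 8; // 3-byte header
    static final int FLAG_BITS = 1;       // isOriginal

    // Largest length representable in the remaining bits.
    static int maxChunkLength() {
        return (1 << (HEADER_BITS - FLAG_BITS)) - 1;
    }

    public static void main(String[] args) {
        System.out.println("max length in header: " + maxChunkLength());
        System.out.println("default chunk size:   " + (256 * 1024));
    }
}
```

That gives 8,388,607 bytes (2^23 - 1) as the header's limit, well above the 256K default of 262,144 bytes.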

## String encodings
see https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns

This first sentence seems to be describing a heuristic used by the default implementation.

## File tail
The docs should make it clear that the maximum length stored for varchar and char is the maximum
number of Unicode characters, and specifically not a byte count and not a count of UTF-16 code
units (which is what Java counts by default).
// the maximum length of the type for varchar or char
 optional uint32 maximumLength = 4;
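
To show why the distinction matters for a Java writer, here is a small sketch comparing the three possible counts for the same string. It uses only standard java.lang.String methods; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Compares three ways of measuring a string's length.
// MaxLengthCountSketch and its methods are illustrative names.
final class MaxLengthCountSketch {
    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of UTF-16 code units, which is what String.length() returns.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes in the UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', U+00E9 (e with acute accent), and U+1F600 (outside the BMP)
        String s = "a\u00e9\uD83D\uDE00";
        System.out.println("code points:  " + codePoints(s));
        System.out.println("UTF-16 units: " + utf16Units(s));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));
    }
}
```

For this string the counts are 3 code points, 4 UTF-16 code units, and 7 UTF-8 bytes, so a writer that checks maximumLength with String.length() would treat the value as one longer than a writer that counts code points.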
