orc-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: ORC double encoding optimization proposal
Date Mon, 26 Mar 2018 04:59:06 GMT

> Since Split creates two separated streams, reading one data batch will need an additional
> seek in order to reconstruct the column data

If you are seeing a seek like that, we've messed up something else higher up in the pipeline
& that can be fixed.

ORC columnar reads only do random IO at the column level, not the stream level (except for
non-column streams like the bloom filters) - adjacent streams are read together as a single
IO op.

DiskRangeList produces a merged read plan before firing off any read, so the actual IO layer
will (or should) never do a seek between adjacent streams.
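
To make the merged read plan concrete, here is a minimal sketch of the idea in Python. It is
not the actual DiskRangeList code: the function name merge_adjacent_ranges, the max_gap
parameter and the byte offsets are all made up for illustration. It only shows how byte ranges
that touch each other collapse into a single read.

```python
from typing import List, Tuple

# (start, end) byte offsets within the file, end exclusive.
Range = Tuple[int, int]


def merge_adjacent_ranges(ranges: List[Range], max_gap: int = 0) -> List[Range]:
    """Coalesce ranges that overlap, touch, or sit within max_gap bytes of each other.

    Illustrative only: max_gap is a hypothetical knob, not an ORC setting.
    """
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of adding a new read.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical layout: two adjacent streams of one column, then a distant column.
streams = [(100, 180), (180, 260), (4096, 4200)]
plan = merge_adjacent_ranges(streams)
print(plan)
```

With this layout the two adjacent streams end up in one range, so the IO layer would issue two
reads rather than three. If the second stream started even one byte later, the default
max_gap of 0 would keep the ranges separate.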

There's a possibility that someone adds an extra byte or so to a stream that is never read,
which might be a problem.

In early 2016 Rajesh & I went through each read IOP and tuned ORC for S3, which performs
very poorly if you add irrelevant seeks.

If you do find a similar case in Apache ORC (not Hive-orc), I'll file a corresponding ticket
to this


That was actually about reading 2 columns with an entirely NULL column in the middle, not
exactly about splitting streams.

The next giant leap of IO performance for seeks is expected from a new HDFS API, which allows
for the scatter-gather to be pushed-down further into the IO layer.


This is mainly intended for reading ORC files from Erasure coded streams, where the IO layer
can reorganize and align the reads along the Erasure Coding boundaries (not so much about
actual IOPs), instead of assuming normal read-ahead for the block reader.

