orc-dev mailing list archives

From Kyle Dunn <kdunn...@gmail.com>
Subject Seeking to and reading stripe data
Date Fri, 22 Apr 2016 00:20:03 GMT
I'm trying to implement a parallel ORC reader with WebHDFS using the C++
client library. My process is as follows:
1) Download the 16 kB tail using the WebHDFS offset and length parameters
2) From the tail, determine the offsets and lengths of the stripes of interest
3) Use the stripe information from 2) as WebHDFS offset and length parameters
to read the data sections into a local file
4) Append the tail to that local file
5) Use the ORC C++ Reader to print the contents of the local file
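
As a rough illustration of the ranged reads in steps 1) to 3), here is a
minimal sketch of how the byte ranges and request URLs might be built.
WebHDFS OPEN accepts `offset` and `length` query parameters; everything
else here is an assumption for illustration: the helper names `rangeUrl`
and `tailRange`, the `ByteRange` struct, and the host, port, and path
used in the usage note. The file size is assumed to be known already
(for example from a GETFILESTATUS call).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical type: a contiguous byte range inside the remote file.
struct ByteRange {
  std::uint64_t offset;
  std::uint64_t length;
};

// Hypothetical helper: builds a WebHDFS OPEN URL for one byte range.
// `base` is assumed to look like "http://<host>:<port>/webhdfs/v1" and
// `path` is assumed to be an absolute HDFS path.
std::string rangeUrl(const std::string& base, const std::string& path,
                     const ByteRange& range) {
  assert(!path.empty() && path[0] == '/');
  return base + path + "?op=OPEN&offset=" + std::to_string(range.offset) +
         "&length=" + std::to_string(range.length);
}

// Range covering the last `tailSize` bytes of the file (16 kB by default,
// as in step 1), clamped so that small files yield the whole file.
ByteRange tailRange(std::uint64_t fileSize,
                    std::uint64_t tailSize = 16 * 1024) {
  const std::uint64_t len = fileSize < tailSize ? fileSize : tailSize;
  return {fileSize - len, len};
}
```

For a 100000-byte file, `tailRange(100000)` gives offset 83616 and
length 16384, and `rangeUrl("http://nn:50070/webhdfs/v1", "/data/a.orc",
tailRange(100000))` yields a URL ending in
`?op=OPEN&offset=83616&length=16384`. A stripe request would use the
offset and length reported for that stripe in place of the tail range.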

I'd like to clarify a couple of items:

1) The Hive configuration parameter "orc.stripe.size" seems to suggest the
stripe size is configurable, but constant for all stripes in a given file.
Can someone clarify this? Is orc.stripe.size an upper bound?

2) The Reader class in the C++ client lets me determine the byte offset
and length of a given stripe. Yet when I download only that byte range
and append the tail to it, I get Zlib logic exceptions while
deserializing the data. I've also seen an exception related to a buffer
being undersized.
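
To make the layout arithmetic behind this partial download explicit, the
sketch below compares where a stripe ends up in the local file with the
offset recorded for it in the tail. It assumes that stripe offsets in the
ORC footer are absolute positions in the original file, which is how the
ORC specification describes them, and that the tail is copied unchanged.
The names `Placement`, `concatenated`, `preserving`, and
`recordedOffsetMatches` are hypothetical, and the sketch does not
establish the cause of the exceptions described above.

```cpp
#include <cstdint>

// Hypothetical type: where the stripe and the tail sit in the local file.
struct Placement {
  std::uint64_t stripeAt;   // local offset of the first stripe byte
  std::uint64_t tailAt;     // local offset of the first tail byte
  std::uint64_t totalSize;  // size of the local file
};

// Layout described in the post: stripe bytes first, tail appended after.
Placement concatenated(std::uint64_t stripeLen, std::uint64_t tailLen) {
  return {0, stripeLen, stripeLen + tailLen};
}

// Alternative layout that keeps both pieces at their original positions,
// leaving the bytes in between unwritten. Assumes the stripe range and
// the tail range do not overlap.
Placement preserving(std::uint64_t fileSize, std::uint64_t stripeOffset,
                     std::uint64_t tailLen) {
  return {stripeOffset, fileSize - tailLen, fileSize};
}

// True when a reader seeking to the offset recorded in the unchanged tail
// would land on the first byte of the stripe in the local file.
bool recordedOffsetMatches(const Placement& p, std::uint64_t recordedOffset) {
  return p.stripeAt == recordedOffset;
}
```

With a stripe recorded at offset 3 (directly after the "ORC" magic),
`recordedOffsetMatches(concatenated(1000, 16384), 3)` is false, while
`recordedOffsetMatches(preserving(100000, 3, 16384), 3)` is true.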

Is there something I'm missing? Do I need to rewrite the tail? Specify an
offset in the ORC Reader class as well?

Thanks in advance for the help,
