orc-dev mailing list archives

From Denis Mikhalkin <deni...@yahoo.com.INVALID>
Subject Fw: Extending ORC
Date Wed, 14 Jun 2017 08:49:46 GMT

I have been working on a custom compression algorithm for market trading data. The data is
quite large (it runs to petabytes), so storage savings translate into visible cost savings.
I used ORC as the baseline and extended it with custom encoders for different types of data.
The encoders are not meant to replace the standard ORC encoders; they are use-case specific
and exploit known redundancies in the data (e.g. predicting the value of one field from the
others). I was able to achieve a pretty good improvement (about 48%) over standard ORC for
my type of data.
Currently, I have had to fork the ORC library (Java) and maintain my own version, which is
not ideal. Any upstream improvements have to be merged into my fork. It is also hard to
integrate the fork into higher-level frameworks such as Spark, and other people can't use
my work. My actual goal is to be able to use this codec in Databricks.
Looking at the implementation, I thought it would be nice to have some sort of standard
extensibility mechanism as part of ORC (Java): based on the column type, and perhaps some
configuration, one could override the standard "Writer" for certain types. For example,
I have an improved "Timestamp" writer which exploits some patterns in the data (a see-saw
pattern) and which could be applicable to other data as well. It would be nice if I could
replace the standard writer for certain fields without modifying the ORC library, and if
other people could opt in to using my encoder for their data. Ideally, I could simply load
my library alongside the default ORC implementation into Spark and have my "plugins" or
"extensions" automatically discovered and integrated by ORC.
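
The mechanism described above could be sketched roughly as follows. This is only an
illustration under stated assumptions: `EncoderRegistry`, `EncoderProvider`,
`SeeSawTimestampProvider` and the configuration key `orc.ext.timestamp.encoder` are all
hypothetical names and do not exist in ORC. The only real API used is the standard
`java.util.ServiceLoader`, which finds implementations listed under `META-INF/services`
in any jar on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical registry that picks an encoder per column; not part of ORC. */
class EncoderRegistry {

    /** Hypothetical plugin interface that an extension jar would implement. */
    public interface EncoderProvider {
        /** True if this provider should handle the column type under this configuration. */
        boolean supports(String columnType, Map<String, String> conf);

        /** Identifier of the encoder this provider supplies (illustration only). */
        String name();
    }

    /** Placeholder standing in for a custom timestamp writer; it does no encoding. */
    public static class SeeSawTimestampProvider implements EncoderProvider {
        @Override
        public boolean supports(String columnType, Map<String, String> conf) {
            // Opt-in: only used when the (assumed) configuration key asks for it.
            return "timestamp".equals(columnType)
                    && "seesaw".equals(conf.get("orc.ext.timestamp.encoder"));
        }

        @Override
        public String name() {
            return "seesaw-timestamp";
        }
    }

    private final List<EncoderProvider> providers = new ArrayList<>();

    EncoderRegistry() {
        // Discover providers that extension jars declare under META-INF/services.
        // With no such declarations on the classpath this loop adds nothing.
        for (EncoderProvider p : ServiceLoader.load(EncoderProvider.class)) {
            providers.add(p);
        }
    }

    /** Explicit registration, for callers that do not rely on classpath discovery. */
    void register(EncoderProvider p) {
        providers.add(p);
    }

    /** Returns the first matching extension encoder, or "default" for the standard one. */
    String encoderFor(String columnType, Map<String, String> conf) {
        for (EncoderProvider p : providers) {
            if (p.supports(columnType, conf)) {
                return p.name();
            }
        }
        return "default";
    }
}
```

With something like this, a column falls back to the standard writer unless a provider
is both present and enabled through configuration, so loading an extension jar alone
would not change how existing data is written.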
Has anybody thought about anything similar? Would that work? Would it be beneficial? What's
the best way to implement something like that, and where would you start?
