hadoop-hdfs-user mailing list archives

From Luke Lu <...@apache.org>
Subject Re: Introducing Parquet: efficient columnar storage for Hadoop.
Date Tue, 12 Mar 2013 22:40:18 GMT
IMO, it would be enlightening for Hadoop users to compare Parquet with Trevni
and ORCFile, all of which are relatively new columnar formats for Hadoop.
Do we really need three columnar formats?

On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:

> Fellow Hadoopers,
> We'd like to introduce a joint project between Twitter and Cloudera
> engineers -- a new columnar storage format for Hadoop called Parquet (
> http://parquet.github.com).
> We created Parquet to make the advantages of compressed, efficient columnar
> data representation available to any project in the Hadoop ecosystem,
> regardless of the choice of data processing framework, data model, or
> programming language.
> Parquet is built from the ground up with complex nested data structures in
> mind. We adopted the repetition/definition level approach to encoding such
> data structures, as described in Google's Dremel paper; we have found this
> to be a very efficient method of encoding data in non-trivial object
> schemas.
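To make the repetition/definition level idea above concrete, here is a minimal Python sketch for the simplest possible case: a schema with one top-level repeated integer field, so both maximum levels are 1. The function names `shred_repeated` and `assemble_repeated` are hypothetical and are not part of the Parquet code; real nested schemas need deeper levels than this sketch handles.

```python
from typing import List, Tuple


def shred_repeated(records: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Shred records holding one repeated int field into a single column.

    Returns (values, repetition_levels, definition_levels).
    Assumes max repetition level == max definition level == 1.
    """
    values: List[int] = []
    rep: List[int] = []
    dfn: List[int] = []
    for record in records:
        if not record:
            # Empty list: no value is stored, but one level pair marks the record.
            rep.append(0)
            dfn.append(0)
            continue
        for i, v in enumerate(record):
            rep.append(0 if i == 0 else 1)  # 0 starts a new record, 1 continues it
            dfn.append(1)                   # the field is defined at this position
            values.append(v)
    return values, rep, dfn


def assemble_repeated(values: List[int], rep: List[int], dfn: List[int]) -> List[List[int]]:
    """Rebuild the original records from the column and its two level streams."""
    records: List[List[int]] = []
    it = iter(values)
    for r, d in zip(rep, dfn):
        if r == 0:
            records.append([])              # repetition level 0 opens a new record
        if d == 1:
            records[-1].append(next(it))    # a value exists only when fully defined
    return records
```

For example, `shred_repeated([[1, 2], [], [3]])` stores only the three values plus two small integer streams, and `assemble_repeated` recovers the empty record from the level pair alone.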
> Parquet is built to support very efficient compression and encoding
> schemes. Parquet allows compression schemes to be specified on a per-column
> level, and is future-proofed to allow adding more encodings as they are
> invented and implemented. We separate the concepts of encoding and
> compression, allowing Parquet consumers to implement operators that work
> directly on encoded data without paying a decompression and decoding
> penalty when possible.
> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
> data processing frameworks, and we are not interested in playing favorites.
> We believe that an efficient, well-implemented columnar storage substrate
> should be useful to all frameworks without the cost of extensive,
> difficult-to-set-up dependencies.
> The initial code, available at https://github.com/Parquet, defines the file
> format, provides Java building blocks for processing columnar data, and
> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
> of a complex integration -- Input/Output formats that can convert
> Parquet-stored data directly to and from Thrift objects.
> A preview version of Parquet support will be available in Cloudera's Impala
> 0.7.
> Twitter is starting to convert some of its major data sources to Parquet in
> order to take advantage of the compression and deserialization savings.
> Parquet is currently under heavy development. Parquet's near-term roadmap
> includes:
> * Hive SerDes (Criteo)
> * Cascading Taps (Criteo)
> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
> data (Cloudera and Twitter)
> * Further improvements to Pig support (Twitter)
> Company names in parentheses indicate whose engineers signed up to do the
> work -- others can feel free to jump in too, of course.
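For readers unfamiliar with the three encodings named in the roadmap above, the following Python sketch shows the general technique behind each one. It is illustrative only: the function names are hypothetical, the zigzag helpers assume values fit in a signed 64-bit range, and none of this is meant to reflect Parquet's eventual on-disk layout.

```python
from typing import Hashable, List, Sequence, Tuple


def zigzag_encode(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small (0,-1,1,-2 -> 0,1,2,3)."""
    return (n << 1) ^ (n >> 63)  # assumes n fits in a signed 64-bit integer


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def rle_encode(seq: Sequence[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[list] = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs: Sequence[Tuple[Hashable, int]]) -> list:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: list = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def dictionary_encode(seq: Sequence[Hashable]) -> Tuple[list, List[int]]:
    """Replace each value with an index into a list of distinct values."""
    dictionary: list = []
    index: dict = {}
    codes: List[int] = []
    for x in seq:
        if x not in index:
            index[x] = len(dictionary)
            dictionary.append(x)
        codes.append(index[x])
    return dictionary, codes
```

These compose naturally: a low-cardinality string column can be dictionary-encoded into small integer codes, and long runs of the same code then shrink further under run-length encoding.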
> We've also heard requests to provide an Avro container layer, similar to
> what we do with Thrift. Seeking volunteers!
> We welcome all feedback, patches, and ideas; to foster community
> development, we plan to contribute Parquet to the Apache Incubator when the
> development is further along.
> Regards,
> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
> Jonathan Coveney, and friends.
