hadoop-hdfs-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Introducing Parquet: efficient columnar storage for Hadoop.
Date Wed, 13 Mar 2013 17:25:04 GMT
Hi folks,
Thanks for your interest. The Cloudera blog post has a few additional
bullet points about the difference between Trevni and Parquet:


On Tue, Mar 12, 2013 at 3:40 PM, Luke Lu <llu@apache.org> wrote:

> IMO, it'll be enlightening to Hadoop users to compare Parquet with Trevni
> and ORCFile, all of which are relatively new columnar formats for Hadoop.
> Do we really need 3 columnar formats?
> On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
>> Fellow Hadoopers,
>> We'd like to introduce a joint project between Twitter and Cloudera
>> engineers -- a new columnar storage format for Hadoop called Parquet (
>> http://parquet.github.com).
>> We created Parquet to make the advantages of compressed, efficient columnar
>> data representation available to any project in the Hadoop ecosystem,
>> regardless of the choice of data processing framework, data model, or
>> programming language.
>> Parquet is built from the ground up with complex nested data structures in
>> mind. We adopted the repetition/definition level approach to encoding such
>> data structures, as described in Google's Dremel paper; we have found this
>> to be a very efficient method of encoding data in non-trivial object
>> schemas.
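>> To make the repetition/definition level idea concrete, here is a small,
>> self-contained sketch (illustrative only -- the names and layout are invented
>> for this example and are not Parquet's actual API) of how one nested,
>> optional, repeated field is striped into a flat column of values plus levels,
>> and how record boundaries are recovered from the repetition levels:
>>
>>   // Schema assumed for the example:
>>   //   message Doc { repeated group names { optional binary first; } }
>>   // Column "names.first": max repetition level = 1, max definition level = 2.
>>   public class RepDefLevelSketch {
>>     public static void main(String[] args) {
>>       // Records: {names:[{first:"A"},{first:"B"}]}, {names:[]}, {names:[{}]}
>>       String[] values = {"A", "B", null, null};
>>       int[]    rep    = { 0,   1,   0,    0  };  // 0 = value starts a new record
>>       int[]    def    = { 2,   2,   0,    1  };  // how much of the path is defined
>>       for (int i = 0; i < values.length; i++) {
>>         if (rep[i] == 0) System.out.println("-- new record --");
>>         // def == 2: the value itself is present; def == 1: a names entry with
>>         // no first; def == 0: the record has no names entries at all.
>>         System.out.println(def[i] == 2 ? values[i] : "null (def=" + def[i] + ")");
>>       }
>>     }
>>   }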
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified on a per-column
>> basis, and is future-proofed to allow adding more encodings as they are
>> invented and implemented. We separate the concepts of encoding and
>> compression, allowing Parquet consumers to implement operators that work
>> directly on encoded data without paying a decompression and decoding
>> penalty when possible.
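>> As a rough illustration of what "operators that work directly on encoded
>> data" can look like (the dictionary/id layout below is invented for the
>> example, not Parquet's internal representation): with a dictionary-encoded
>> string column, a filter can be evaluated once per dictionary entry and then
>> applied to the small integer ids, without materializing a string per row:
>>
>>   // Sketch: predicate evaluation over a dictionary-encoded column chunk.
>>   public class DictionaryFilterSketch {
>>     public static void main(String[] args) {
>>       String[] dictionary = {"US", "FR", "JP"};  // distinct values, stored once
>>       int[] ids = {0, 0, 2, 1, 0, 2};            // one small id per row
>>       boolean[] matches = new boolean[dictionary.length];
>>       for (int d = 0; d < dictionary.length; d++) {
>>         matches[d] = dictionary[d].equals("FR"); // predicate runs 3 times, not 6
>>       }
>>       int hits = 0;
>>       for (int id : ids) {
>>         if (matches[id]) hits++;                 // no string comparison per row
>>       }
>>       System.out.println("rows where country = FR: " + hits);
>>     }
>>   }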
>> Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
>> data processing frameworks, and we are not interested in playing
>> favorites.
>> We believe that an efficient, well-implemented columnar storage substrate
>> should be useful to all frameworks without the cost of extensive,
>> difficult-to-set-up dependencies.
>> The initial code, available at https://github.com/Parquet, defines the file
>> format, provides Java building blocks for processing columnar data, and
>> implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
>> of a complex integration -- Input/Output formats that can convert
>> Parquet-stored data directly to and from Thrift objects.
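>> As a hedged sketch of how the Hadoop integration might be wired up (the
>> class name parquet.hadoop.ParquetInputFormat is assumed from the
>> parquet-hadoop module -- check the repository for current names; a
>> read-support/record-materialization class also needs to be configured and is
>> omitted here):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.Path;
>>   import org.apache.hadoop.mapreduce.Job;
>>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>
>>   // Minimal MapReduce driver that reads Parquet files as its input.
>>   public class ReadParquetJobSketch {
>>     public static void main(String[] args) throws Exception {
>>       Job job = Job.getInstance(new Configuration(), "read-parquet");
>>       job.setJarByClass(ReadParquetJobSketch.class);
>>       // Assumed class name; lives in the parquet-hadoop module.
>>       job.setInputFormatClass(parquet.hadoop.ParquetInputFormat.class);
>>       FileInputFormat.addInputPath(job, new Path(args[0]));
>>       // ... mapper, reducer, and output format configured as usual ...
>>       System.exit(job.waitForCompletion(true) ? 0 : 1);
>>     }
>>   }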
>> A preview version of Parquet support will be available in Cloudera's
>> Impala 0.7.
>> Twitter is starting to convert some of its major data sources to Parquet in
>> order to take advantage of the compression and deserialization savings.
>> Parquet is currently under heavy development. Parquet's near-term roadmap
>> includes:
>> * Hive SerDes (Criteo)
>> * Cascading Taps (Criteo)
>> * Support for dictionary encoding, zigzag encoding, and RLE encoding of
>> data (Cloudera and Twitter); a toy sketch of zigzag and RLE appears below
>> * Further improvements to Pig support (Twitter)
>> Company names in parentheses indicate whose engineers signed up to do the
>> work -- others can feel free to jump in too, of course.
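>> For the encoding roadmap item above, here is a toy version of zigzag and
>> run-length encoding (illustrative only, not Parquet's wire format):
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>
>>   public class EncodingSketch {
>>     // Zigzag: map small signed ints to small non-negative ints so they
>>     // pack into few bytes: 0->0, -1->1, 1->2, -2->3, 2->4, ...
>>     static int zigzag(int n) { return (n << 1) ^ (n >> 31); }
>>
>>     // RLE: collapse runs of equal values into (value, runLength) pairs.
>>     static List<int[]> rle(int[] xs) {
>>       List<int[]> runs = new ArrayList<int[]>();
>>       for (int i = 0; i < xs.length; ) {
>>         int j = i;
>>         while (j < xs.length && xs[j] == xs[i]) j++;
>>         runs.add(new int[] {xs[i], j - i});
>>         i = j;
>>       }
>>       return runs;
>>     }
>>
>>     public static void main(String[] args) {
>>       System.out.println(zigzag(-3));                // prints 5
>>       for (int[] run : rle(new int[] {7, 7, 7, 2, 2}))
>>         System.out.println(run[0] + " x" + run[1]);  // 7 x3, then 2 x2
>>     }
>>   }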
>> We've also heard requests to provide an Avro container layer, similar to
>> what we do with Thrift. Seeking volunteers!
>> We welcome all feedback, patches, and ideas; to foster community
>> development, we plan to contribute Parquet to the Apache Incubator when the
>> development is further along.
>> Regards,
>> Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
>> Jonathan Coveney, and friends.
