arrow-dev mailing list archives

From Wes McKinney <...@cloudera.com>
Subject Re: Comparing with Parquet
Date Thu, 25 Feb 2016 17:11:54 GMT
We wrote about this in a recent blog post:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/

"Apache Parquet is a compact, efficient columnar data storage designed
for storing large amounts of data stored in HDFS. Arrow is an ideal
in-memory “container” for data that has been deserialized from a
Parquet file, and similarly in-memory Arrow data can be serialized to
Parquet and written out to a filesystem like HDFS or Amazon S3. Arrow
and Parquet are thus companion projects."

For example, one of my personal motivations for being involved in both
Arrow and Parquet is to use Arrow as the in-memory container for data
deserialized from Parquet for use in Python and R.

- Wes
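
A toy sketch of the idea in this thread: data is stored in a compact encoding and then materialized into a flat in-memory container that supports fast random access. Run-length encoding stands in for the kind of compact encoding an on-disk columnar format may apply. The names `rle_column`, `decode_rle`, and `rle_lookup` are hypothetical, and neither structure is Parquet's actual file layout or Arrow's actual memory layout.

```python
from array import array

# Hypothetical run-length-encoded column: (value, run_length) pairs.
# Compact to store, but the i-th value is not directly addressable.
rle_column = [(7, 3), (42, 2), (9, 4)]


def decode_rle(runs):
    """Materialize the runs into one contiguous array of 64-bit integers."""
    flat = array("q")
    for value, length in runs:
        flat.extend([value] * length)
    return flat


def rle_lookup(runs, index):
    """Random access on the encoded form: walk the runs until one covers index."""
    for value, length in runs:
        if index < length:
            return value
        index -= length
    raise IndexError("index out of range")


# Decode once, then index directly into the fixed-width in-memory array.
in_memory = decode_rle(rle_column)

print(in_memory[4])               # direct lookup by position
print(rle_lookup(rle_column, 4))  # same value, found by scanning the runs
```

Both lookups return the same value. The difference is cost: the decoded array is indexed by position, while the encoded form has to be scanned run by run.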

On Thu, Feb 25, 2016 at 8:20 AM, Henry Robinson <henry@cloudera.com> wrote:
> Think of Parquet as a format well-suited to writing very large datasets to disk, whereas
> Arrow is a format most suited to efficient storage in memory. You might read Parquet files
> from disk, and then materialize them in memory in Arrow's format.
>
> Both formats are designed around the idiosyncrasies of the target medium: Parquet is
> not designed to support efficient random access because disks aren't good at that, but Arrow
> has fast random access as a core design principle, to give just one example.
>
> Henry
>
>> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <sourav.mazumder00@gmail.com> wrote:
>>
>> Hi All,
>>
>> New to this. And still trying to figure out where exactly Arrow fits in the
>> ecosystem of various Big Data technologies.
>>
>> In that respect, the first thing that came to my mind is how Arrow compares
>> with Parquet.
>>
>> In my understanding Parquet also supports a very efficient columnar format
>> (with support for nested structure). It is already embraced (supported) by
>> various technologies like Impala (origin), Spark, Drill etc.
>>
>> The only thing I see missing in Parquet is support for SIMD-based
>> vectorized operations.
>>
>> Am I right, or am I missing many other differences between Arrow and
>> Parquet?
>>
>> Regards,
>> Sourav
