arrow-dev mailing list archives

From: Venkat Krishnamurthy <nivik...@gmail.com>
Subject: Re: Comparing with Parquet
Date: Fri, 26 Feb 2016 01:43:43 GMT
I think they're fundamentally orthogonal.

Tachyon provides a full-fledged storage system that uses memory as one of
its tiers, but it leaves the representation of data up to applications: an
application that wants to use Tachyon has to decide for itself how to
serialize its data structures.

Arrow, OTOH, leaves storage management up to applications. It is an
in-memory data layout that exploits vectorization, and (today) it provides a
library for defining the data structures that analytics frameworks need to
share, ideally with minimal overhead.
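
To make "in-memory data layout" a bit more concrete, here is a minimal
sketch in plain Python. It uses only the standard-library array module, not
any Arrow library, and every name in it (rows, ids, prices, value_at,
column_sum) is hypothetical. It is not Arrow's actual format; it only
illustrates the general idea of one contiguous, fixed-width buffer per
column that can be scanned, indexed, and shared without copying.

```python
from array import array

# Row-oriented input: one Python tuple per record (illustrative data only).
rows = [(1, 10.5), (2, 20.0), (3, 7.25), (4, 12.0)]

# Column-oriented layout: one contiguous, fixed-width buffer per field.
ids = array("q", (r[0] for r in rows))     # 64-bit signed integers
prices = array("d", (r[1] for r in rows))  # 64-bit floats


def value_at(column, i):
    """Random access: with fixed-width values this is a plain index lookup."""
    return column[i]


def column_sum(column):
    """Scan a single column without touching any other field."""
    return sum(column)


# A memoryview exposes the same underlying buffer without copying it,
# which is the kind of low-overhead sharing described above.
view = memoryview(prices)
```

How that buffer is stored or persisted (local memory, Tachyon, files on
disk) is a separate decision, which is the sense in which the two are
orthogonal.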



On Thu, Feb 25, 2016 at 5:11 PM, Pedro Miguel Duarte <
duarte.gelvez.pedromiguel@gmail.com> wrote:

> I was wondering if someone could also elaborate on the comparison with
> Tachyon (now called Alluxio).
> On Feb 25, 2016 5:08 PM, "Chenliang (Liang, DataSight)" <
> chenliang613@huawei.com> wrote:
>
> > I agree with Henry Robinson's points.
> >
> > In addition, Arrow is suitable for exchanging data very efficiently, but
> > it may only support data sizes up to the TB level. Parquet can support
> > much larger data, but its performance can't support fast queries.
> >
> > So for PB-level data with interactive (second-level) queries, can
> > neither of them solve the problem?
> >
> > Regards
> > Liang
> > -----Original Message-----
> > From: Henry Robinson [mailto:henry@cloudera.com]
> > Sent: February 26, 2016 0:20
> > To: dev@arrow.apache.org
> > Subject: Re: Comparing with Parquet
> >
> > Think of Parquet as a format well-suited to writing very large datasets to
> > disk, whereas Arrow is a format most suited to efficient storage in memory.
> > You might read Parquet files from disk, and then materialize them in memory
> > in Arrow's format.
> >
> > Both formats are designed around the idiosyncrasies of the target medium:
> > Parquet is not designed to support efficient random access because disks
> > aren't good at that, but Arrow has fast random access as a core design
> > principle, to give just one example.
> >
> > Henry
> >
> > > On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <
> > > sourav.mazumder00@gmail.com> wrote:
> > >
> > > Hi All,
> > >
> > > > I'm new to this and still trying to figure out where exactly Arrow fits
> > > > in the ecosystem of various Big Data technologies.
> > >
> > > > In that respect, the first thing that came to my mind is how Arrow
> > > > compares with Parquet.
> > >
> > > > In my understanding Parquet also supports a very efficient columnar
> > > > format (with support for nested structures). It is already embraced
> > > > (supported) by various technologies like Impala (origin), Spark, Drill,
> > > > etc.
> > >
> > > > The only thing I see missing in Parquet is support for SIMD-based
> > > > vectorized operations.
> > >
> > > > Am I right, or am I missing many other differences between Arrow and
> > > > Parquet?
> > >
> > > Regards,
> > > Sourav
> >
>
