arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Depending on non-released Apache projects (C++ Avro)
Date Thu, 07 Mar 2019 06:49:48 GMT
Thanks for the input, Wes and Uwe. Given that no one from the Avro community
has chimed in, I will try to reach out on their dev mailing list.

Uwe, I'm not sure I understand what type of support/help you are thinking
of.  Could you elaborate a little bit more before I reach out?

-Micah

On Tue, Mar 5, 2019 at 4:53 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> I am OK with that, but if we find ourselves making compromises that
> affect performance or memory efficiency (where possibly invasive
> refactoring may be required) perhaps we should reconsider option #3.
>
> On Tue, Mar 5, 2019 at 11:29 AM Uwe L. Korn <uwelk@xhochy.com> wrote:
> >
> > I'm leaning a bit towards 1) but I would love to get some input from the
> > Avro community as 1) depends also on their side as we will submit some
> > patches upstream that need to be reviewed and someday also released.
> >
> > Are Avro committers subscribed here or should we reach out to them on
> > their ML? Given that we are quite active in the C++ space currently, I feel
> > that we can contribute quite some infrastructure in building and packaging
> > that we do either way for Arrow. This might be quite helpful for a project.
> > We have seen this with Parquet, where much of the development is just
> > happening as it is part of Arrow. (Not suggesting to merge/fork the Avro
> > codebase but just to apply some of the best practices we learned while
> > building Arrow).
> >
> > Uwe
> >
> > On Tue, Mar 5, 2019, at 4:57 PM, Wes McKinney wrote:
> > > I'd be +0.5 in favor of forking in this particular case. Since Avro is
> > > not vectorized (unlike Parquet and ORC) I suspect it may be more
> > > difficult to get the best performance using a general purpose API
> > > versus one that is more specialized to producing Arrow record batches.
> > > Given that there has been relatively light C++ development activity in
> > > Apache Avro and no releases for 2 years, it does give me pause.
> > >
> > > We might want to look at Impala's Avro scanner, they are doing some
> > > LLVM IR cross-compilation also (they're using the Avro C++ library
> > > though)
> > >
> > >
> > > https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner-ir.cc
> > >
> > > https://github.com/apache/impala/blob/master/be/src/exec/hdfs-avro-scanner.cc
> > >
> > > On Tue, Mar 5, 2019 at 1:01 AM Micah Kornfield <emkornfield@gmail.com> wrote:
> > > >
> > > > I'm looking at incorporating Avro in Arrow C++ [1]. It seems that the
> > > > Avro C++ library APIs have improved since the last release. However, it
> > > > is not clear when a new release will be available (I asked on the JIRA
> > > > item for the next release [2] and received no response).
> > > >
> > > > I was wondering if there is a policy governing using other Apache
> > > > projects or how people felt about the following options:
> > > > 1.  Depend on a specific git commit through the third-party library
> > > > system.
> > > > 2.  Copy the necessary source code temporarily to our project, and
> > > > change to using the next release when it is available.
> > > > 3.  Fork the code we need (the main benefit I see here is being able to
> > > > refactor it to avoid having to deal with exceptions, easier integration
> > > > with our IO system and one less 3rd party dependency to deal with).
> > > > 4.  Wait on the 1.9 release before proceeding.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://issues.apache.org/jira/browse/ARROW-1209
> > > > [2] https://issues.apache.org/jira/browse/AVRO-2250
> > >
>
