arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Streaming use cases
Date Mon, 29 Jun 2020 21:24:31 GMT
On Mon, Jun 29, 2020 at 4:15 PM Cindy McMullen <cmcmullen@twitter.com> wrote:
>
> Hi, Wes -
>
> Yes, we're using Java/Scala, but also have a good Python code base for our data scientists.
> Our goal is to replace storage/representation of Thrift for ML features with some more OSS-friendly
> format, such as Parquet or Avro, and avoid writing multiple adapters.
>
> Ideally, we could stream data from Parquet disk in batches into Arrow-compatible consumers.
> Is this a reasonable fit for something like Arrow Flight?

Yes, Flight is definitely designed for that -- fast / efficient
delivery of Arrow record batches over TCP.

>
> On Mon, Jun 29, 2020 at 2:37 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> hi Cindy,
>>
>> Could you clarify which PL you are working in (though assuming Scala /
>> Java judging by your e-mail address)?
>>
>> In C++ we have reasonably mature Parquet->Arrow reading but not yet
>> conversion from Arrow to Avro. In Java, I am not sure what is the
>> state of the art for getting Parquet into Arrow but this code does not
>> live in Apache Arrow -- I know that Apache Iceberg has done some work
>> around this but I'm not sure how consumable it is as a library.
>> Java-Arrow does have some preliminary support for converting Arrow to
>> Avro, I believe. So there's some engineering here to do in any case.
>>
>> best,
>> Wes
>>
>> On Mon, Jun 29, 2020 at 2:45 PM Cindy McMullen <cmcmullen@twitter.com> wrote:
>> >
>> > Can I use Arrow to stream data from a Parquet file source and consume it via Avro?
