arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Java Parquet to Arrow Conversion
Date Tue, 25 Aug 2020 17:33:45 GMT
ARROW-1644 and its children.

On Tue, Aug 25, 2020 at 10:30 AM Anoop Johnson <anoop.k.johnson@gmail.com>
wrote:

> Thanks Micah. Is there a Jira or pull request I could follow for the C++
> implementation for arbitrary nesting? How about maps?
>
> On Tue, Aug 25, 2020 at 9:10 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
>> Also does the C++ Parquet to Arrow reader have any such limitations?
>>
>>
>> The C++ implementation can currently either read nested structs or nested
>> lists but not a combination of the two.  It is actively being worked on to
>> be able to handle arbitrary nesting.
>>
>> On Tue, Aug 25, 2020 at 1:15 AM Anoop Johnson <anoop.k.johnson@gmail.com>
>> wrote:
>>
>>> If I read the Iceberg vectorized reader code right, it does not support
>>> nested types (same limitation as Spark's built-in vectorized parquet
>>> reader). Is that correct? Also does the C++ Parquet to Arrow reader have
>>> any such limitations?
>>>
>>> On Wed, Aug 19, 2020 at 9:37 AM Jacques Nadeau <jacques@apache.org>
>>> wrote:
>>>
>>>> I believe there is code in the iceberg project to do this in pure Java
>>>> [1]. Right now, there isn't a pure java implementation in the Arrow project.
>>>>
>>>> [1]
>>>> https://github.com/apache/iceberg/tree/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized
>>>>
>>>> On Wed, Aug 19, 2020 at 5:18 AM Chris Nuernberger <chris@techascent.com>
>>>> wrote:
>>>>
>>>>> Also, javacpp has prepackaged C++ bindings to arrow for multiple OS's:
>>>>>
>>>>> http://bytedeco.org/javacpp-presets/arrow/apidocs/
>>>>>
>>>>> We have had success with javacpp
>>>>> <https://github.com/techascent/tech.opencv> in the past and it is
>>>>> much better now that their preprocess is based on Clang.
>>>>>
>>>>> On Tue, Aug 18, 2020 at 4:16 PM Chris Nuernberger <
>>>>> chris@techascent.com> wrote:
>>>>>
>>>>>> Thanks, that is helpful.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On Tue, Aug 18, 2020 at 10:24 AM Micah Kornfield <
>>>>>> emkornfield@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Chris,
>>>>>>> There is an open PR to support this through C++'s Dataset
>>>>>>> functionality [1]. There was also a prior attempt that went stale,
>>>>>>> which I can't find at the moment.
>>>>>>>
>>>>>>> IIUC the main missing component at this point before the PR gets
>>>>>>> merged is integration to honor "-XX:MaxDirectMemorySize" settings.
>>>>>>>
>>>>>>> -Micah
>>>>>>>
>>>>>>> [1] https://github.com/apache/arrow/pull/7030
>>>>>>>
>>>>>>> On Tue, Aug 18, 2020 at 6:48 AM Chris Nuernberger <
>>>>>>> chris@techascent.com> wrote:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> We were wondering what the best way to convert a parquet file to an
>>>>>>>> arrow file would be via a java pathway.  I notice that the c++ layer
>>>>>>>> appears to have this conversion.
>>>>>>>>
>>>>>>>> The best hint I have seen so far is this gist:
>>>>>>>>
>>>>>>>> https://gist.github.com/animeshtrivedi/76de64f9dab1453958e1d4f8eca1605f
>>>>>>>>
>>>>>>>> I also found this jni pathway for ORC files:
>>>>>>>> https://github.com/apache/arrow/tree/master/cpp/src/jni
>>>>>>>>
>>>>>>>> Another thought I had was to use JNA or JNR and bind to the C
>>>>>>>> glib pathway.
>>>>>>>>
>>>>>>>> Thanks for any help,
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>
