orc-user mailing list archives

From "Matamoros, Ronald" <ronald.matamo...@accenture.com>
Subject RE: [External] Re: Creating a Reader from a Java InputStream
Date Wed, 26 Feb 2020 20:19:11 GMT
Hi Owen, 

Thanks a lot for coming up with a solution so quickly; the approach would work for me.
Before commenting on the pull request, I wanted to make sure I understand a couple of
details correctly:

- It is crucial to know the file size beforehand, hence the fileSize parameter. This is due
to the underlying seek/position implementation, correct?
- The 'new Path("foo")' is just a placeholder to satisfy the underlying method signatures. Looking
at the code, it is not needed for anything else, correct?

Ronald Matamoros

From: Owen O'Malley <owen.omalley@gmail.com> 
Sent: Monday, February 24, 2020 4:50 PM
To: Matamoros, Ronald <ronald.matamoros@accenture.com>
Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barquero@accenture.com>; Ortega
Ugalde, Ronny <ronny.ortega.ugalde@accenture.com>
Subject: Re: [External] Re: Creating a Reader from a Java InputStream

Ok, for the short term, what I'd propose is a wrapper that creates a FileSystem that only
knows about the stream that you want to read as an ORC file. Take a look at https://github.com/apache/orc/pull/486

The usage looks like:

FileSystem fs = new StreamWrapperFileSystem(stream, new Path("foo"), fileSize, conf);
try (Reader reader = OrcFile.createReader(new Path("foo"),

Please comment on the pull request, if this matches what you need.

.. Owen

On Thu, Feb 20, 2020 at 8:16 PM Owen O'Malley <owen.omalley@gmail.com> wrote:
If you are using HDFS, we could add an API to allow you to pass in an FSDataInputStream
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html),
which is a subclass of InputStream. That class is returned from Hadoop's FileSystem.open(Path)
(https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path-)
and it does allow the reader to do positioned reads in the stream. I have an additional concern:
there is a set of users who would really like an ORC reader without a dependence on Hadoop,
so I'm hesitant to add yet another Hadoop class to the API.

Let me think about this a bit and come up with a proposal for the new API.

.. Owen

On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald <ronald.matamoros@accenture.com> wrote:
Hi Owen,

We have a custom connector that pulls many different sorts of files from a remote Hadoop/HDFS.
ORC is one of the types we have to support.
Each record from the ORC file will be processed individually at a later stage, so I am implementing
the record extractor (in the middle).
At that later stage, it could be a record from Parquet, ORC, CSV, RDBMS, etc.

The connector already does the work of referencing the path and reading the file into a Java
InputStream.
By the time my record extractor gets the file, it is already an InputStream instance.
My first thought was that, since the InputStream is already available, I might as well use it.
Of course there are performance and memory-usage considerations.
I can always go with the option of writing the stream temporarily to local disk to traverse
the records (especially since these files can be large).
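
A minimal sketch of the temporary-file option described above, using only the JDK. The class name SpooledStream and its methods are hypothetical and are not part of ORC or Hadoop; Path here is java.nio.file.Path, not Hadoop's Path. Copying the stream to disk yields the two things a plain InputStream lacks: the total size and the ability to read at an arbitrary offset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: spools an InputStream to a temp file for positioned reads. */
final class SpooledStream implements AutoCloseable {
    private final Path file;
    private final long size;
    private final FileChannel channel;

    private SpooledStream(Path file, long size, FileChannel channel) {
        this.file = file;
        this.size = size;
        this.channel = channel;
    }

    /** Copies the whole stream to a temp file; the byte count becomes the file size. */
    static SpooledStream spool(InputStream in) {
        try {
            Path file = Files.createTempFile("spooled-", ".orc");
            // createTempFile already created the file, so it must be replaced.
            long size = Files.copy(in, file, StandardCopyOption.REPLACE_EXISTING);
            FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
            return new SpooledStream(file, size, channel);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long size() { return size; }

    Path path() { return file; }

    /** Reads exactly length bytes starting at position, independent of any cursor. */
    byte[] readAt(long position, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length);
        try {
            long pos = position;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n < 0) {
                    throw new IndexOutOfBoundsException("read past end of file");
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.array();
    }

    /** Closes the channel and deletes the temp file. */
    @Override
    public void close() {
        try {
            channel.close();
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, spooled.readAt(spooled.size() - 1, 1) returns the file's last byte without reading anything before it. The cost is one full extra copy of the file on local disk.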

Appreciate any insights and if this approach is completely wrong please let me know. 

Ronald Matamoros

-----Original Message-----
From: Owen O'Malley <owen.omalley@gmail.com>
Sent: Thursday, February 20, 2020 8:12 PM
To: Matamoros, Ronald <ronald.matamoros@accenture.com>
Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barquero@accenture.com>;
Ortega Ugalde, Ronny <ronny.ortega.ugalde@accenture.com>
Subject: Re: [External] Re: Creating a Reader from a Java InputStream

What is the use case that you are working on that only provides you with an InputStream?

.. Owen

> On Feb 20, 2020, at 13:09, Matamoros, Ronald <ronald.matamoros@accenture.com> wrote:
> Hi Owen, thanks for the feedback and recommendations.
> The current requirement is a one-shot deal: capture all records in the ORC file
> to be consumed individually by another phase further down the solution's pipeline (read once).
> I guess the seek/position is required even if the read operation is just going forward
> over the records?
> I will try making the wrapper and watch out for your PositionedReadable extension.
> Regards,
> Ronald Matamoros
> From: Owen O'Malley <owen.omalley@gmail.com>
> Sent: Thursday, February 20, 2020 2:18 PM
> To: Matamoros, Ronald <ronald.matamoros@accenture.com>
> Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barquero@accenture.com>;
> Ortega Ugalde, Ronny <ronny.ortega.ugalde@accenture.com>
> Subject: [External] Re: Creating a Reader from a Java InputStream
> Just to be a little more clear, Java’s InputStream doesn’t provide the primitive
> methods that we need. We’d always need a sub-interface that provides positioned reads,
> and there hasn’t been any consensus about which extension to use.
> Effectively what we need is Hadoop’s PositionedReadable with ByteBuffers. I’m actually
> currently defining the extension to PositionedReadable to add an async read method with
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
> .. Owen
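
To illustrate the point about positioned reads: the ORC format keeps its postscript and footer at the end of the file, with the very last byte holding the postscript length, so a reader starts at the tail and jumps backwards. A plain InputStream can only move forward. The sketch below is one hedged way to bridge that gap for small files. BufferedPositionedSource is a hypothetical name, not Hadoop's PositionedReadable and not an ORC API. It assumes Java 9+ for InputStream.readAllBytes and holds the entire file in memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical helper: buffers a whole stream so it can serve positioned reads. */
final class BufferedPositionedSource {
    private final byte[] data;

    private BufferedPositionedSource(byte[] data) {
        this.data = data;
    }

    /** Drains the stream completely; memory use equals the file size. */
    static BufferedPositionedSource drain(InputStream in) {
        try {
            return new BufferedPositionedSource(in.readAllBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Total number of bytes, known only after the stream has been drained. */
    long length() {
        return data.length;
    }

    /** Fills dst from an absolute position; there is no shared read cursor. */
    void readFully(long position, ByteBuffer dst) {
        int wanted = dst.remaining();
        if (position < 0 || position + wanted > data.length) {
            throw new IndexOutOfBoundsException(
                "range [" + position + ", " + (position + wanted) + ") is outside the data");
        }
        dst.put(data, (int) position, wanted);
    }
}
```

A caller would first read the final byte with readFully(source.length() - 1, oneByteBuffer), which is exactly the kind of access that requires knowing the file size up front.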
> On Feb 20, 2020, at 11:01, Matamoros, Ronald <ronald.matamoros@accenture.com> wrote: