orc-user mailing list archives

From: "Matamoros, Ronald" <ronald.matamo...@accenture.com>
Subject: RE: [External] Re: Creating a Reader from a Java InputStream
Date: Fri, 21 Feb 2020 03:21:26 GMT

Hi Owen,

We have a custom connector that pulls all different sorts of files from a remote Hadoop/HDFS.
One of the types we have to support is Orc, among others. 
Each record from the Orc file will be processed individually at a later stage, so I am implementing
the record extractor that sits in the middle.
At that later stage the record could come from Parquet, Orc, CSV, an RDBMS, etc.

The connector already does the work of referencing the path and reading the file into a Java
InputStream.
By the time my record extractor gets the file, it is already an InputStream instance.
My first thought was that, since the InputStream is already available, I might as well use it.
Of course, there are performance and memory-usage considerations.
I can always go with the option of writing the stream temporarily to local disk to traverse
the records (especially since these files can be large).
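
A minimal sketch of that temp-file fallback, using only the JDK. The class name StreamSpooler and the methods spoolToTempFile and readTail are hypothetical names chosen for illustration; they are not part of ORC or Hadoop. The sketch copies the InputStream to a local temp file and then does one positioned read of the file's last bytes, the kind of random access a forward-only InputStream cannot offer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class StreamSpooler {

    /** Copies the whole stream to a new temp file and returns its path (hypothetical helper). */
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("orc-spool-", ".orc");
        // createTempFile already created the file, so REPLACE_EXISTING is required here.
        // Files.copy transfers the data in chunks rather than loading the file into memory.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Positioned read: returns up to the last n bytes of the file (hypothetical helper). */
    static byte[] readTail(Path file, int n) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            int len = (int) Math.min(n, size);
            long start = size - len;
            ByteBuffer buf = ByteBuffer.allocate(len);
            while (buf.hasRemaining()) {
                // Absolute read: does not move the channel's own position.
                int r = ch.read(buf, start + buf.position());
                if (r < 0) {
                    break; // unexpected end of file
                }
            }
            return buf.array();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes for demonstration only; this is not a valid ORC file.
        InputStream in = new ByteArrayInputStream("hello-ORC".getBytes(StandardCharsets.US_ASCII));
        Path tmp = spoolToTempFile(in);
        try {
            System.out.println(new String(readTail(tmp, 3), StandardCharsets.US_ASCII));
        } finally {
            Files.delete(tmp); // clean up the spooled copy
        }
    }
}
```

With the data on local disk, the spooled path could then be handed to the regular ORC reader (for example via OrcFile.createReader with a Hadoop Path pointing at the local file), at the cost of one extra full copy of the data.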

Appreciate any insights and if this approach is completely wrong please let me know. 

Regards
Ronald Matamoros

-----Original Message-----
From: Owen O'Malley <owen.omalley@gmail.com> 
Sent: Thursday, February 20, 2020 8:12 PM
To: Matamoros, Ronald <ronald.matamoros@accenture.com>
Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barquero@accenture.com>; Ortega
Ugalde, Ronny <ronny.ortega.ugalde@accenture.com>
Subject: Re: [External] Re: Creating a Reader from a Java InputStream

What is the use case that you are working on that only provides you with an InputStream?

.. Owen

> On Feb 20, 2020, at 13:09, Matamoros, Ronald <ronald.matamoros@accenture.com> wrote:
> 
> Hi Owen, thanks for the feedback and recommendations.
> 
> In the current requirement it is a one-shot deal: capture all records in the ORC file to be consumed individually by another phase further down the solution's pipeline (read once).
> I guess seek/position support is required even if the read operation only moves forward over the records?
> 
> Will try making the wrapper and watching out for your PositionedReadable extension.
> 
> Regards,
> Ronald Matamoros
> 
> From: Owen O'Malley <owen.omalley@gmail.com>
> Sent: Thursday, February 20, 2020 2:18 PM
> To: Matamoros, Ronald <ronald.matamoros@accenture.com>
> Cc: user@orc.apache.org; Sobrado Barquero, H. <h.sobrado.barquero@accenture.com>; Ortega Ugalde, Ronny <ronny.ortega.ugalde@accenture.com>
> Subject: [External] Re: Creating a Reader from a Java InputStream
> 
> Just to be a little more clear, Java’s InputStream doesn’t provide the primitive methods that we need. We’d always need a sub-interface that provides positioned reads, and there hasn’t been any consensus about which extension to use.
> 
> Effectively what we need is Hadoop’s PositionedReadable with ByteBuffers. I’m actually currently defining the extension to PositionedReadable to add an async read method with ByteBuffers.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
> 
> .. Owen
> 
> 
> On Feb 20, 2020, at 11:01, Matamoros, Ronald <ronald.matamoros@accenture.com> wrote: