orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: [External] Re: Creating a Reader from a Java InputStream
Date Mon, 24 Feb 2020 22:50:27 GMT
Ok, for the short term, what I'd propose is a wrapper that creates a
FileSystem that only knows about the stream that you want to read as an ORC
file. Take a look at https://github.com/apache/orc/pull/486 .

The usage looks like:

FileSystem fs = new StreamWrapperFileSystem(stream, new Path("foo"),
                                            fileSize, conf);
try (Reader reader = OrcFile.createReader(new Path("foo"),
         OrcFile.readerOptions(conf).filesystem(fs))) {
   ...
}

Please comment on the pull request if this matches what you need.

.. Owen
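
For anyone who needs a stopgap without Hadoop on the classpath, the temp-file fallback mentioned further down in this thread can be sketched with the JDK alone. This is only an illustration, not part of ORC or of the pull request: the class and method names (SpooledStream, spool, readAt, readTail) are hypothetical. It copies the InputStream to local disk once, then serves positioned reads through FileChannel. That kind of access is what a plain InputStream lacks, and an ORC reader needs it even for a single forward pass, because the file's postscript and footer sit at the end of the file and have to be read first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical helper, JDK only; not an ORC or Hadoop API.
class SpooledStream {

    /** Copies the whole stream to a temp file so it can be read at arbitrary offsets. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spooled-", ".orc");
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Positioned read: reads up to length bytes at offset without moving the channel's own position. */
    static byte[] readAt(FileChannel ch, long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos);
            if (n < 0) {
                break; // reached end of file before the buffer was full
            }
            pos += n;
        }
        return Arrays.copyOf(buf.array(), buf.position());
    }

    /** Reads the last length bytes of the file (or the whole file if it is shorter). */
    static byte[] readTail(Path file, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            int n = (int) Math.min(length, size);
            return readAt(ch, size - n, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        Path tmp = spool(new ByteArrayInputStream(data));
        try {
            // The tail is available without consuming the rest of the file first.
            System.out.println(new String(readTail(tmp, 4), StandardCharsets.US_ASCII)); // prints 6789
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one full copy of the data to local disk, which may matter for large files; the wrapper FileSystem above avoids that when the underlying stream already supports positioned reads.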

On Thu, Feb 20, 2020 at 8:16 PM Owen O'Malley <owen.omalley@gmail.com>
wrote:

> If you are using HDFS, we could add an API to allow you to pass in a
> FSDataInputStream
> <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FSDataInputStream.html>,
> which is a subclass of InputStream. That class is returned from Hadoop's
> fs.open(path)
> <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#open-org.apache.hadoop.fs.Path->
> and it does allow the reader to do positioned reads in the stream. I have
> an additional concern: there is a set of users who would really like an
> ORC reader that doesn't depend on Hadoop, so I'm hesitant to add yet
> another Hadoop class to the API.
>
> Let me think about this a bit and come up with a proposal for the new API.
>
> .. Owen
>
> On Thu, Feb 20, 2020 at 7:21 PM Matamoros, Ronald <
> ronald.matamoros@accenture.com> wrote:
>
>> Hi Owen,
>>
>> We have a custom connector that pulls all different sorts of files from a
>> remote Hadoop/HDFS.
>> One of the types we have to support is ORC, among others.
>> Each record from the ORC file will be processed individually at a later
>> stage, so I am implementing the record extractor (in the middle).
>> At that later stage, it could be a record from Parquet, ORC, CSV, RDBMS,
>> etc.
>>
>> The connector already does the work of referencing the path and reading
>> the file into a Java InputStream.
>> By the time my record extractor gets the file it is already an
>> InputStream instance.
>> My first thought was that, since the InputStream is already available, I
>> might as well use it.
>> Of course there are performance and memory-usage considerations.
>> I can always go with the option of writing the stream temporarily to
>> local disk to traverse the records (especially since these files can be
>> large).
>>
>> Appreciate any insights and if this approach is completely wrong please
>> let me know.
>>
>> Regards
>> Ronald Matamoros
>>
>> -----Original Message-----
>> From: Owen O'Malley <owen.omalley@gmail.com>
>> Sent: Thursday, February 20, 2020 8:12 PM
>> To: Matamoros, Ronald <ronald.matamoros@accenture.com>
>> Cc: user@orc.apache.org; Sobrado Barquero, H. <
>> h.sobrado.barquero@accenture.com>; Ortega Ugalde, Ronny <
>> ronny.ortega.ugalde@accenture.com>
>> Subject: Re: [External] Re: Creating a Reader from a Java InputStream
>>
>> What is the use case that you are working on that only provides you with
>> an InputStream?
>>
>> .. Owen
>>
>> > On Feb 20, 2020, at 13:09, Matamoros, Ronald <
>> ronald.matamoros@accenture.com> wrote:
>> >
>> > Hi Owen, thanks for the feedback and recommendations.
>> >
>> > In the current requirement it is a one-shot deal: capture all records
>> in the ORC file to be consumed individually by another phase down the
>> solution's pipeline (read once).
>> > I guess the seek/position is required even if the read operation is
>> just going forward over the records?
>> >
>> > Will try making the wrapper and watching out for your
>> PositionedReadable extension.
>> >
>> > Regards,
>> > Ronald Matamoros
>> >
>> > From: Owen O'Malley <owen.omalley@gmail.com>
>> > Sent: Thursday, February 20, 2020 2:18 PM
>> > To: Matamoros, Ronald <ronald.matamoros@accenture.com>
>> > Cc: user@orc.apache.org; Sobrado Barquero, H. <
>> h.sobrado.barquero@accenture.com>; Ortega Ugalde, Ronny <
>> ronny.ortega.ugalde@accenture.com>
>> > Subject: [External] Re: Creating a Reader from a Java InputStream
>> >
>> > Just to be a little more clear, Java’s InputStream doesn’t provide the
>> primitive methods that we need. We’d always need a subinterface that
>> provides positioned reads, and there hasn’t been any consensus about which
>> extension to use.
>> >
>> > Effectively what we need is Hadoop’s PositionedReadable with
>> ByteBuffers. I’m actually currently defining the extension to
>> PositionedReadable to add an async read method with ByteBuffers.
>> >
>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/PositionedReadable.html
>> >
>> > .. Owen
>> >
>> >
>> > On Feb 20, 2020, at 11:01, Matamoros, Ronald <
>> ronald.matamoros@accenture.com> wrote:
>> >
>>
>
