avro-dev mailing list archives

From "Matthew Stowe (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AVRO-2098) Avro OCF support for non-seekable stream.
Date Fri, 20 Oct 2017 13:24:00 GMT
Matthew Stowe created AVRO-2098:
-----------------------------------

             Summary: Avro OCF support for non-seekable stream.
                 Key: AVRO-2098
                 URL: https://issues.apache.org/jira/browse/AVRO-2098
             Project: Avro
          Issue Type: New Feature
          Components: csharp
    Affects Versions: 1.8.2
         Environment: csharp
Azure Data Lake Analytics
            Reporter: Matthew Stowe
            Priority: Minor


The Microsoft Azure environment supports saving Apache Avro files from an Event Hub via a
feature called Event Hub Capture.  Event Hub Capture can be configured to write its output to
Azure Data Lake Storage (ADLS).

When saving files to ADLS, it is common to use Azure Data Lake Analytics (ADLA) to run batch
processing jobs in U-SQL over the raw storage files.  For this purpose ADLA supports extractors
that understand the format of the file (e.g. Avro OCF) and extract its contents for downstream
manipulation and filtering.

An issue I have encountered with the existing csharp implementation is that the DataFileReader
relies on the provided stream to support seeking.  However, the stream provided by ADLA does
not support seeking.  This leaves the integrating developer with two options:

1 is to read the entire stream into memory and provide a memory-backed stream to the DataFileReader.
This is not ideal, as files can be large and consuming a lot of memory at once during processing
may have undesired effects on ADLA's ability to process files in parallel, since resources are
limited.

2 is to enhance the DataFileReader so that it can work with streams that are not seekable.
With respect to this option, I have implemented a short-term workaround that wraps a non-seekable
stream and allows seeking in the pattern employed by the DataFileReader, until this feature
has been reviewed and potentially implemented.  My workaround is brittle, liable to break
as the DataFileReader evolves, and not the desired long-term approach to dealing with this
issue.

[AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]
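
To illustrate the general idea behind option 2, here is a minimal sketch of a wrapper that
makes a forward-only stream seekable within a bounded window of recently read bytes.  It is
written in Python for brevity and is not the linked C# workaround.  The class name
ReplayWindowStream and the window parameter are hypothetical, and the sketch assumes the
reader only seeks backwards over a small, recent range, which may not match the exact
pattern the DataFileReader uses.

```python
import io


class ReplayWindowStream(io.RawIOBase):
    """Hypothetical wrapper: lets a forward-only stream be re-read within a
    bounded window of recent bytes. Not part of any Avro library."""

    def __init__(self, raw, window=64 * 1024):
        super().__init__()
        self._raw = raw            # underlying stream; only read() is ever called on it
        self._window = window      # bytes kept behind the position for backward seeks
        self._buf = bytearray()    # most recent bytes pulled from the underlying stream
        self._buf_start = 0        # absolute offset of self._buf[0]
        self._pos = 0              # logical position seen by the caller

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def _fill(self, target):
        # Pull from the underlying stream until `target` bytes are available or EOF.
        end = self._buf_start + len(self._buf)
        while end < target:
            chunk = self._raw.read(target - end)
            if not chunk:
                break
            self._buf += chunk
            end += len(chunk)

    def _trim(self):
        # Discard bytes that fall more than `window` behind the current position.
        excess = (self._pos - self._window) - self._buf_start
        if excess > 0:
            del self._buf[:excess]
            self._buf_start += excess

    def readinto(self, b):
        self._fill(self._pos + len(b))
        off = self._pos - self._buf_start
        data = self._buf[off:off + len(b)]
        n = len(data)
        b[:n] = data
        self._pos += n
        self._trim()
        return n

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence != io.SEEK_SET:
            raise io.UnsupportedOperation("SEEK_END needs the total length")
        if offset < self._buf_start:
            raise io.UnsupportedOperation("target is outside the replay window")
        # A forward seek is served by reading ahead; this simple version buffers
        # the skipped bytes briefly before trimming them.
        self._fill(offset)
        self._pos = min(offset, self._buf_start + len(self._buf))
        self._trim()
        return self._pos
```

Memory use stays close to the window size during sequential reads, in contrast to option 1,
but any seek further back than the window fails.  That is the same kind of brittleness
described above: the wrapper only works while the reader's seek pattern stays within what the
wrapper anticipates.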

Cheers,
Matt



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
