beam-commits mailing list archives

From "Tibor Kiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-778) Make fileio._CompressedFile seekable.
Date Fri, 24 Mar 2017 12:32:41 GMT

    [ https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940239#comment-15940239 ]

Tibor Kiss commented on BEAM-778:
---------------------------------

The current implementation maintains a local {{_read_buffer}} object that is used throughout the read path.
I suspect {{_read_buffer}} exists to bridge the gap between what the zlib module offers (block-level compress and decompress only) and the operations a file object must provide (reading bytes and reading lines).
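
To illustrate the gap described above, here is a minimal sketch of why a local buffer is needed when only block decompression is available. It is not the actual {{_CompressedFile}} code: the class name {{BufferedZlibReader}}, the {{_fill}} helper and the {{chunk_size}} parameter are hypothetical names chosen for this example, and the sample data is made up.

```python
import gzip
import io
import zlib


class BufferedZlibReader(object):
    """Hypothetical illustration only, not Beam's _CompressedFile."""

    def __init__(self, fileobj, chunk_size=16 * 1024):
        self._file = fileobj
        self._chunk_size = chunk_size
        # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
        self._read_buffer = b""
        self._eof = False

    def _fill(self, enough):
        # zlib only decompresses blocks, so keep decompressing into the
        # buffer until enough(buffer) holds or the input is exhausted.
        while not self._eof and not enough(self._read_buffer):
            chunk = self._file.read(self._chunk_size)
            if chunk:
                self._read_buffer += self._decompressor.decompress(chunk)
            else:
                self._read_buffer += self._decompressor.flush()
                self._eof = True

    def read(self, num_bytes):
        self._fill(lambda buf: len(buf) >= num_bytes)
        result = self._read_buffer[:num_bytes]
        self._read_buffer = self._read_buffer[num_bytes:]
        return result

    def readline(self):
        self._fill(lambda buf: b"\n" in buf)
        end = self._read_buffer.find(b"\n") + 1
        if end == 0:  # no newline left: return whatever remains
            end = len(self._read_buffer)
        result = self._read_buffer[:end]
        self._read_buffer = self._read_buffer[end:]
        return result


# Usage with made-up sample data: compress in memory, then read it back.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as writer:
    writer.write(b"first\nsecond\nthird")
raw.seek(0)
reader = BufferedZlibReader(raw, chunk_size=4)
line_one = reader.readline()
three_bytes = reader.read(3)
```

Note that nothing in this sketch tracks positions in the uncompressed stream, which is why seek() and tell() do not come naturally with this approach.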

My impression is that if we replaced the zlib module with gzip (which builds on top of zlib), we could simply forward the read operations to gzip's respective methods without needing a local buffer.

The bz2 module also supports read operations.
As a bonus, seek() functionality would come for 'free', since both gzip and bz2 file objects support seek() and tell().
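
As a rough sketch of the idea (not a proposed patch), the snippet below reads, seeks and tells on gzip and bz2 file objects without any local buffer. The sample payload and variable names are made up for illustration. One assumption to flag: passing a file object to {{bz2.BZ2File}} requires Python 3.3 or later; on Python 2 it accepts only a filename.

```python
import bz2
import gzip
import io

# Made-up sample data for the illustration.
payload = b"line one\nline two\nline three\n"

# gzip: GzipFile can wrap an already-open file-like object via fileobj=.
gz_raw = io.BytesIO()
with gzip.GzipFile(fileobj=gz_raw, mode="wb") as writer:
    writer.write(payload)
gz_raw.seek(0)

gz = gzip.GzipFile(fileobj=gz_raw, mode="rb")
first_line = gz.readline()   # line-oriented read, no buffer of our own
position = gz.tell()         # offset within the uncompressed data
chunk = gz.read(4)           # byte-oriented read
gz.seek(0)                   # backward seek: rewinds and decompresses again
again = gz.readline()

# bz2: assumes Python 3.3+, where BZ2File accepts a file object.
bz_raw = io.BytesIO(bz2.compress(payload))
bz = bz2.BZ2File(bz_raw, mode="rb")
bz.seek(9)                   # skip past the first 9 uncompressed bytes
bz_line = bz.readline()
bz_pos = bz.tell()
```

Both modules emulate seeking by decompressing from the start of the stream when needed, so seek() can be slow on large files even though it is available.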

[~robertwb] / [~sbilac] / [~katsiapis@google.com] / [~altay]:
Did you consider using the gzip module?
What are your thoughts on dropping {{_read_buffer}} and forwarding the file operations to bz2/gzip instead?

> Make fileio._CompressedFile seekable.
> -------------------------------------
>
>                 Key: BEAM-778
>                 URL: https://issues.apache.org/jira/browse/BEAM-778
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>            Assignee: Tibor Kiss
>             Fix For: Not applicable
>
>
> We have a TODO to make fileio._CompressedFile seekable.
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692
> Without this, compressed file objects produced for FileBasedSource implementations may not be usable with libraries that rely on the seek() and tell() methods.
> For example tarfile.open().



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
