flink-issues mailing list archives

From "Gyula Fora (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-20276) Transparent DeCompression of streams missing on new File Source
Date Sun, 22 Nov 2020 18:39:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-20276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236994#comment-17236994 ]

Gyula Fora commented on FLINK-20276:
------------------------------------

Some compressed formats, such as bzip2, are splittable at block boundaries (when read through
a suitable codec such as Hadoop's bzip2 codec), but this seems fairly tricky to integrate with
the current FileInputFormat. The problem is that the InputFormat itself tracks the number of
bytes it has read, instead of getting the actual offsets within the compressed file splits.

I wonder whether this is worth thinking about at this point (for the new File Source), or
whether we can simply deal with it later. What do you think, [~sewen]?
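
To illustrate the mismatch described above, here is a minimal sketch that compares the
length of a compressed file with the number of decompressed bytes a reader sees. It is not
Flink code: the class and method names are hypothetical, and gzip (which is not splittable)
stands in for bzip2 only because it ships with the JDK. The point is just that a count of
decompressed bytes says nothing about the position in the compressed file, which is where
split boundaries are defined.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Flink.
public class OffsetMismatch {

    /** Compresses the given bytes with gzip and returns the compressed "file". */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    /** Counts bytes the way a reader on top of the decompressing stream would. */
    static long countDecompressedBytes(byte[] compressed) throws IOException {
        long total = 0;
        byte[] buf = new byte[4096];
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // all zeros, so highly compressible
        byte[] compressed = gzip(data);
        System.out.println("compressed file length:  " + compressed.length);
        System.out.println("decompressed bytes read: " + countDecompressedBytes(compressed));
    }
}
```

A split of the compressed file is described by an offset and a length in compressed bytes,
so a reader that only counts decompressed bytes cannot tell when it has crossed the end of
its split.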

> Transparent DeCompression of streams missing on new File Source
> ---------------------------------------------------------------
>
>                 Key: FLINK-20276
>                 URL: https://issues.apache.org/jira/browse/FLINK-20276
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.12.0
>
>
> The existing {{FileInputFormat}} applies decompression (gzip, xz, ...) automatically on the file input stream, based on the file extension.
> We need to add similar functionality for the {{StreamRecordFormat}} of the new FileSource to be on par with this functionality.
> This can be easily applied in the {{StreamFormatAdapter}} when opening the file stream.
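
As a rough illustration of the extension-based decompression described in the issue, the
sketch below wraps a raw input stream in a decompressor chosen from the file name. It is
an assumption-laden example, not the Flink implementation: the class and method names are
hypothetical, and only the codecs available in java.util.zip are covered (bzip2 and xz
would need an additional library).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper class, not part of Flink.
public class ExtensionDecompression {

    /** Wraps the raw stream in a decompressor selected by the file extension. */
    static InputStream openDecompressed(String fileName, InputStream raw) throws IOException {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "gz":
            case "gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                return new InflaterInputStream(raw); // zlib-wrapped deflate data
            default:
                return raw; // unknown extension: pass the bytes through unchanged
        }
    }

    /** Reads a stream to the end and returns its contents. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    /** Test helper: gzip-compresses a UTF-8 string. */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("hello flink");
        try (InputStream in = openDecompressed("part-0.gz", new ByteArrayInputStream(compressed))) {
            System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
        }
    }
}
```

The record format reading from the returned stream never needs to know whether the file
was compressed, which is what makes the decompression transparent.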



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
