flink-issues mailing list archives

From "Mikhail Lipkovich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-5944) Flink should support reading Snappy Files
Date Mon, 04 Sep 2017 09:00:16 GMT

    [ https://issues.apache.org/jira/browse/FLINK-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152293#comment-16152293 ]

Mikhail Lipkovich commented on FLINK-5944:
------------------------------------------

I have started working on this issue and would like your opinion on one question.
The codec for an InputFormat is selected based on the file extension (e.g. '.gzip' or '.snappy').
The question is how to distinguish whether the Hadoop Snappy codec or the Java Snappy codec
is needed.

I can propose the following options:
1. Add a new config option to flink-conf.yaml, such as fs.hadoop-snappy, and select the
InputStreamFactory based on this option.
2. Add a flag parameter to the readTextFile API method indicating whether the file is Hadoop Snappy.
3. Add a separate API method for reading Snappy-compressed files.
4. Ask users to use the '.snappy' extension for Java Snappy and a different extension, such as
'.hsnappy', for Hadoop Snappy.
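
To make options 1 and 4 easier to compare, here is a minimal sketch of extension-based selection in plain Java. It is not Flink code: the class SnappyCodecSelector, the Codec enum, and the hadoopSnappyDefault parameter are hypothetical names used only for illustration. The boolean stands in for a config option such as fs.hadoop-snappy (option 1), and the '.hsnappy' extension is the one suggested in option 4.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not part of Flink): picks a codec from a file name. */
final class SnappyCodecSelector {

    /** Illustrative codec labels; these names are assumptions. */
    enum Codec { GZIP, JAVA_SNAPPY, HADOOP_SNAPPY }

    private SnappyCodecSelector() {}

    /**
     * Selects a codec from the file extension.
     *
     * @param fileName            name or path of the input file
     * @param hadoopSnappyDefault stand-in for a config flag (option 1): when true,
     *                            '.snappy' files are treated as Hadoop Snappy
     * @return the selected codec, or empty if the extension is missing or unknown
     */
    static Optional<Codec> select(String fileName, boolean hadoopSnappyDefault) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension, so treat the file as uncompressed
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "gz":
            case "gzip":
                return Optional.of(Codec.GZIP);
            case "hsnappy":
                // Option 4: a dedicated extension always means Hadoop Snappy.
                return Optional.of(Codec.HADOOP_SNAPPY);
            case "snappy":
                // Option 1: the flag decides how an ambiguous '.snappy' file is read.
                return Optional.of(
                        hadoopSnappyDefault ? Codec.HADOOP_SNAPPY : Codec.JAVA_SNAPPY);
            default:
                return Optional.empty();
        }
    }
}
```

With this sketch, SnappyCodecSelector.select("part-0.snappy", false) yields JAVA_SNAPPY, while the same call with true, or any file ending in '.hsnappy', yields HADOOP_SNAPPY. Options 2 and 3 would instead move the decision into the API call itself.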

> Flink should support reading Snappy Files
> -----------------------------------------
>
>                 Key: FLINK-5944
>                 URL: https://issues.apache.org/jira/browse/FLINK-5944
>             Project: Flink
>          Issue Type: New Feature
>          Components: Batch Connectors and Input/Output Formats
>            Reporter: Ilya Ganelin
>            Assignee: Mikhail Lipkovich
>              Labels: features
>
> Snappy is an extremely performant compression format that's widely used offering fast
> decompression/compression.
> This can be easily implemented by creating a SnappyInflaterInputStreamFactory and updating
> the initDefaultInflateInputStreamFactories in FileInputFormat.
> Flink already includes the Snappy dependency in the project.
> There is a minor gotcha in this. If we wish to use this with Hadoop, then we must provide
> two separate implementations since Hadoop uses a different version of the snappy format than
> Snappy Java (which is the xerial/snappy included in Flink).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
