flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1396) Add hadoop input formats directly to the user API.
Date Wed, 04 Feb 2015 20:47:36 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305914#comment-14305914 ]

ASF GitHub Bot commented on FLINK-1396:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/363#discussion_r24117856
  
    --- Diff: docs/hadoop_compatibility.md ---
    @@ -52,56 +63,70 @@ Add the following dependency to your `pom.xml` to use the Hadoop Compatibility Layer.
     
     ### Using Hadoop Data Types
     
    -Flink supports all Hadoop `Writable` and `WritableComparable` data types out-of-the-box. You do not need to include the Hadoop Compatibility dependency, if you only want to use your Hadoop data types. See the [Programming Guide](programming_guide.html#data-types) for more details.
    +Flink supports all Hadoop `Writable` and `WritableComparable` data types
    +out-of-the-box. You do not need to include the Hadoop Compatibility dependency
    +if you only want to use your Hadoop data types. See the
    +[Programming Guide](programming_guide.html#data-types) for more details.
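    +
    +For illustration, a minimal sketch of Hadoop types used directly as Flink
    +data types (the example data is hypothetical):
    +
    +~~~java
    +ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +
    +// Hadoop Writables can be element types without the compatibility dependency.
    +DataSet<Text> words = env.fromElements(new Text("hadoop"), new Text("flink"));
    +
    +// They can also be produced and consumed by user functions.
    +DataSet<IntWritable> lengths = words.map(new MapFunction<Text, IntWritable>() {
    +  @Override
    +  public IntWritable map(Text word) {
    +    return new IntWritable(word.getLength());
    +  }
    +});
    +~~~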
     
     ### Using Hadoop InputFormats
     
    -Flink provides a compatibility wrapper for Hadoop `InputFormats`. Any class that implements `org.apache.hadoop.mapred.InputFormat` or extends `org.apache.hadoop.mapreduce.InputFormat` is supported. Thus, Flink can handle Hadoop built-in formats such as `TextInputFormat` as well as external formats such as Hive's `HCatInputFormat`. Data read from Hadoop InputFormats is converted into a `DataSet<Tuple2<KEY,VALUE>>` where `KEY` is the key and `VALUE` is the value of the original Hadoop key-value pair.
    -
    -Flink's InputFormat wrappers are 
    -
    -- `org.apache.flink.hadoopcompatibility.mapred.HadoopInputFormat` and 
    -- `org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat`
    +Hadoop input formats can be used to create a data source by using
    +one of the methods `readHadoopFile` or `createHadoopInput` of the
    +`ExecutionEnvironment`. The former is used for input formats derived
    +from `FileInputFormat` while the latter has to be used for general purpose
    +input formats (a sketch of the latter follows the example below).
     
    -and can be used as regular Flink [InputFormats](programming_guide.html#data-sources).
    +The resulting `DataSet` contains 2-tuples where the first field
    +is the key and the second field is the value retrieved from the Hadoop
    +InputFormat.
     
     The following example shows how to use Hadoop's `TextInputFormat`.
     
    +<div class="codetabs" markdown="1">
    +<div data-lang="java" markdown="1">
    +
     ~~~java
     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    -		
    -// Set up the Hadoop TextInputFormat.
    -Job job = Job.getInstance();
    -HadoopInputFormat<LongWritable, Text> hadoopIF = 
    -  // create the Flink wrapper.
    -  new HadoopInputFormat<LongWritable, Text>(
    -    // create the Hadoop InputFormat, specify key and value type, and job.
    -    new TextInputFormat(), LongWritable.class, Text.class, job
    -  );
    -TextInputFormat.addInputPath(job, new Path(inputPath));
    -		
    -// Read data using the Hadoop TextInputFormat.
    -DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
    +
    +DataSet<Tuple2<LongWritable, Text>> input =
    +    env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
     
     // Do something with the data.
     [...]
     ~~~
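    +
    +The `createHadoopInput` variant is analogous. A minimal sketch, where
    +`MyInputFormat` is a placeholder for any general purpose
    +`org.apache.hadoop.mapreduce.InputFormat`:
    +
    +~~~java
    +ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    +
    +// Configure the general purpose InputFormat through a Hadoop Job as usual.
    +Job job = Job.getInstance();
    +
    +DataSet<Tuple2<LongWritable, Text>> input =
    +    env.createHadoopInput(new MyInputFormat(), LongWritable.class, Text.class, job);
    +~~~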
     
    -### Using Hadoop OutputFormats
    +</div>
    +<div data-lang="scala" markdown="1">
     
    -Flink provides a compatibility wrapper for Hadoop `OutputFormats`. Any class that implements `org.apache.hadoop.mapred.OutputFormat` or extends `org.apache.hadoop.mapreduce.OutputFormat` is supported. The OutputFormat wrapper expects its input data to be a `DataSet<Tuple2<KEY,VALUE>>` where `KEY` is the key and `VALUE` is the value of the Hadoop key-value pair that is processed by the Hadoop OutputFormat.
    +~~~scala
    +val env = ExecutionEnvironment.getExecutionEnvironment
    +		
    +val input: DataSet[(LongWritable, Text)] =
    +  env.readHadoopFile(new TextInputFormat, classOf[LongWritable], classOf[Text], textPath)
     
    -Flink's OUtputFormat wrappers are
    +// Do something with the data.
    +[...]
    +~~~
    +
    +</div>
     
    -- `org.apache.flink.hadoopcompatibility.mapred.HadoopOutputFormat` and 
    -- `org.apache.flink.hadoopcompatibility.mapreduce.HadoopOutputFormat`
    +</div>
    +
    +### Using Hadoop OutputFormats
     
    -and can be used as regular Flink [OutputFormats](programming_guide.html#data-sinks).
    +Flink provides a compatibility wrapper for Hadoop `OutputFormats`. Any class
    +that implements `org.apache.hadoop.mapred.OutputFormat` or extend
    --- End diff --
    
    extend -> extends
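
    For context, a minimal sketch (not from this PR) of the wrapper usage the
    truncated paragraph describes, assuming the `mapred` `HadoopOutputFormat`
    wrapper listed above, Hadoop's `TextOutputFormat`, and a placeholder
    `outputPath`:
    
    ~~~java
    // A DataSet of key-value 2-tuples, as the wrapper expects.
    DataSet<Tuple2<Text, IntWritable>> result = [...];
    
    // Wrap Hadoop's TextOutputFormat and set the output path on its JobConf.
    HadoopOutputFormat<Text, IntWritable> hadoopOF =
        new HadoopOutputFormat<Text, IntWritable>(
            new TextOutputFormat<Text, IntWritable>(), new JobConf());
    FileOutputFormat.setOutputPath(hadoopOF.getJobConf(), new Path(outputPath));
    
    // Emit the DataSet through the wrapped Hadoop OutputFormat.
    result.output(hadoopOF);
    ~~~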


> Add hadoop input formats directly to the user API.
> --------------------------------------------------
>
>                 Key: FLINK-1396
>                 URL: https://issues.apache.org/jira/browse/FLINK-1396
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Robert Metzger
>            Assignee: Aljoscha Krettek
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
