flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-838) GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere
Date Thu, 07 Aug 2014 22:11:14 GMT

    [ https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089930#comment-14089930 ]

Fabian Hueske commented on FLINK-838:
-------------------------------------

I created a working prototype to support custom partitioners, sorters, and grouping comparators for Hadoop ReduceFunctions.

From an API point of view, the integration is quite nice. You call {{DataSet.reduce(HadoopReduceFunction)}} and the system automatically extracts the partitioner, sorter, and grouper from the Hadoop JobConf, i.e., the user does not need to define a grouping key. This might not be the optimal solution if only the function (and not the partitioner, sorter, and grouper) should be used; however, we can extend the API for this case.
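
To make the intended usage concrete, here is a rough sketch (MyReducer, MyPartitioner, MySortComparator, and MyGroupComparator are placeholder user classes, and the HadoopReduceFunction constructor shown here is an assumption, not the final API):

{code:java}
// Rough usage sketch (not final API). MyReducer, MyPartitioner, MySortComparator,
// and MyGroupComparator are placeholders; only the JobConf setters are standard Hadoop API.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public static DataSet<Tuple2<Text, IntWritable>> reduceWithHadoopFunction(
        DataSet<Tuple2<Text, IntWritable>> input) {

  JobConf jobConf = new JobConf();
  jobConf.setReducerClass(MyReducer.class);
  // These are the hooks the prototype extracts automatically:
  jobConf.setPartitionerClass(MyPartitioner.class);                  // custom partitioning
  jobConf.setOutputKeyComparatorClass(MySortComparator.class);       // custom sort order
  jobConf.setOutputValueGroupingComparator(MyGroupComparator.class); // custom grouping

  // No grouping key has to be specified on the Flink side:
  return input.reduce(new HadoopReduceFunction<Text, IntWritable>(jobConf));
}
{code}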

This ease of use comes at a price. Internally, we have a special HadoopReduceDriver at the execution level, which sorts the data with the (wrapped) custom sort comparator and groups it with the (wrapped) custom grouping comparator. The partitioning is done similarly to the KeySelector in the regular API. We also need special operators (java-api and generic), an optimizer node, a driver strategy, etc. for the HadoopReduce. These are only a few lines of code each (thanks to the fantastic design of the optimizer!), but they still add a lot of classes just to support Hadoop ReduceFunctions.
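
For reference, the driver side essentially pulls the Hadoop hooks out of the JobConf with the standard getters and wraps them, roughly like this (the surrounding driver method is simplified; only the JobConf getters are actual Hadoop API):

{code:java}
// Sketch of what the HadoopReduceDriver extracts from the JobConf. Only the
// JobConf getters are standard Hadoop API; the wrapping around them is our code.
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

void setupFromJobConf(JobConf jobConf) {
  RawComparator sortComparator  = jobConf.getOutputKeyComparator();              // sort order
  RawComparator groupComparator = jobConf.getOutputValueGroupingComparator();    // grouping
  Class<? extends Partitioner> partitionerClass = jobConf.getPartitionerClass(); // partitioning

  // The comparators are wrapped and handed to the sorter and the grouping logic
  // of the driver; the partitioner is applied similarly to the KeySelector path.
}
{code}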

A few things haven't been done yet, such as proper resource allocation (the sorter always takes 16 MB or so), and combiners do not work either (I guess we need the custom sorter and comparator there as well; this should be checked). The code needs to be cleaned up as well, but it compiles and all tests pass.

I pushed my code into my repo at: [https://github.com/fhueske/incubator-flink/commit/f8c5ff07936580d529f52a8c7907f3f1b5b8229b].


I would like to get some feedback on the prototype. Should I continue or do some things differently?

If we do not support these custom partitioners, sorters, and groupers, I think we do not need to continue with the HadoopCompatibility mode, because any MR job with custom data types will fail...

[~StephanEwen]: Any comments?
Artem: Can you check which sorters and groupers are used for combining within the Hadoop code? I saw something about a special CombineGrouper in the JobConf.
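
As a starting point, these are the combine-related hooks I would dump from the JobConf; whether the combine phase reuses {{getOutputKeyComparator()}} or a dedicated combiner grouping comparator is exactly what needs to be verified (the sketch only uses getters that definitely exist):

{code:java}
// Quick way to inspect the combine-related hooks of a JobConf; only standard
// org.apache.hadoop.mapred.JobConf getters are used. Whether Hadoop's combine
// phase uses getOutputKeyComparator() or a dedicated combiner grouping
// comparator is what needs to be checked in the Hadoop code.
import org.apache.hadoop.mapred.JobConf;

void inspectCombineHooks(JobConf jobConf) {
  System.out.println("combiner class:      " + jobConf.getCombinerClass());
  System.out.println("sort comparator:     " + jobConf.getOutputKeyComparator());
  System.out.println("grouping comparator: " + jobConf.getOutputValueGroupingComparator());
}
{code}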

> GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-838
>                 URL: https://issues.apache.org/jira/browse/FLINK-838
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis' progress with implementing a full Hadoop Compatibility Layer for Stratosphere.
> Some documentation can be found in the Wiki: https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal: https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context, and the mapping of Hadoop's Configuration to Stratosphere's. By successfully bridging the Hadoop tasks with Stratosphere, we already cover the most basic Hadoop jobs. This can be verified by running some popular Hadoop examples on Stratosphere (e.g. WordCount, k-means, join). (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. the command line interface) for the wrapper. Implement how the user will run them. (1 - 2 weeks)
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop interfaces (Comparators, Partitioners, Distributed Cache, etc.). There are quite a few interfaces and it will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation for the code, write a README with care, and add more unit tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature, 
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.2#6252)
