flink-issues mailing list archives

From "Artem Tsikiridis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-838) GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere
Date Wed, 30 Jul 2014 00:54:39 GMT

    [ https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078717#comment-14078717 ]

Artem Tsikiridis edited comment on FLINK-838 at 7/30/14 12:53 AM:
------------------------------------------------------------------

Hi Fabian! Nice suggestions!

1) Although it doesn't make much difference, for us it would be https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/lib/MultipleInputs.html (see the usage sketch below).

Looking into its implementation, I notice that it uses a {{DelegatingInputFormat}} and a {{DelegatingMapper}}, keeping a mapping of {input split, mapper to use} in the {{JobConf}}. Shouldn't this work out of the box just by configuring the {{Configurable}} interfaces (which we already do successfully)? I'll try a relevant test case.

2) That would be cool. It requires that the code of {{FlinkHadoopJobClient}} becomes more modular, though, as at the moment there is no clean separation of the additional steps. This should be done in any case.
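
For reference, here is a minimal sketch of how I understand the user-facing side of {{MultipleInputs}} (stable {{org.apache.hadoop.mapred}} API; the paths and the {{LogMapper}} / {{UserMapper}} classes are made-up placeholders, not code from the PR):

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultipleInputsSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultipleInputsSketch.class);
        // Each call records a (path -> InputFormat, Mapper) entry in the JobConf
        // and installs DelegatingInputFormat / DelegatingMapper, which pick the
        // right mapper for each input split at runtime.
        MultipleInputs.addInputPath(conf, new Path("/input/logs"),
                TextInputFormat.class, LogMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/input/users"),
                SequenceFileInputFormat.class, UserMapper.class);
        JobClient.runJob(conf);
    }
}
{code}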

Here's a status update:

Although some work was initially done on custom partitioning and grouping of the reducer's values, and it seemed correct in my tests, the implementation is quite flawed and inefficient, and Fabian suggested that it is not yet mature enough to be merged. To avoid repeating computations and completely discarding the partitioner's shuffling, here is a suggestion:

The {{groupBy}} call acts as a partitioner, and then we need to extend the {{GroupReduceOperator}} API (an additional operation by {{KeySelector}}?) to group values before the reducer. This way, the partitioner and the grouper of the reducer's values can be completely independent. Do you have any suggestions for that? They would be very useful.
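
To make the idea concrete, here is a rough sketch of what the extended API might look like (the {{groupValuesBy}} operation and the two key selectors are hypothetical, nothing like this exists in the API yet):

{code:java}
// Hypothetical sketch only -- groupValuesBy does not exist yet.
DataSet<Tuple2<Text, Text>> mapOutput = ...; // output of the wrapped Hadoop mapper

mapOutput
    // groupBy drives the shuffle, playing the role of the Hadoop Partitioner.
    .groupBy(new PartitionKeySelector())
    // Hypothetical extension: regroup the values within each partition by an
    // independent key (like Hadoop's grouping comparator) before the reducer runs.
    .groupValuesBy(new GroupingKeySelector())
    .reduceGroup(new HadoopReduceFunctionWrapper());
{code}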

A general overview of other things I'm doing:

* Working on code review comments in https://github.com/apache/incubator-flink/pull/37 (sorry
I squashed, that was not very bright :( )
* Wrapping up Counters and Accumulators. I have added support for more functionality in the {{JobClient}} and the {{JobConf}}. Not that many unsupported ops remain (of course there are some, which are totally Hadoop-specific). I've put a small sketch of the bridging idea below this list.
* Dealing with the custom ClassLoader. I am a bit stuck on this one (although Robert's email helped). As it is probably a minor thing I'm missing, I have set it aside for now. Generally, passing the classloader via a JVM arg does not seem to help in my case.
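
On the Counters front, the core of the bridging idea is small; here is a minimal sketch, assuming the accumulator API currently in the incubator tree ({{CounterBridge}} is a made-up name, not code from the PR):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RuntimeContext;

// Sketch: back each Hadoop counter (group, name) with a Flink LongCounter
// accumulator, so the JobClient can read the aggregated values at the end.
public class CounterBridge {
    private final RuntimeContext ctx;
    private final Map<String, LongCounter> counters = new HashMap<String, LongCounter>();

    public CounterBridge(RuntimeContext ctx) {
        this.ctx = ctx;
    }

    // Mirrors Reporter.incrCounter(String group, String counter, long amount).
    public void incrCounter(String group, String counter, long amount) {
        String name = group + "." + counter;
        LongCounter acc = counters.get(name);
        if (acc == null) {
            acc = new LongCounter();
            counters.put(name, acc);
            ctx.addAccumulator(name, acc); // registered once per task
        }
        acc.add(amount);
    }
}
{code}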

Cheers,
A.


> GSoC Summer Project: Implement full Hadoop Compatibility Layer for Stratosphere
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-838
>                 URL: https://issues.apache.org/jira/browse/FLINK-838
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: GitHub Import
>              Labels: github-import
>             Fix For: pre-apache
>
>
> This is a meta issue for tracking @atsikiridis progress with implementing a full Hadoop Compatibility Layer for Stratosphere.
> Some documentation can be found in the Wiki: https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-(Project-Map-and-Notes)
> As well as the project proposal: https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis
> Most importantly, there is the following **schedule**:
> *19 May - 27 June (Midterm)*
> 1) Work on the Hadoop tasks, their Context, and the mapping of Hadoop's Configuration to Stratosphere's. By successfully bridging the Hadoop tasks with Stratosphere, we already cover the most basic Hadoop jobs. This can be verified by running some popular Hadoop examples on Stratosphere (e.g. WordCount, k-means, join). (4 - 5 weeks)
> 2) Understand how the running of these jobs works (e.g. the command line interface) for the wrapper. Implement how the user will run them. (1 - 2 weeks)
> *27 June - 11 August*
> 1) Continue wrapping more "advanced" Hadoop interfaces (Comparators, Partitioners, Distributed Cache, etc.). There are quite a few interfaces and it will be a challenge to support all of them. (5 full weeks)
> 2) Profiling of the application and optimizations (if applicable)
> *11 August - 18 August*
> Write documentation on code, write a README with care and add more unit-tests. (1 week)
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/838
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: core, enhancement, parent-for-major-feature, 
> Milestone: Release 0.7 (unplanned)
> Created at: Tue May 20 10:11:34 CEST 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.2#6252)
