flink-dev mailing list archives

From Artem Tsikiridis <tseki...@gmail.com>
Subject Re: [stratosphere-dev] Re: Project for GSoC
Date Wed, 27 Aug 2014 11:00:58 GMT
Hi Anirvan,

Thank you very much for your interest.

I would just like to add to what Fabian has said: if you plan to use
Job-Level-Source Compatibility, we have so far focused on the mapred API
(there is also mapreduce). In the code that should be merged shortly, you
will be able to run jobs that consist of Mappers and Reducers and to
configure these tasks with a JobConf (meaning you can adjust parallelism
and use the DistributedCache). Here is an example:
https://github.com/atsikiridis/incubator-flink/blob/hadoop-docs/docs/hadoop_compatability.md
(it has not been merged yet)
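For readers who have not used the mapred API: the driver programs Artem refers to set a Mapper, a Reducer, and tuning knobs (such as the number of reduce tasks) on a JobConf. The sketch below imitates that shape with simplified stand-in types — MiniMapper, MiniReducer, and MiniJobConf are invented here, not the real org.apache.hadoop.mapred classes — so it runs without any Hadoop dependency and only illustrates the job structure the compatibility layer has to reproduce.

```java
import java.util.*;
import java.util.function.BiConsumer;

// Stand-in for org.apache.hadoop.mapred.Mapper (hypothetical, simplified).
interface MiniMapper {
    void map(String key, String value, BiConsumer<String, Integer> collector);
}

// Stand-in for org.apache.hadoop.mapred.Reducer (hypothetical, simplified).
interface MiniReducer {
    void reduce(String key, Iterator<Integer> values, BiConsumer<String, Integer> collector);
}

// Stand-in for JobConf: holds the classes and tuning knobs a driver would set.
class MiniJobConf {
    MiniMapper mapper;
    MiniReducer reducer;
    int numReduceTasks = 1;  // parallelism knob, like conf.setNumReduceTasks(n)
}

public class WordCountDriver {
    public static Map<String, Integer> run(MiniJobConf conf, List<String> lines) {
        // "Map" phase: run the mapper over every input record.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            conf.mapper.map(null, line, (k, v) ->
                shuffled.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // "Reduce" phase (numReduceTasks is ignored in this single-threaded sketch).
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            conf.reducer.reduce(e.getKey(), e.getValue().iterator(),
                (k, v) -> result.put(k, v));
        }
        return result;
    }

    public static void main(String[] args) {
        MiniJobConf conf = new MiniJobConf();
        conf.mapper = (key, value, out) -> {
            for (String w : value.toLowerCase().split("\\s+")) out.accept(w, 1);
        };
        conf.reducer = (key, values, out) -> {
            int sum = 0;
            while (values.hasNext()) sum += values.next();
            out.accept(key, sum);
        };
        conf.numReduceTasks = 2;
        System.out.println(run(conf, Arrays.asList("to be", "or not to be")));
    }
}
```

The point of Job-Level compatibility is that a driver of this shape, written against the real mapred classes, would run on Flink unchanged.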

Cheers,
Artem


On Wed, Aug 27, 2014 at 10:04 AM, Fabian Hueske <fhueske@apache.org> wrote:

> Hi Anirvan,
>
> we have a JIRA that tracks the HadoopCompatibility feature:
> https://issues.apache.org/jira/browse/FLINK-838
> The basic mechanism is done, but has not been merged because we found a
> cleaner way to integrate the feature.
>
> There are different levels of Hadoop Compatibility and I did not fully
> understand which kind of compatibility you need.
>
> Conceptual Compatibility:
> Flink already offers UDF interfaces which are very similar to Hadoop's Map,
> Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce). These
> interfaces are not source-code compatible, but porting the code should be
> trivial. This is already there. If you're fine with porting your code from
> Hadoop to Flink (which should be a very small effort) you're good to go.
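To illustrate how small that porting effort is, here is a sketch of the same tokenizer written first in Hadoop's (key, value, collector) style and then in a Flink-style flatMap. The interfaces below (Collector, FlatMapLike) are simplified stand-ins defined locally so the example is self-contained; they only approximate the real Flink FlatMapFunction and Collector types.

```java
import java.util.*;
import java.util.function.BiConsumer;

// Simplified local stand-ins, not the real Flink API.
interface Collector<T> { void collect(T record); }
interface FlatMapLike<IN, OUT> { void flatMap(IN value, Collector<OUT> out); }

public class PortingExample {
    // Hadoop-style tokenizer: map(key, value, collector).
    static void hadoopStyleMap(Long key, String value, BiConsumer<String, Integer> out) {
        for (String word : value.split("\\s+")) out.accept(word, 1);
    }

    // The same logic ported to a Flink-style FlatMap: only the signature changes.
    static final FlatMapLike<String, Map.Entry<String, Integer>> flinkStyleMap =
        (value, out) -> {
            for (String word : value.split("\\s+"))
                out.collect(Map.entry(word, 1));
        };

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> collected = new ArrayList<>();
        flinkStyleMap.flatMap("hello flink hello", collected::add);
        System.out.println(collected);
    }
}
```

The body of the function is untouched; only the calling convention around it changes, which is what "porting should be trivial" means in practice.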
>
> Function-Level-Source Compatibility:
> Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> programs is not difficult (since they are very similar to their Flink
> versions). This is not yet included in the codebase but could be added at
> rather short notice. We do already have interface compatibility for Hadoop
> Input- and OutputFormats. So you can use these without changing the code.
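A minimal sketch of the wrapper idea described here: an adapter that makes a Hadoop-style Mapper callable as a Flink-style FlatMap function. All interfaces in this sketch are simplified local stand-ins; the real wrapper classes in the hadoop-compatibility add-on may differ in naming and detail.

```java
import java.util.*;

// Simplified stand-ins for the Hadoop side...
interface OutputCollector<K, V> { void collect(K key, V value); }
interface HadoopMapper<KI, VI, KO, VO> { void map(KI k, VI v, OutputCollector<KO, VO> out); }
// ...and for the Flink side.
interface FlinkCollector<T> { void collect(T record); }
interface FlatMapFunctionLike<IN, OUT> { void flatMap(IN value, FlinkCollector<OUT> out); }

// The wrapper: presents a Hadoop Mapper to the Flink runtime as a FlatMap.
class HadoopMapperWrapper<KI, VI, KO, VO>
        implements FlatMapFunctionLike<Map.Entry<KI, VI>, Map.Entry<KO, VO>> {
    private final HadoopMapper<KI, VI, KO, VO> mapper;
    HadoopMapperWrapper(HadoopMapper<KI, VI, KO, VO> mapper) { this.mapper = mapper; }

    @Override
    public void flatMap(Map.Entry<KI, VI> record, FlinkCollector<Map.Entry<KO, VO>> out) {
        // Bridge Hadoop's (key, value, collector) calling convention to
        // Flink's single-record-plus-collector convention.
        mapper.map(record.getKey(), record.getValue(),
                   (k, v) -> out.collect(Map.entry(k, v)));
    }
}

public class WrapperDemo {
    public static void main(String[] args) {
        HadoopMapper<Long, String, String, Integer> tokenizer = (key, line, out) -> {
            for (String w : line.split("\\s+")) out.collect(w, 1);
        };
        List<Map.Entry<String, Integer>> collected = new ArrayList<>();
        new HadoopMapperWrapper<>(tokenizer)
            .flatMap(Map.entry(0L, "wrap me please"), collected::add);
        System.out.println(collected);
    }
}
```

Because both sides are "a function plus a collector", the adapter is a few lines, which is why this level of compatibility could be added at short notice.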
>
> Job-Level-Source Compatibility:
> This is what we aimed for with the GSoC project. Here we want to support
> the execution of a full Hadoop MR Job in Flink without changing the code.
> However, this turned out to be a bit tricky if custom partitioners, sort,
> and grouping comparators are used in the job. Adding this feature will take
> a bit more time.
>
> Function- and Job-Level-Source compatibility will enable the use of
> already existing Hadoop code. If you are implementing new analysis jobs
> anyway, I'd go for a Flink implementation, which eases many things such as
> secondary sort, unions of multiple inputs, etc.
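For readers unfamiliar with the term: "secondary sort" means that within each key group, records arrive sorted by a secondary field. In Hadoop MR this requires composite keys plus custom partitioner and grouping comparators; Flink exposes it directly on grouped data. The plain-Java sketch below (no Flink dependency, just an illustration of the semantics) shows what the result looks like.

```java
import java.util.*;
import java.util.stream.*;

public class SecondarySort {
    record Event(String user, long timestamp, String action) {}

    // Group events by user, and sort each user's group by timestamp:
    // the "secondary sort" semantics that Flink provides on grouped data.
    static Map<String, List<Event>> groupAndSort(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(
                Event::user,
                TreeMap::new,
                Collectors.collectingAndThen(Collectors.toList(), list -> {
                    list.sort(Comparator.comparingLong(Event::timestamp)); // secondary order
                    return list;
                })));
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("alice", 30, "logout"),
            new Event("bob",   10, "login"),
            new Event("alice", 5,  "login"));
        // Each user's events come out time-ordered.
        groupAndSort(events).forEach((u, es) -> System.out.println(u + " -> " + es));
    }
}
```

Getting this behavior in plain Hadoop MR is exactly the composite-key machinery that makes Job-Level compatibility tricky, as noted above.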
>
> Cheers,
> Fabian
>
>
>
>
> 2014-08-26 11:29 GMT+02:00 Robert Metzger <rmetzger@apache.org>:
>
> > Hi Anirvan,
> >
> > I'm forwarding this message to dev@flink.incubator.apache.org. You need
> > to send an (empty) message to dev-subscribe@flink.incubator.apache.org
> > to subscribe to the dev list.
> > The dev@ list is for discussions with the developers, planning, etc. The
> > user@flink.incubator.apache.org list is for user questions (for example,
> > troubles using the API, conceptual questions, etc.)
> > I think the message below is more suited for the dev@ list, since it's
> > basically a feature request.
> >
> > Regarding the names: we don't use Stratosphere anymore. Our codebase has
> > been renamed to Flink and the "org.apache.flink" namespace, so ideally
> > this naming confusion is finally resolved.
> >
> > For those who want to have a look into the history of the message, see
> > the Google Groups archive here:
> > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> >
> > ---------- Forwarded message ----------
> > From: Nirvanesque <nirvanesque.paris@gmail.com>
> > Date: Tue, Aug 26, 2014 at 11:12 AM
> > Subject: [stratosphere-dev] Re: Project for GSoC
> > To: stratosphere-dev@googlegroups.com
> > Cc: tsekis79@gmail.com
> >
> >
> > Hello Artem and mentors,
> >
> > First of all nice greetings from INRIA, France.
> > Hope you had an enjoyable experience in GSOC!
> > Thanks to Robert (rmetzger) for forwarding me here ...
> >
> > At INRIA, we are starting to adopt Stratosphere / Flink.
> > The top-level goal is to enhance the performance of long workflows of
> > User Defined Functions (UDFs) that use multiple M-R steps, by using the
> > larger set of Second Order Functions (SOFs) in Stratosphere / Flink.
> > We will demonstrate this improvement by implementing some Use Cases for
> > business purposes.
> > For this purpose, we have chosen some customer analysis Use Cases using
> > weblogs and related data, for 2 companies (who appeared interested in
> > trying Stratosphere / Flink):
> > - a mobile phone app developer: http://www.tribeflame.com
> > - an anti-virus & Internet security software company: www.f-secure.com
> > I will be happy to share these Use Cases with you, if you are interested.
> > Just ask me here.
> >
> > At present, we are typically in the profiles of Alice-Bob-Sam, as
> > described in your GSoC proposal
> > <https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
> > :-)
> > Hadoop seems to be the starting square for the Stratosphere / Flink
> > journey.
> > The same is true for the developers in the above 2 companies :-)
> >
> > Briefly,
> > We have installed and run some example programs from Flink /
> > Stratosphere (versions 0.5.2 and 0.6). We use a cluster (the grid5000)
> > for our Hadoop & Stratosphere installations.
> > We have a good understanding of Hadoop and its use in Streaming and
> > Pipes in conjunction with scripting languages (Python & R specifically).
> > In the first phase, we would like to run some "Hadoop-like" jobs (mainly
> > multiple M-R workflows) on Stratosphere, preferably without extensive
> > Java or Scala programming.
> > I refer to your GSoC project map
> > <https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>,
> > which seems very interesting.
> > If we could have a Hadoop abstraction as you have mentioned, that would
> > be ideal for our first phase.
> > In later phases, when we implement complex join and group operations, we
> > would dive deeper into the Stratosphere / Flink Java or Scala APIs.
> >
> > Hence, I would like to know: what is the current status in this
> > direction? What has been implemented already? From which version
> > onwards? How can we try it?
> > What is yet to be implemented? When, and in which versions?
> >
> > You may also like to see my discussion with Robert on this page
> > <http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.
> >
> > I am still mining different discussions - here as well as on JIRA.
> > Please do refer me to the relevant links, JIRA tickets, etc. if that
> > saves you the time of re-typing large replies.
> > It will also help us to understand the train of collective thinking in
> > the Stratosphere / Flink roadmap.
> >
> > Thanks in advance,
> > Anirvan
> > PS: Apologies for mixing the names (e.g. Flink / Stratosphere) as I am
> > not sure which name exactly to use currently.
> >
> >
> > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis wrote:
> > >
> > > Hello Fabian,
> > >
> > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com
> > > wrote:
> > > > Hi Artem,
> > > >
> > > > thanks a lot for your interest in Stratosphere and participating in
> > > > our GSoC projects!
> > > >
> > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > > jungle and is widely adopted. Therefore, a Hadoop compatibility layer
> > > > is a very important feature for any large-scale data processing
> > > > system.
> > > > Stratosphere builds on the foundations of MapReduce but generalizes
> > > > its concepts and provides a more efficient runtime.
> > >
> > > Great!
> > >
> > > > When you have a look at the Stratosphere WordCount example program,
> > > > you will see that the programming principles of Stratosphere and
> > > > Hadoop MapReduce are quite similar, although Stratosphere is not
> > > > compatible with the Hadoop interfaces.
> > >
> > > Yes, I've looked into the examples (WordCount, k-means). I also ran
> > > the big test job you have locally and it seems to be OK.
> > >
> > > > With the proposed project we want to achieve that Hadoop MapReduce
> > > > jobs can be executed on Stratosphere without changing a line of code
> > > > (if possible).
> > > >
> > > > We already have some pieces for that in place. InputFormats are done
> > > > (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility),
> > > > OutputFormats are work in progress. The biggest missing piece is
> > > > executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop
> > > > provides quite a few interfaces (e.g., overwriting the partitioning
> > > > function and sorting comparators, counters, distributed cache, ...).
> > > > It would of course be desirable to support as many of these
> > > > interfaces as possible, but they can be added step-by-step once the
> > > > first Hadoop jobs are running on Stratosphere.
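One of the hooks mentioned here, the partitioning function, is worth a concrete illustration: Hadoop lets a job override getPartition(key, value, numPartitions) to control which reduce task sees each key, and a compatibility layer must honor such overrides to reproduce a job's exact output partitioning. The interface below is a simplified stand-in, not the real org.apache.hadoop.mapred.Partitioner, so the sketch runs without Hadoop.

```java
import java.util.*;

// Simplified stand-in for Hadoop's Partitioner interface.
interface MiniPartitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

public class PartitionerDemo {
    // Default behavior: hash partitioning, in the spirit of Hadoop's HashPartitioner.
    static final MiniPartitioner<String, Integer> hash =
        (k, v, n) -> (k.hashCode() & Integer.MAX_VALUE) % n;

    // A custom override: route keys by their first letter.
    static final MiniPartitioner<String, Integer> byFirstLetter =
        (k, v, n) -> (Character.toLowerCase(k.charAt(0)) - 'a') % n;

    // Simulate the shuffle: assign each key to a partition (reduce task).
    static Map<Integer, List<String>> shuffle(List<String> keys,
                                              MiniPartitioner<String, Integer> p, int n) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String k : keys)
            partitions.computeIfAbsent(p.getPartition(k, 1, n),
                                       i -> new ArrayList<>()).add(k);
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("apple", "avocado", "banana"), byFirstLetter, 4));
    }
}
```

A job that relies on byFirstLetter-style routing produces different output files than one using the default hash, which is why a faithful compatibility layer cannot simply ignore these overrides.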
> > >
> > > So if I understand correctly, the idea is to create logical wrappers
> > > for all interfaces used by Hadoop jobs (the way it has been done with
> > > the Hadoop datatypes) so a job can run as transparently as possible on
> > > Stratosphere in an efficient way. I agree, there are many interfaces,
> > > but it's very interesting considering the way Stratosphere defines
> > > tasks, which is a bit different (though, as you said, the principle is
> > > similar).
> > >
> > > I assume the focus is on the YARN version of Hadoop (new API)?
> > >
> > > And one last question: serialization for Stratosphere is Java's
> > > default mechanism, right?
> > >
> > > >
> > > > Regarding your question about cloud deployment scripts, one of our
> > > > team members is currently working on this (see this thread:
> > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > > > I am not sure if this is still in the making or already done. If you
> > > > are interested in this as well, just drop a line to the thread.
> > > > Although I am not very familiar with the details of this, my gut
> > > > feeling is that it would be a bit too little for an individual
> > > > project. However, there might be ways to extend this. So if you have
> > > > any ideas, share them with us and we will be happy to discuss them.
> > >
> > > Thank you for pointing out the topic. I will let you know if I come up
> > > with anything for this. Probably after I try deploying it on OpenStack.
> > >
> > > >
> > > > Again, thanks a lot for your interest and please don't hesitate to
> ask
> > > questions. :-)
> > >
> > > Thank you for the helpful answers.
> > >
> > > Kind regards,
> > > Artem
> > >
> > >
> > > >
> > > > Best,
> > > > Fabian
> > > >
> > > >
> > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, tsek...@gmail.com
> > > > wrote:
> > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > Hello!
> > > > I'm Artem, an undergraduate student from Athens, Greece. You can
> > > > find me on GitHub (https://github.com/atsikiridis) and occasionally
> > > > on Stack Overflow
> > > > (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > Currently, however, I'm in Switzerland where I am doing my internship
> > > > at CERN as a back-end software developer for INSPIRE, a library for
> > > > High Energy Physics (we're running on http://inspirehep.net/). The
> > > > service is in Python (based on the open-source project
> > > > http://invenio.net) and my responsibilities are mostly the
> > > > integration with Redis, database abstractions, testing (unit,
> > > > regression) and helping our team integrate modern technologies and
> > > > frameworks into the current code base.
> > > > Moreover, I am very interested in big data technologies; therefore,
> > > > before coming to CERN I had been making my first steps in research
> > > > at the Big Data lab of AUEB, my home university. The main objective
> > > > of the project I was involved with was the implementation of a
> > > > dynamic caching mechanism for Hadoop (in a way, trying our cache
> > > > instead of the built-in distributed cache). Other technologies
> > > > involved were Redis, Memcached, and Ehcache (Terracotta). With this
> > > > project we gained some insights into the internals of Hadoop (new
> > > > API, old API, how tasks work, Hadoop serialization, the daemons
> > > > running, etc.) and HDFS, and deployed clusters on cloud computing
> > > > platforms (OpenStack with Nova, Amazon EC2 with boto). We also used
> > > > the Java Remote API for some tests.
> > > > Unfortunately, I have not used Stratosphere before in a research /
> > > > production environment. I have only played with the examples on my
> > > > local machine. It is very interesting and I would love to learn more.
> > > > There will probably be a learning curve for me on the Stratosphere
> > > > side, but implementing a Hadoop Compatibility Layer seems like a
> > > > very interesting project and I believe I can be of use :)
> > > > Finally, I was wondering whether there are some command-line tools
> > > > for deploying Stratosphere automatically to EC2 or OpenStack clouds
> > > > (for example, Stratosphere-specific abstractions on top of the
> > > > Python boto API). Do you think that would make sense as a project?
> > > > Pardon me for the length of this.
> > > > Kind regards,
> > > > Artem
> >
> >  --
> > You received this message because you are subscribed to the Google Groups
> > "stratosphere-dev" group.
> > To unsubscribe from this group and stop receiving emails from it, send an
> > email to stratosphere-dev+unsubscribe@googlegroups.com.
> > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > For more options, visit https://groups.google.com/d/optout.
> >
>
