flink-dev mailing list archives

From Anirvan BASU <anirvan.b...@inria.fr>
Subject Re: [stratosphere-dev] Re: Project for GSoC
Date Wed, 27 Aug 2014 14:48:20 GMT
Hello Artem, Fabian and Robert,

Thank you for your valuable inputs.

Here is what we are trying to do currently (scratching the surface):

Consider typical Hadoop commands of the form:
hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -D stream.num.map.output.key.fields=2 \
  -D mapred.text.key.partitioner.options=-k1,1 \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -input stocks/input/stocks.txt \
  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
Here we run a hadoop job with some parameters and keys specified.
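For readers puzzling over those options: stream.num.map.output.key.fields=2 makes the first two tab-separated fields of each map output line the key, and KeyFieldBasedPartitioner then partitions on a subset of those key fields only. A rough, self-contained Python sketch of that behaviour (the field layout and the hash are illustrative stand-ins; Hadoop itself uses Java's String.hashCode):

```python
def partition_on_first_field(key, num_reducers):
    """Choose a reducer for a record by hashing only the first
    tab-separated key field, mimicking what KeyFieldBasedPartitioner
    does with options like -k1,1.  Illustrative sketch only."""
    first_field = key.split("\t")[0]
    h = 0
    for ch in first_field:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF  # simple non-negative rolling hash
    return h % num_reducers
```

The effect is that two map outputs whose keys share the same first field (say, the same stock symbol) always reach the same reducer, even though the full key has two fields.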

hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /ngrams/input -output /ngrams/output \
  -mapper ./python/mapper.py -file ./python/mapper.py \
  -combiner ./python/reducer.py \
  -reducer ./python/reducer.py -file ./python/reducer.py \
  -jobconf stream.num.map.output.key.fields=3 \
  -jobconf stream.num.reduce.output.key.fields=3 \
  -jobconf mapred.reduce.tasks=10
Here we run a hadoop job with some parameters specified and external scripts for the
mapper, combiner, and reducer (in Python - I cite this example since you have developed
a Python connector).
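As a concrete toy version of such a pipeline, here is what a mapper.py / reducer.py pair might contain for a word count. The word-count logic is my own illustration, not the actual scripts; Hadoop Streaming only requires that each script read lines on stdin and write tab-separated key/value lines on stdout:

```python
from itertools import groupby

def mapper(lines):
    """mapper.py logic: emit one 'word<TAB>1' line per word,
    the key/value shape Hadoop Streaming expects on stdout."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    """reducer.py logic: sum the counts per key.  Hadoop Streaming
    hands the reducer its input sorted by key, so lines with the
    same key are consecutive and groupby can collect them."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(c) for _, c in group))

if __name__ == "__main__":
    # Simulate the map -> shuffle/sort -> reduce cycle locally.
    sample = ["to be or not to be"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

A handy smoke test before submitting the streaming job is to run the same cycle in the shell: cat input.txt | ./mapper.py | sort | ./reducer.py.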

The above are slightly more complicated than "hello world"-type wordcount examples, aren't they?

Hence, we would like to know how to wrap and run the above jobs (if possible) using Flink.
At which level of compatibility would you classify the above jobs?

If possible, could you help us by:
- showing how to "wrap" around such commands?
- listing the dependencies, sources, and classes (to reference in the pom.xml if required
to make a package, or to import if required to make a jar file)?

Honestly, after drowning in all the information, we are trying to understand the (Flink) elephant
like the blind men in the parable. We understand it has legs, a trunk, etc ... but don't understand
where to start to mount and ride it.

Happy to clarify again & again until we reach success.

Thank you for your understanding and all your kind help,
Anirvan

----- Original Message -----
> From: "Fabian Hueske" <fhueske@apache.org>
> To: dev@flink.incubator.apache.org
> Sent: Wednesday, August 27, 2014 10:04:18 AM
> Subject: Re: [stratosphere-dev] Re: Project for GSoC
> 
> Hi Anirvan,
> 
> we have a JIRA that tracks the HadoopCompatibility feature:
> https://issues.apache.org/jira/browse/FLINK-838
> The basic mechanism is done, but has not been merged because we found a
> cleaner way to integrate the feature.
> 
> There are different levels of Hadoop Compatibility and I did not fully
> understand which kind of compatibility you need.
> 
> Conceptual Compatibility:
> Flink already offers UDF interfaces which are very similar to Hadoop's Map,
> Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce). These
> interfaces are not source-code compatible, but porting the code should be
> trivial. This is already there. If you're fine with porting your code from
> Hadoop to Flink (which should be a very small effort) you're good to go.
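The porting step described above is mostly a change of function shape: a Hadoop map(key, value) becomes a FlatMap over records, and a reduce(key, values) becomes a GroupReduce over a grouped iterator. A plain-Python sketch of that correspondence (all names invented for illustration; this is not the actual Flink or Hadoop API):

```python
from itertools import groupby

# Hadoop-style UDFs: map(key, value) and reduce(key, values),
# each emitting zero or more (key, value) pairs.
def hadoop_map(key, value):
    for word in value.split():
        yield (word, 1)

def hadoop_reduce(key, values):
    yield (key, sum(values))

# Flink-style shapes: a FlatMap sees one record, a GroupReduce
# sees an iterator of records sharing a key.  Wrapping the
# Hadoop UDFs into these shapes is mostly mechanical:
def flat_map(record):
    key, value = record
    return hadoop_map(key, value)

def group_reduce(records):
    records = list(records)
    key = records[0][0]
    return hadoop_reduce(key, (v for _, v in records))

def run(dataset):
    """Tiny local stand-in for the dataflow: flatMap, group by key,
    then groupReduce each group."""
    mapped = sorted((out for rec in dataset for out in flat_map(rec)),
                    key=lambda kv: kv[0])
    result = []
    for _, group in groupby(mapped, key=lambda kv: kv[0]):
        result.extend(group_reduce(group))
    return result
```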
> 
> Function-Level-Source Compatibility:
> Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> programs is not difficult (since they are very similar to their Flink
> versions). This is not yet included in the codebase but could be added at
> rather short notice. We do already have interface compatibility for Hadoop
> Input- and OutputFormats. So you can use these without changing the code.
> 
> Job-Level-Source Compatibility:
> This is what we aimed for with the GSoC project. Here we want to support
> the execution of a full Hadoop MR Job in Flink without changing the code.
> However, this turned out to be a bit tricky if custom partitioners, sort
> comparators, and grouping comparators are used in the job. Adding this
> feature will take a bit more time.
> 
> Function- and Job-Level-Source compatibility will enable the use of already
> existing Hadoop code. If you are implementing new analysis jobs anyways,
> I'd go for a Flink implementation which eases many things such as secondary
> sort, unions of multiple inputs, etc.
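As an illustration of the last point, take secondary sort: in plain MapReduce it requires a composite key, a custom partitioner, and a grouping comparator, while conceptually it is just "within each group, order the values by a second field". A plain-Python sketch of the pattern (names invented for illustration):

```python
from itertools import groupby

def secondary_sort(records):
    """Group (primary_key, secondary_key, value) triples by the
    primary key and deliver each group's values ordered by the
    secondary key -- the pattern that needs composite keys plus
    custom comparators when hand-built on plain MapReduce."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    return [(key, [value for _, _, value in group])
            for key, group in groupby(ordered, key=lambda r: r[0])]
```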
> 
> Cheers,
> Fabian
> 
> 
> 
> 
> 2014-08-26 11:29 GMT+02:00 Robert Metzger <rmetzger@apache.org>:
> 
> > Hi Anirvan,
> >
> > I'm forwarding this message to dev@flink.incubator.apache.org. You need to
> > send an (empty) message to dev-subscribe@flink.incubator.apache.org to
> > subscribe to the dev list.
> > The dev@ list is for discussions with the developers, planning etc. The
> > user@flink.incubator.apache.org list is for user questions (for example
> > troubles using the API, conceptual questions etc.)
> > I think the message below is more suited for the dev@ list, since it's
> > basically a feature request.
> >
> > Regarding the names: We don't use Stratosphere anymore. Our codebase has
> > been renamed to Flink and the "org.apache.flink" namespace. So hopefully
> > the naming confusion is finally resolved.
> >
> > For those who want to have a look into the history of the message, see the
> > Google Groups archive here:
> > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> >
> > ---------- Forwarded message ----------
> > From: Nirvanesque <nirvanesque.paris@gmail.com>
> > Date: Tue, Aug 26, 2014 at 11:12 AM
> > Subject: [stratosphere-dev] Re: Project for GSoC
> > To: stratosphere-dev@googlegroups.com
> > Cc: tsekis79@gmail.com
> >
> >
> > Hello Artem and mentors,
> >
> > First of all nice greetings from INRIA, France.
> > Hope you had an enjoyable experience in GSOC!
> > Thanks to Robert (rmetzger) for forwarding me here ...
> >
> > At INRIA, we are starting to adopt Stratosphere / Flink.
> > The top-level goal is to enhance the performance of long workflows of
> > User Defined Functions (UDFs) using multiple M-R stages, by using the
> > larger set of Second Order Functions (SOFs) in Stratosphere / Flink.
> > We will demonstrate this improvement by implementing some Use Cases for
> > business purposes.
> > For this purpose, we have chosen some customer analysis Use Cases using
> > weblogs and related data, for 2 companies (who appeared interested in
> > trying Stratosphere / Flink):
> > - a mobile phone app developer: http://www.tribeflame.com
> > - an anti-virus & Internet security software company: www.f-secure.com
> > I will be happy to share with you these Use Cases, if you are interested.
> > Just ask me here.
> >
> > At present, we are typically in the profiles of Alice-Bob-Sam, as described
> > in your GSoC proposal
> > <https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
> > :-)
> > Hadoop seems to be the starting square for the Stratosphere / Flink
> > journey.
> > Same is the situation with developers in the above 2 companies :-)
> >
> > Briefly,
> > We have installed and run some example programmes from Flink / Stratosphere
> > (versions 0.5.2 and 0.6). We use a cluster (Grid'5000) for our Hadoop &
> > Stratosphere installations.
> > We have a good understanding of Hadoop and its use in Streaming and
> > Pipes in conjunction with scripting languages (Python & R specifically).
> > In the first phase, we would like to run some "Hadoop-like" jobs (mainly
> > multiple M-R workflows) on Stratosphere, preferably without extensive Java
> > or Scala programming.
> > I refer to your GSoC project map
> > <https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>,
> > which seems very interesting.
> > If we could have a Hadoop abstraction as you have mentioned, that would be
> > ideal for our first phase.
> > In later phases, when we implement complex join and group operations, we
> > would dive deeper into Stratosphere / Flink Java or Scala APIs
> >
> > Hence, I would like to know: what is the current status in this direction?
> > What has been implemented already? From which version onwards? How can we
> > try it out?
> > What is yet to be implemented? When, and in which versions?
> >
> > You may also like to see my discussion with Robert on this page:
> > <http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.
> >
> > I am still mining in different discussions - here as well as on JIRA.
> > Please do refer me to the relevant links, JIRA tickets, etc if that saves
> > your time in re-typing large replies.
> > It will also help us to understand the train of collective thinking in the
> > Stratosphere / Flink roadmap.
> >
> > Thanks in advance,
> > Anirvan
> > PS: Apologies for mixing the old and new names (e.g. Flink /
> > Stratosphere), as I am not sure which name exactly to use currently.
> >
> >
> > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis wrote:
> > >
> > > Hello Fabian,
> > >
> > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, fhu...@gmail.com wrote:
> > > > Hi Artem,
> > > >
> > > > thanks a lot for your interest in Stratosphere and participating in our
> > > GSoC projects!
> > > >
> > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > > jungle and widely adopted. Therefore, a Hadoop compatibility layer is
> > > > a very important feature for any large-scale data processing system.
> > > > Stratosphere builds on foundations of MapReduce but generalizes its
> > > concepts and provides a more efficient runtime.
> > >
> > > Great!
> > >
> > > > When you have a look at the Stratosphere WordCount example program, you
> > > > will see that the programming principles of Stratosphere and Hadoop
> > > > MapReduce are quite similar, although Stratosphere is not compatible
> > > > with the Hadoop interfaces.
> > >
> > > Yes, I've looked into the examples (WordCount, k-means). I also ran the
> > > big test job locally and it seems to be OK.
> > >
> > > > With the proposed project we want to achieve that Hadoop MapReduce
> > > > jobs can be executed on Stratosphere without changing a line of code
> > > > (if possible).
> > > >
> > > > We have already some pieces for that in place. InputFormats are done
> > > > (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility),
> > > > OutputFormats are work in progress. The biggest missing piece is
> > > > executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop provides
> > > > quite a few interfaces (e.g., overwriting partitioning function and
> > > > sorting comparators, counters, distributed cache, ...). It would of
> > > > course be desirable to support as many of these interfaces as possible,
> > > > but they can be added step-by-step once the first Hadoop jobs are
> > > > running on Stratosphere.
> > >
> > > So if I understand correctly, the idea is to create logical wrappers for
> > > all interfaces used by Hadoop jobs (the way it has been done with the
> > > Hadoop datatypes) so that they can run as transparently as possible on
> > > Stratosphere in an efficient way. I agree, there are many interfaces,
> > > but it's very interesting considering the way Stratosphere defines tasks,
> > > which is a bit different (though, as you said, the principle is similar).
> > >
> > > I assume the focus is on the YARN version of Hadoop (new API)?
> > >
> > > And one last question: serialization for Stratosphere is Java's default
> > > mechanism, right?
> > >
> > > >
> > > > Regarding your question about cloud deployment scripts, one of our team
> > > > members is currently working on this (see this thread:
> > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > > > I am not sure if this is still in the making or already done. If you
> > > > are interested in this as well, just drop a line to the thread. Although
> > > > I am not very familiar with the details of this, my gut feeling is that
> > > > this would be a bit too little for an individual project. However, there
> > > > might be ways to extend it. So if you have any ideas, share them with us
> > > > and we will be happy to discuss them.
> > >
> > > Thank you for pointing out the topic. I will let you know if I come up
> > > with anything for this, probably after I try deploying it on OpenStack.
> > >
> > > >
> > > > Again, thanks a lot for your interest, and please don't hesitate to
> > > > ask questions. :-)
> > >
> > > Thank you for the helpful answers.
> > >
> > > Kind regards,
> > > Artem
> > >
> > >
> > > >
> > > > Best,
> > > > Fabian
> > > >
> > > >
> > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, tsek...@gmail.com wrote:
> > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > Hello!
> > > > I'm Artem, an undergraduate student from Athens, Greece. You can find me
> > > > on github (https://github.com/atsikiridis) and occasionally on
> > > > stackoverflow (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > Currently, however, I'm in Switzerland, where I am doing my internship
> > > > at CERN as a back-end software developer for INSPIRE, a library for High
> > > > Energy Physics (we're running on http://inspirehep.net/). The service is
> > > > in Python (based on the open-source project http://invenio.net) and my
> > > > responsibilities are mostly the integration with Redis, database
> > > > abstractions, testing (unit, regression), and helping our team to
> > > > integrate modern technologies and frameworks into the current code base.
> > > > Moreover, I am very interested in big data technologies; therefore,
> > > > before coming to CERN I had been trying to make my first steps in
> > > > research at the Big Data lab of AUEB, my home university. The main
> > > > objective of the project I was involved with was the implementation of
> > > > a dynamic caching mechanism for Hadoop (in a way, trying our cache
> > > > instead of the built-in distributed cache). Other techs involved were
> > > > Redis, Memcached, Ehcache (Terracotta). With this project we gained
> > > > some insights about the internals of Hadoop (new API, old API, how
> > > > tasks work, Hadoop serialization, the daemons running, etc.) and HDFS,
> > > > and deployed clusters on cloud computing platforms (OpenStack with
> > > > Nova, Amazon EC2 with boto). We also used the Java Remote API for some
> > > > tests.
> > > > Unfortunately, I have not used Stratosphere before in a research/prod
> > > > environment. I have only played with the examples on my local machine.
> > > > It is very interesting and I would love to learn more.
> > > > There will probably be a learning curve for me on the Stratosphere side,
> > > > but implementing a Hadoop Compatibility Layer seems like a very
> > > > interesting project and I believe I can be of use :)
> > > > Finally, I was wondering whether there are some command-line tools for
> > > > deploying Stratosphere automatically to EC2 or OpenStack clouds (for
> > > > example, Stratosphere-specific abstractions on top of the Python boto
> > > > API). Do you think that would make sense as a project?
> > > > Pardon me for the length of this.
> > > > Kind regards,
> > > > Artem
> >
> >
> 
