From: Robert Metzger
To: "dev@flink.incubator.apache.org"
Date: Wed, 7 Jan 2015 11:43:46 +0100
Subject: Re: svn commit: r1650029 - in /flink: _posts/2015-01-06-december-in-flink.md site/blog/index.html site/blog/page2/index.html site/blog/page3/index.html site/news/2015/ site/news/2015/01/ site/news/2015/01/06/ site/news/2015/01/06/december-in-flink.html

Just FYI, the svnpubsub for the website is currently not working.
This is the respective issue for the website migration: https://issues.apache.org/jira/browse/INFRA-8915

On Wed, Jan 7, 2015 at 11:40 AM, wrote:

> Author: ktzoumas
> Date: Wed Jan  7 10:40:31 2015
> New Revision: 1650029
>
> URL: http://svn.apache.org/r1650029
> Log:
> Added blog post - December 2014 in the Flink community
>
> Added:
>     flink/_posts/2015-01-06-december-in-flink.md
>     flink/site/news/2015/
>     flink/site/news/2015/01/
>     flink/site/news/2015/01/06/
>     flink/site/news/2015/01/06/december-in-flink.html
> Modified:
>     flink/site/blog/index.html
>     flink/site/blog/page2/index.html
>     flink/site/blog/page3/index.html
>
> Added: flink/_posts/2015-01-06-december-in-flink.md
> URL: http://svn.apache.org/viewvc/flink/_posts/2015-01-06-december-in-flink.md?rev=1650029&view=auto
> ==============================================================================
> --- flink/_posts/2015-01-06-december-in-flink.md (added)
> +++ flink/_posts/2015-01-06-december-in-flink.md Wed Jan  7 10:40:31 2015
> @@ -0,0 +1,62 @@
> +---
> +layout: post
> +title: 'December 2014 in the Flink community'
> +date: 2015-01-06 10:00:00
> +categories: news
> +---
> +
> +This is the first blog post of a "newsletter"-like series where we give a summary of the monthly activity in the Flink community. As the Flink project grows, this can serve as a "tl;dr" for people that are not following the Flink dev and user mailing lists, or those that are simply overwhelmed by the traffic.
> +
> +###Flink graduation
> +
> +The biggest news is that the Apache board approved Flink as a top-level Apache project! The Flink team is working closely with the Apache press team for an official announcement, so stay tuned for details!
> +
> +###New Flink website
> +
> +The [Flink website](http://flink.apache.org) got a total make-over, both in terms of appearance and content.
> +
> +###Flink IRC channel
> +
> +A new IRC channel called #flink was created at irc.freenode.org. An easy way to access the IRC channel is through the [web client](http://webchat.freenode.net/). Feel free to stop by to ask anything or share your ideas about Apache Flink!
> +
> +###Meetups and Talks
> +
> +Apache Flink was presented at the [Amsterdam Hadoop User Group](http://www.meetup.com/Netherlands-Hadoop-User-Group/events/218635152).
> +
> +##Notable code contributions
> +
> +**Note:** Code contributions listed here may not be part of a release or even the current snapshot yet.
> +
> +###[Streaming Scala API](https://github.com/apache/incubator-flink/pull/275)
> +
> +The Flink Streaming Java API recently got its Scala counterpart. Once merged, Flink Streaming users can use both Scala and Java for their development. The Flink Streaming Scala API is built as a thin layer on top of the Java API, making sure that the APIs are kept easily in sync.
> +
> +###[Intermediate datasets](https://github.com/apache/incubator-flink/pull/254)
> +
> +This pull request introduces a major change in the Flink runtime. Currently, the Flink runtime is based on the notion of operators that exchange data through channels. With the PR, intermediate data sets that are produced by operators become first-class citizens in the runtime. While this does not have any user-facing impact yet, it lays the groundwork for a slew of future features such as blocking execution, fine-grained fault tolerance, and more efficient data sharing between cluster and client.
> +
> +###[Configurable execution mode](https://github.com/apache/incubator-flink/pull/259)
> +
> +This pull request allows the user to change the object-reuse behaviour. Before this pull request, some operations would reuse objects passed to the user function while others would always create new objects. This introduces a system-wide switch and changes all operators to either reuse objects or not reuse them.
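For readers wondering what such a switch looks like in practice, here is a minimal, hypothetical sketch. It is based on how later Flink releases expose the object-reuse toggle through ExecutionConfig; the exact API introduced by this pull request may differ.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ObjectReuseSwitchSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // System-wide switch: with reuse enabled, Flink may hand the same object
            // instance to a user function repeatedly, so functions must not keep
            // references to input objects across invocations.
            env.getConfig().enableObjectReuse();
            // env.getConfig().disableObjectReuse();  // the safe default: fresh objects

            env.fromElements(1, 2, 3)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * 2;
                   }
               })
               .print();
        }
    }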
> +
> +###[Distributed Coordination via Akka](https://github.com/apache/incubator-flink/pull/149)
> +
> +Another major change is a complete rewrite of the JobManager / TaskManager components in Scala. In addition to that, the old RPC service was replaced by Actors, using the Akka framework.
> +
> +###[Sorting of very large records](https://github.com/apache/incubator-flink/pull/249)
> +
> +Flink's internal sort-algorithms were improved to better handle large records (multiple 100s of megabytes or larger). Previously, the system did in some cases hold instances of multiple large records, resulting in high memory consumption and JVM heap thrashing. Through this fix, large records are streamed through the operators, reducing the memory consumption and GC pressure. The system now requires much less memory to support algorithms that work on such large records.
> +
> +###[Kryo Serialization as the new default fallback](https://github.com/apache/incubator-flink/pull/271)
> +
> +Flink's built-in type serialization framework handles all common types very efficiently. Prior versions used Avro to serialize types that the built-in framework could not handle. Flink's serialization system has improved a lot over time and by now surpasses the capabilities of Avro in many cases. Kryo now serves as the default fallback serialization framework, supporting a much broader range of types.
> +
> +###[Hadoop FileSystem support](https://github.com/apache/incubator-flink/pull/268)
> +
> +This change permits users to use all file systems supported by Hadoop with Flink. In practice this means that users can use Flink with Tachyon, Google Cloud Storage (including out-of-the-box Flink YARN support on Google Compute Cloud), FTP and all the other file system implementations for Hadoop.
> +
> +##Heading to the 0.8.0 release
> +
> +The community is working hard together with the Apache infra team to migrate the Flink infrastructure to a top-level project. At the same time, the Flink community is working on the Flink 0.8.0 release, which should be out very soon.
> \ No newline at end of file
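As a concrete illustration of the Kryo fallback mentioned under "Kryo Serialization as the new default fallback" above, here is a small, hypothetical sketch. It uses the ExecutionConfig API of later Flink releases, and the class LegacyEvent is made up: a type that is not recognised as a tuple or POJO is handled by the generic, Kryo-backed serializer, and can optionally be pre-registered for a more compact encoding.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class KryoFallbackSketch {

        // Not a Flink POJO (no default constructor, private final field without
        // accessors), so the type analyzer classifies it as a generic type and
        // the Kryo-based fallback serializer takes over.
        public static class LegacyEvent {
            private final String payload;

            public LegacyEvent(String payload) {
                this.payload = payload;
            }

            @Override
            public String toString() {
                return "LegacyEvent(" + payload + ")";
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Optional tuning: pre-registering the class lets Kryo write a small
            // integer tag instead of the fully qualified class name per record.
            env.getConfig().registerKryoType(LegacyEvent.class);

            DataSet<LegacyEvent> events =
                    env.fromElements(new LegacyEvent("a"), new LegacyEvent("b"));

            events.print();
        }
    }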
> Modified: flink/site/blog/page2/index.html
> URL: http://svn.apache.org/viewvc/flink/site/blog/page2/index.html?rev=1650029&r1=1650028&r2=1650029&view=diff
> ==============================================================================
>
> Accessing Data Stored in MongoDB with Stratosphere (/news/2014/01/28/querying_mongodb.html)
> 28 Jan 2014
>
> We recently merged a pull request (https://github.com/stratosphere/stratosphere/pull/437) that allows you to use any existing Hadoop InputFormat (http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat) with Stratosphere. So you can now (in the 0.5-SNAPSHOT and upwards versions) define a Hadoop-based data source:
>
>     HadoopDataSource source = new HadoopDataSource(new TextInputFormat(), new JobConf(), "Input Lines");
>     TextInputFormat.addInputPath(source.getJobConf(), new Path(dataInput));
>
> We describe in the following article how to access data stored in MongoDB (http://www.mongodb.org/) with Stratosphere. This allows users to join data from multiple sources (e.g. MongoDB and HDFS) or perform machine learning with the documents stored in MongoDB.
>
> The approach here is to use the MongoInputFormat that was developed for Apache Hadoop but now also runs with Stratosphere.
>
>     JobConf conf = new JobConf();
>     conf.set("mongo.input.uri", "mongodb://localhost:27017/enron_mail.messages");
>     HadoopDataSource src = new HadoopDataSource(new MongoInputFormat(), conf, "Read from Mongodb", new WritableWrapperConverter());
>
> Example Program
>
> The example program reads data from the enron dataset (http://www.cs.cmu.edu/%7Eenron/) that contains about 500k internal e-mails. The data is stored in MongoDB and the Stratosphere program counts the number of e-mails per day.
>
> The complete code of this sample program is available on GitHub (https://github.com/stratosphere/stratosphere-mongodb-example).
>
> Prepare MongoDB and the Data
>
>     bunzip2 enron_mongo.tar.bz2
>     tar xvf enron_mongo.tar
>     mongorestore dump/enron_mail/messages.bson
>
> We used Robomongo to visually examine the dataset stored in MongoDB.
>
> Build MongoInputFormat
>
> MongoDB offers an InputFormat for Hadoop on their GitHub page (https://github.com/mongodb/mongo-hadoop). The code is not available in any Maven repository, so we have to build the jar file on our own.
>
> * Check out the repository
>
>     git clone https://github.com/mongodb/mongo-hadoop.git
>     cd mongo-hadoop
>
> * Set the appropriate Hadoop version in the build.sbt, we used 1.1.
>
>     hadoopRelease in ThisBuild := "1.1"
>
> * Build the input format
>
>     ./sbt package
>
> The jar-file is now located in core/target.
>
> The Stratosphere Program
>
> Now we have everything prepared to run the Stratosphere program. I only ran it on my local computer, out of Eclipse. To do that, check out the code ...
>
>     git clone https://github.com/stratosphere/stratosphere-mongodb-example.git
>
> ... and import it as a Maven project into your Eclipse. You have to manually add the previously built mongo-hadoop jar-file as a dependency. You can now press the "Run" button and see how Stratosphere executes the little program. It was running for about 8 seconds on the 1.5 GB dataset.
>
> The result (located in /tmp/enronCountByDay) now looks like this.
>
>     11,Fri Sep 26 10:00:00 CEST 1997
>     154,Tue Jun 29 10:56:00 CEST 1999
>     292,Tue Aug 10 12:11:00 CEST 1999
>     185,Thu Aug 12 18:35:00 CEST 1999
>     26,Fri Mar 19 12:33:00 CET 1999
>
> There is one thing left I want to point out here. MongoDB represents objects stored in the database as JSON documents. Since Stratosphere's standard types do not support JSON documents, I was using the WritableWrapper here. This wrapper allows you to use any Hadoop datatype with Stratosphere.
>
> The following code example shows how the JSON documents are accessed in Stratosphere.
>
>     public void map(Record record, Collector<Record> out) throws Exception {
>         Writable valWr = record.getField(1, WritableWrapper.class).value();
>         BSONWritable value = (BSONWritable) valWr;
>         Object headers = value.getDoc().get("headers");
>         BasicDBObject headerOb = (BasicDBObject) headers;
>         String date = (String) headerOb.get("Date");
>         // further date processing
>     }
>
> Please use the comments if you have questions or if you want to showcase your own MongoDB-Stratosphere integration.
>
> Written by Robert Metzger (@rmetzger_).
> Modified: flink/site/blog/page3/index.html
> URL: http://svn.apache.org/viewvc/flink/site/blog/page3/index.html?rev=1650029&r1=1650028&r2=1650029&view=diff
> ==============================================================================
>
> Stratosphere Demo Accepted for ICDE 2013 (/news/2012/10/15/icde2013.html)
> 15 Oct 2012
>
> Our demo submission "Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs" has been accepted for ICDE 2013 in Brisbane, Australia. The demo illustrates the contributions of our VLDB 2012 paper "Opening the Black Boxes in Data Flow Optimization" [PDF] (/assets/papers/optimizationOfDataFlowsWithUDFs_13.pdf) and [Poster PDF] (/assets/papers/optimizationOfDataFlowsWithUDFs_poster_13.pdf).
>
> Visit our poster, enjoy the demo, and talk to us if you are going to attend ICDE 2013.
>
> Abstract:
> Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude.
> We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties.
> We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.
>
> Version 0.2 Released (/news/2012/08/21/release02.html)
> 21 Aug 2012