flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
Date Thu, 11 Jun 2015 07:55:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14581615#comment-14581615 ]

ASF GitHub Bot commented on FLINK-2072:

Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
    -Coming soon.
    +## Introduction
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +As defined by Murphy [1], ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +## Linking with FlinkML
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +## Loading data
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
    --- End diff --
    Missing closing parenthesis of the link.
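
For readers following the quoted "Loading data" section, here is a minimal sketch of the `(features, label)` idea in plain Python. It does not use the FlinkML API: `LabeledExample`, `parse_line`, and `load_examples` are hypothetical names introduced only for illustration. It also assumes a comma-separated row layout with the label in the last column, and the sample rows are made up rather than taken from the data set.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LabeledExample:
    """Hypothetical stand-in for a labeled vector: one label plus its features."""
    label: float
    features: Tuple[float, ...]


def parse_line(line: str) -> LabeledExample:
    """Turn one comma-separated row into a labeled example."""
    values = [float(field) for field in line.strip().split(",")]
    # Assumption: the last column is the label, all earlier columns are features.
    return LabeledExample(label=values[-1], features=tuple(values[:-1]))


def load_examples(lines: Iterable[str]) -> List[LabeledExample]:
    """Parse every non-empty row, skipping blank lines."""
    return [parse_line(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Made-up rows in the assumed layout; not real records from the data set.
    rows = ["30,64,1,1", "", "34,59,0,2"]
    for example in load_examples(rows):
        print(example.label, example.features)
```

In a classification problem the label would be the class; in a regression problem the same field would hold the dependent variable.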

> Add a quickstart guide for FlinkML
> ----------------------------------
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
> We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly.

This message was sent by Atlassian JIRA
