flink-issues mailing list archives

From tillrohrmann <...@git.apache.org>
Subject [GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Date Thu, 11 Jun 2015 08:15:39 GMT
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32198348
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1], ML deals with detecting patterns in data and using those
    +learned patterns to make predictions about the future. We can divide most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for dimensionality reduction, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the last (4th) column is the class, indicating whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet[(String, String, String, String)]` first:
    +
    +{% highlight scala %}
    +
    +// The wildcard import also brings in the implicit TypeInformation that readCsvFile needs
    +import org.apache.flink.api.scala._
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will, however, use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [on the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    --- End diff --
    
    If you do something like this it should work: `[[1]](#[1])` to mark the anchor link and `<a name="[1]"></a>` for the anchor.
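
A minimal sketch of how that suggestion might look in `quickstart.md`, assuming the references are listed at the end of the page. The reference entry text is a placeholder, not taken from the PR, and the rendering has not been verified against the site's Markdown processor:

```markdown
As defined by Murphy [[1]](#[1]) ML deals with detecting patterns in data, ...

<!-- Later, in the references list; the entry text is a placeholder -->
<a name="[1]"></a>[1] (reference entry for Murphy goes here)
```

The inline `[[1]](#[1])` renders as a bracketed "[1]" that links to the named anchor placed just before the matching reference entry.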


