flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2072) Add a quickstart guide for FlinkML
Date Mon, 08 Jun 2015 09:21:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576804#comment-14576804 ]

ASF GitHub Bot commented on FLINK-2072:

Github user thvasilo commented on a diff in the pull request:

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
    -Coming soon.
    +## Introduction
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with big data learning tasks. In this
    +quick-start guide we will show how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +As defined by Murphy [cite ML-APP], ML deals with detecting patterns in data and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +## Loading data
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object has a FlinkML `Vector` member representing the features of the example and a `Double`
    +member representing the label, which could be the class in a classification problem or the dependent
    +variable in a regression problem.
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Original) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +We can first load the data as a `DataSet` of `String` tuples, one tuple field per column:
    +{% highlight scala %}
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String,
    +  String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +{% endhighlight %}
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +{% highlight scala %}
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map { list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // 11 columns: index 0 is the sample ID, 1 to 9 are the features, 10 is the class label
    +    LabeledVector(numList(10), DenseVector(numList.slice(1, 10).toArray))
    +  }
    +{% endhighlight %}
    +We can then use this data to train a learner.
    +A common format for ML datasets is the LibSVM format, and a number of datasets using that format can be
    +found [on the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets in the LibSVM format via the `readLibSVM` function of the `MLUtils` object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +We can then import the dataset using:
    +{% highlight scala %}
    +val adultTrain = MLUtils.readLibSVM(env, "path/to/a8a")
    +val adultTest = MLUtils.readLibSVM(env, "path/to/a8a.t")
    +{% endhighlight %}
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +Due to an error in the test dataset we have to adjust the test data using the following code, to
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +{% highlight scala %}
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +{% endhighlight %}
    +## Classification
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +{% highlight scala %}
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +{% endhighlight %}
    --- End diff --
    Will add.
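
For readers following the LibSVM loading and padding steps quoted in the diff, here is a minimal plain-Scala sketch of what those steps amount to, with no Flink dependency. It is not FlinkML's `readLibSVM` implementation; the names `LibSvmSketch`, `Example`, `parseLine`, `toDense` and `parseExample` are hypothetical and exist only for illustration. The dimension 123 is the one mentioned in the diff.

```scala
// Plain-Scala sketch (no Flink dependency). All names are hypothetical and
// only illustrate the LibSVM text format and the zero-padding idea.
object LibSvmSketch {
  // One labeled example: a label plus a dense feature array.
  final case class Example(label: Double, features: Array[Double])

  // Parse a LibSVM line such as "+1 3:1 11:0.5" into (label, sparse entries).
  // LibSVM feature indices are 1-based, so they are shifted to 0-based here.
  def parseLine(line: String): (Double, Seq[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val entries = tokens.tail.toSeq.map { tok =>
      val Array(idx, value) = tok.split(":")
      (idx.toInt - 1, value.toDouble)
    }
    (label, entries)
  }

  // Expand sparse entries into a dense array of a fixed dimension,
  // leaving every index that is absent from the line at 0.0.
  def toDense(entries: Seq[(Int, Double)], dim: Int): Array[Double] = {
    val dense = Array.fill(dim)(0.0)
    entries.foreach { case (i, v) =>
      require(i >= 0 && i < dim, s"index $i outside dimension $dim")
      dense(i) = v
    }
    dense
  }

  // Parse one line and pad it to `dim` features in a single step.
  def parseExample(line: String, dim: Int): Example = {
    val (label, entries) = parseLine(line)
    Example(label, toDense(entries, dim))
  }
}
```

Calling `LibSvmSketch.parseExample("+1 3:1 11:0.5", 123)` yields an example whose feature array always has length 123, regardless of the highest index present on the line. That is the property the padding code in the diff is meant to restore for the test set.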

> Add a quickstart guide for FlinkML
> ----------------------------------
>                 Key: FLINK-2072
>                 URL: https://issues.apache.org/jira/browse/FLINK-2072
>             Project: Flink
>          Issue Type: New Feature
>          Components: Documentation, Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Theodore Vasiloudis
>             Fix For: 0.9
> We need a quickstart guide that introduces users to the core concepts of FlinkML to get them up and running quickly.
