flink-issues mailing list archives

From tillrohrmann <...@git.apache.org>
Subject [GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...
Date Tue, 09 Jun 2015 14:30:25 GMT
Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
    -Coming soon.
    +## Introduction
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using the
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +## Loading data
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Original) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +We can load the data as a `DataSet` of `String` tuples first:
    +{% highlight scala %}
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +{% endhighlight %}
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +{% highlight scala %}
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // The tuple has 11 fields, so the class label (last column) is at index 10
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +{% endhighlight %}
    +We can then use this data to train a learner.
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [on the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets in the LibSVM format through the `readLibSVM` function, available in the `MLUtils` object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +We can then simply import the dataset using:
    +{% highlight scala %}
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +{% endhighlight %}
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +Due to an error in the test dataset we have to adjust the test data using the following code, to
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +{% highlight scala %}
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +{% endhighlight %}
    +## Classification
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +{% highlight scala %}
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    --- End diff --
    Yes I think so. Just for the sake of completeness.

