spark-reviews mailing list archives

From feynmanliang <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-9888][MLlib]User guide for new LDA feat...
Date Tue, 25 Aug 2015 17:23:58 GMT
Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8254#discussion_r37893142
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -443,23 +443,106 @@ LDA can be thought of as a clustering algorithm as follows:
     * Rather than estimating a clustering using a traditional distance, LDA uses a function based
      on a statistical model of how text documents are generated.
     
    -LDA takes in a collection of documents as vectors of word counts.
    -It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    -on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:
    -
    -* Topics: Inferred topics, each of which is a probability distribution over terms (words).
    -* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
    +LDA supports different inference algorithms via the `setOptimizer` function.
    +`EMLDAOptimizer` learns clustering using
    +[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    +on the likelihood function and yields comprehensive results, while
    +`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
    +variational
    +inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
    +and is generally memory friendly.
     
    -LDA takes the following parameters:
    +LDA takes in a collection of documents as vectors of word counts and the
    +following parameters:
     
     * `k`: Number of topics (i.e., cluster centers)
    -* `maxIterations`: Limit on the number of iterations of EM used for learning
    -* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
    -* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
    -* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created.  If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
    -
    -*Note*: LDA is a new feature with some missing functionality.  In particular, it does not yet
    -support prediction on new documents, and it does not have a Python API.  These will be added in the future.
    +* `LDAOptimizer`: Optimizer to use for learning the LDA model, either
    +`EMLDAOptimizer` or `OnlineLDAOptimizer`
    +* `docConcentration`: Dirichlet parameter for prior over documents'
    +distributions over topics. Larger values encourage smoother inferred
    +distributions.
    +* `topicConcentration`: Dirichlet parameter for prior over topics'
    +distributions over terms (words). Larger values encourage smoother
    +inferred distributions.
    +* `maxIterations`: Limit on the number of iterations.
    +* `checkpointInterval`: If using checkpointing (set in the Spark
    +configuration), this parameter specifies the frequency with which
    +checkpoints will be created.  If `maxIterations` is large, using
    +checkpointing can help reduce shuffle file sizes on disk and help with
    +failure recovery.
    +
    +
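    A minimal configuration sketch of the parameters above (assuming a live `SparkContext` named `sc`;
    the tiny word-count corpus and all parameter values are purely illustrative):

    ```scala
    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // Hypothetical corpus: (document id, vector of word counts) pairs.
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0, 2.0)),
      (1L, Vectors.dense(1.0, 3.0, 0.0, 1.0, 3.0)),
      (2L, Vectors.dense(0.0, 1.0, 1.0, 4.0, 0.0))))

    // Configure LDA; values are illustrative only.
    val lda = new LDA()
      .setK(3)                     // number of topics
      .setOptimizer("em")          // or "online" for OnlineLDAOptimizer
      .setDocConcentration(-1)     // -1 selects the optimizer-specific default
      .setTopicConcentration(-1)
      .setMaxIterations(50)
      .setCheckpointInterval(10)   // only matters if a checkpoint dir is configured
    ```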
    +All of MLlib's LDA models support:
    +
    +* `describeTopics(n: Int)`: Returns the `n` top-weighted terms and their
    +weights for each of the inferred topics, each of which is a probability
    +distribution over terms (words).
    +* `topicsMatrix`: The inferred topics as a `vocabSize` by `k` matrix,
    +where each column is a topic.
    +
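    Continuing the sketch above, a rough illustration of these two calls, which are available on the
    returned model regardless of the optimizer used (a sketch, not taken from the PR):

    ```scala
    // Fit the model on the word-count corpus defined earlier.
    val model = lda.run(corpus)

    // Top 10 terms (term indices and weights) per inferred topic.
    model.describeTopics(10).zipWithIndex.foreach { case ((terms, weights), topic) =>
      val desc = terms.zip(weights).map { case (t, w) => f"$t:$w%.3f" }.mkString(", ")
      println(s"Topic $topic: $desc")
    }

    // vocabSize x k matrix whose columns are the inferred topics.
    val topicsMat = model.topicsMatrix
    ```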
    +*Note*: LDA is still an experimental feature under active development.
    +As a result, certain features are only available in one of the two
    +optimizers / models generated by the optimizer. The following
    +discussion will describe each optimizer/model pair separately.
    +
    +**EMLDAOptimizer and DistributedLDAModel**
    +
    +For the parameters provided to `LDA`:
    +
    +* `docConcentration`: Only symmetric priors are supported, so all values
    +in the provided `k`-dimensional vector must be identical. All values
    +must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
    +(uniform `k`-dimensional vector with value $(50 / k) + 1$).
    +* `topicConcentration`: Only symmetric priors are supported. Values must be
    +$> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
    +* `maxIterations`: Interpreted as maximum number of EM iterations.
    +
    +`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
    +the inferred topics but also the full training corpus and topic
    +distributions for each document in the training corpus. A
    +`DistributedLDAModel` supports:
    +
    + * `topTopicsPerDocument(k)`: The top `k` topics and their weights for
    + each document in the training corpus
    + * `topDocumentsPerTopic(k)`: The top `k` documents for each topic and
    + the corresponding weight of the topic in the documents.
    + * `logPrior`: log probability of the estimated topics and
    + document-topic distributions given the hyperparameters
    + `docConcentration` and `topicConcentration`
    + * `logLikelihood`: log likelihood of the training corpus, given the
    + inferred topics and document-topic distributions
    +
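    A sketch of these `DistributedLDAModel` methods, continuing the example above and assuming the
    optimizer was set to `"em"` so the cast succeeds (argument values are illustrative):

    ```scala
    import org.apache.spark.mllib.clustering.DistributedLDAModel

    // Only EMLDAOptimizer produces a DistributedLDAModel.
    val distModel = model.asInstanceOf[DistributedLDAModel]

    // Top 3 topics (indices and weights) for each training document.
    val docTopics = distModel.topTopicsPerDocument(3)

    // Top 5 documents (ids and weights) for each topic.
    val topicDocs = distModel.topDocumentsPerTopic(5)

    // Diagnostics under the estimated topics and document-topic distributions.
    val prior = distModel.logPrior            // log P(topics, doc distributions | hyperparameters)
    val likelihood = distModel.logLikelihood  // log P(corpus | topics, doc distributions)
    ```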
    +**OnlineLDAOptimizer and LocalLDAModel**
    +
    +For the parameters provided to `LDA`:
    +
    +* `docConcentration`: Asymmetric priors can be used by passing in a
    +vector with values equal to the Dirichlet parameter in each of the `k`
    +dimensions. Values should be $>= 0$. Providing `Vector(-1)` results in
    +default behavior (uniform `k` dimensional vector with value $(1.0 / k)$)
    +* `topicConcentration`: Only symmetric priors are supported. Values must be
    +$>= 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
    +* `maxIterations`: Interpreted as maximum number of minibatches to
    +submit.
    +
    +In addition, `OnlineLDAOptimizer` accepts the following parameters:
    +
    +* `miniBatchFraction`: Fraction of corpus sampled and used at each
    +iteration
    +* `optimizeAlpha`: If set to true, performs maximum-likelihood
    --- End diff --
    
    SPARK-10230 will track that
    
    Should we deprecate the public APIs that reference alpha?



