spark-issues mailing list archives

From "Alok Singh (JIRA)" <>
Subject [jira] [Commented] (SPARK-5571) LDA should handle text as well
Date Fri, 17 Jul 2015 06:52:05 GMT


Alok Singh commented on SPARK-5571:

Hi Feynman,

Sorry for the delay and the gap. We had some training and a few internal updates/changes
here at work, and I was not able to respond.

Here are my thoughts; please comment.

I think we will need a stemmer module too. I was thinking we could just create a wrapper over
the Lucene EnglishAnalyzer or the OpenNLP stemmer. This can be a separate transformer JIRA
under the 'ml' tag.
Without this component, every inflected form of a word becomes its own term, so we will have
a lot of edges and nodes in the created GraphX graph.

For stop words, we can support two approaches:
- In one, the user gives the list of stop words.
- In the other, we calculate it using the IDF from the TF-IDF transformer. We could create a new
transformer which under the hood calls the TF-IDF transformer with a filter range. This
can also be another transformer JIRA under the 'ml' tag.
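
To make the second approach concrete, here is a rough sketch in plain Python (not the Spark API). The function name `idf_stop_words` and the `min_idf` threshold are hypothetical, and only the lower end of the filter range is shown. The smoothed form log((N + 1) / (df + 1)) is the one MLlib's IDF documents. Terms that occur in nearly every document get an IDF close to zero and are treated as stop words.

```python
import math
from collections import Counter

def idf_stop_words(docs, min_idf):
    """Return the terms whose IDF is below min_idf (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    return {term for term, value in idf.items() if value < min_idf}

# Toy corpus: each document is already a bag of words.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
stop_words = idf_stop_words(docs, min_idf=0.1)
```

On this toy corpus only "the" appears in every document, so it is the only term that falls below the threshold.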

The LDA.runText
The core LDA.runText method can be under the 'mllib' tag and becomes easier with the assumption
that the input bags of words just need to be passed to a CountVectorizer and then to
which will be implemented as per the description.
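
As a rough illustration of the indexing step (plain Python, not the Spark CountVectorizer API; `build_vocabulary` and `to_count_vector` are hypothetical names), each bag of words is mapped through a term dictionary to a vector of word counts, which is the input form LDA already accepts:

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct term to an integer index, in sorted term order."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms)}

def to_count_vector(doc, vocab):
    """Turn one bag of words into a dense vector of term counts."""
    vector = [0] * len(vocab)
    for term, count in Counter(doc).items():
        # Terms missing from the dictionary are ignored.
        if term in vocab:
            vector[vocab[term]] = count
    return vector

docs = [["spark", "lda", "spark"], ["lda", "topic"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
```

The dictionary built here is what would let topic descriptions be mapped back from term indices to Strings.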

The complete pipeline
Users can create their own pipeline using 'ml', but I think we should create a TextLDA_Pipeline
which combines the above steps and put it under an 'ml' tag JIRA.
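
A minimal sketch of how the preprocessing stages might chain together, in plain Python rather than the 'ml' Pipeline API. `make_pipeline`, `toy_stem`, and `drop_stop_words` are hypothetical names, and the suffix-stripping stemmer is only a stand-in for a real one such as the Lucene or OpenNLP stemmers mentioned above.

```python
from functools import reduce

def make_pipeline(*stages):
    """Compose stages left to right; each stage maps a list of docs to a new list."""
    return lambda docs: reduce(lambda acc, stage: stage(acc), stages, docs)

def toy_stem(docs):
    """Stand-in stemmer: strip a trailing 's' from terms longer than three letters."""
    def stem(term):
        return term[:-1] if term.endswith("s") and len(term) > 3 else term
    return [[stem(term) for term in doc] for doc in docs]

def drop_stop_words(stop_words):
    """Build a stage that removes the given stop words from every document."""
    def stage(docs):
        return [[term for term in doc if term not in stop_words] for doc in docs]
    return stage

text_pipeline = make_pipeline(toy_stem, drop_stop_words({"the", "a"}))
cleaned = text_pipeline([["the", "cats", "sat"], ["a", "dogs", "runs"]])
```

Keeping each stage a separate function mirrors the idea of filing each transformer as its own JIRA while still offering one combined entry point.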

What are your thoughts, [~josephkb] and [~fliang]?


> LDA should handle text as well
> ------------------------------
>                 Key: SPARK-5571
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
> Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts.
> It should also support training and prediction using text (Strings).
> This plan is sketched in the [original LDA design doc|].
> There should be:
> * runWithText() method which takes an RDD with a collection of Strings (bags of words).
>   This will also index terms and compute a dictionary.
> * dictionary parameter for when LDA is run with word count vectors
> * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which
>   is commented out in LDA currently)
