spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mengxr <>
Subject [GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Date Fri, 07 Nov 2014 19:37:28 GMT
Github user mengxr commented on the pull request:
    > At @shivaram's suggestion, I started porting over a simple text classifier pipeline
that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V] transforms
to this interface. The almost-complete port (the imports got messed up when moving files around)
can be found at shivaram/spark-ml@522aec7.
    Thanks for testing on the newsgroup20 pipeline!
    > The trickiest bit by far was all of the implicit conversions. I ended up needing
to use several types of implicit conversion imports (case class -> schema RDD, spark sql
dsl, parameter map, etc.) They also got mysteriously deleted by the IDE as I moved files between
projects. I ended up having to copy and paste these whenever appropriate because I couldn't
keep track of them.
    I removed implicitMapping because I found that `map(inputCols)` is actually shorter than
`(inputCols: String)`. For a Scala IDE, I don't think we can trust it translating code.
    > Like Shivaram, I'm also not familiar with the Spark SQL dsl, so here I also had to
copy and paste code. It's unclear what syntax is valid and what isn't. For example, is saying
"as outputCol" enough, or is "as Symbol(outputCol)" required?
    `as(String)` is recently added to Spark master. If you are not on the latest version,
 you need `as Symbol(outputCol)`.
    > There is a lot of boilerplate code. It was easier to write the Transformers in the
form RDD[U] to RDD[V] instead of SchemaRDD to SchemaRDD, so I fully agree with Shivaram on
that front. Potentially, certain interfaces along those lines (iterator to iterator transformers
that can be applied to RDDs using mappartitions) could make it easier to have transformers
not depend on local Spark Contexts to execute.
    We will keep that option. Note that the strong-type approach will face troubles when we
need to deal with extra columns, e.g, event ids, or support more features later, e.g., support
weighted instances.
    > I found the parameter mapping in estimators fairly verbose, I like Shivaram's idea
of having the estimators pass everything to the transformers no matter what.
    Yes, I like that idea too.
    > Estimators requiring the transformers they output to extend Model didn't make sense
to me. Certain estimators, such as to choose only the most frequent tokens in a collection
to keep for each document, don't seem like they should output models. On that front, should
it be required for estimators to specify the type of transformer they output? It can be convenient
sometimes to just inline an anonymous Transformer to output without making it a top-level
    The set of most frequent tokens could be viewed as a descriptive model in your case. I
did try estimators without generic model types, but I don't remember what went wrong and made
me switch to generic.
    > There are a lot of parameter traits: HasRegParam, HasMaxIter, HasScoreCol, HasFeatureCol....
Does it make sense to have this many specific parameter traits if we still have to maintain
boilerplate setters code for Java anyway?
    I have no preference on this one. I put those before I realize the problem with Java.
I marked those as `private[ml]` in the current version and I'm okay to remove them completely.

If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at or file a JIRA ticket
with INFRA.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message