mahout-dev mailing list archives

From Trevor Grant <trevor.d.gr...@gmail.com>
Subject Re: Traits for a mahout algorithm Library.
Date Thu, 21 Jul 2016 19:35:01 GMT
+1

The sklearn paradigm I think is awesome as an API, but I'm not looking to
make sklearn for Spark.  To Dmitriy's first point (correct me if I'm
extrapolating incorrectly), every underlying engine already has an SGD
regression, k-means, and a couple of other standbys.  They take no time to
build, but why bother? If users want them, they can use them in the native
engine (or we can slap them in there just because).

Let's (aim to) differentiate by providing useful algorithms not already
shipped standard in every other ML package on the block.

Another 'algorithm' that is used very widely in every industry I've been in
(marketing and CPG), and that doesn't have a pleasant 'Big Data' solution, is
hierarchical models (also called mixed models).  There are a bunch of other
'daily drivers' that everyone already uses in R/SAS/etc. that just don't
scale well; hence the rise of SGD and Big Data algos.  Mahout is the ML
library for people who actually know math, IMHO, in contrast to others that
are ML for computer scientists.  Let's expose some algorithms that
single-node analysts know and are comfortable with.
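
(To pin down what I mean: think lme4's  y ~ x + (1 | group)  in R. A purely
hypothetical sketch of the analyst-facing call in Scala; HierarchicalLM and
its arguments are invented for illustration, not an existing or planned API:)

val X: Matrix = predictors    // fixed-effects design matrix
val y: Vector = responses     // one response per observation
val g: Vector = groupIds      // grouping factor, e.g. store or region

val model = new HierarchicalLM()  // hypothetical type
model.fit(X, y, groups = g)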

So OLS isn't as efficient as SGD... so what.  An analyst can pick up
Mahout and migrate their old methods into a distributed environment.
Further, they can see t-scores and F-scores and chi-squared tests, all those
statistics that everyone has come to know and love.  I think that would be a
huge win, as it erases this idea that if you're going to work in big data
you must abandon the old ways.
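
(And for the skeptics: closed-form OLS really is only a few lines in the
Samsara DSL. A minimal sketch, assuming drmX is a DRM of features and drmY a
single-column DRM of targets; the t-scores et al. would be layered on top:)

// normal equations: beta = (X'X)^-1 X'y, with the big products distributed
val mxXtX = (drmX.t %*% drmX).collect         // X'X, small enough to collect
val vXty  = (drmX.t %*% drmY).collect(::, 0)  // X'y as an in-core vector
val beta  = solve(mxXtX, vXty)                // solved in-core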

To Dmitriy's last point: the sklearn equivalent of that is
http://scikit-learn.org/stable/modules/grid_search.html

I agree 100%; it's something I truly miss about sklearn.  I'd support
implementing those 'everyone has one' algos from the first paragraph if that
were the end goal.

Finally, re: data frames.  Why not leave it as vectors and matrices? That is
a more R-like thing to do anyway.

val X: Matrix = data
val y: Vector = labels

model1.fit(X, y)
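
(That call could be pinned to Mahout's existing in-core types with something
as small as this; SupervisedLearner is an invented name, just to show the
shape of the signature:)

import org.apache.mahout.math.{Matrix, Vector}

// sketch: the API commits to tensor types, no data frames needed
trait SupervisedLearner {
  def fit(x: Matrix, y: Vector): Unit
}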

I don't mean to dominate the conversation, and I'm sorry, but I really
wanted to toss that idea re: hierarchical models out there, because I know
lots of people who would love to have them, and it is the thing keeping them
on single-core machines at the moment.

tg


Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Thu, Jul 21, 2016 at 1:43 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> sk-learn's learner, transformer and predictor features sound good to me:
> tried and proven.
>
> most importantly, imo we need a strong, established type system and not
> repeat what i view as a problem in some other offerings. If the type system
> is strict and limited in size, then there's much less need for data
> adapters, or none at all.
>
> so, what we have:
> -- double precision tensor types (but not n-d arrays)
> what we don't have:
> -- data frames
>
> What we may want to have:
> -- formula support, especially for non-linear glm ("linear generalized
> linear" -- does that make sense at all? ok, non-linear regressions). A
> formula normally acts on data-frame-y data, not on tensor data, though it
> produces tensor data. Herein lies a conundrum. I don't see mahout taking on
> data frames, this is just too big. but good formula and "factor" (in the R
> sense) support is nice to have for down-to-earth problems.
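>
> (to illustrate the data-frame-to-tensor step: a formula like
> y ~ x + factor(city) has to one-hot expand the factor into dummy columns
> before the solver sees anything. a toy sketch in plain Scala, no particular
> API implied:)
>
> // expand a categorical column into dummy (one-hot) columns
> val cities = Seq("NYC", "SF", "NYC", "CHI")
> val levels = cities.distinct                    // factor levels
> val rows: Seq[Array[Double]] =
>   cities.map(c => levels.map(l => if (l == c) 1.0 else 0.0).toArray)
> // each row is now purely numeric, ready for the linear algebra layer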
>
> perhaps a tactical solution here is to integrate some foreign engine's data
> frames with mahout-native formula support. but i haven't given it much
> thought because, although formulas and step-wise non-linear model searches
> are the first thing to happen to any analytics (and somehow they haven't
> happened well enough elsewhere), i don't see how it can be done cheaply in
> an engine-agnostic way. I still view mahout as an under-funded project, so
> choices of new things should be smart -- small in volume, great in bang.
> Data frames are not small in volume, esp. since i am increasingly turning
> away from Spark in my personal endeavors, so i won't support just
> integrating Spark SQL for this purpose.
>
> A big area that people actually need (IMO), and that hasn't been done well
> elsewhere (IMO), is model and model-parameter search. This "ML optimizer"
> idea has been in AMPLab for as long as i can remember, and is still very
> popular, but I don't think there are good offerings that actually solve
> this problem in OSS. One of the reasons: modern OSS is pretty slow for the
> volume required by the task. if we get some unique improvements to the
> framework, we can think about getting into this business. this shouldn't be
> that difficult, assuming throughput is not an issue. GPU clusters are
> increasingly common, so we can hope we'll get there in the future.
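>
> (a naive sketch of that search loop, just to pin the idea down; Model, fit
> and score are placeholders here, not an existing API:)
>
> // exhaustive grid search: fit every combo, keep the best validation score
> val grid = for (lambda <- Seq(0.01, 0.1, 1.0); rank <- Seq(10, 50))
>   yield (lambda, rank)
> val (bestScore, bestModel) = grid.map { case (lambda, rank) =>
>   val m = new Model(lambda, rank)   // placeholder model type
>   m.fit(trainX, trainY)
>   (m.score(testX, testY), m)        // placeholder evaluation
> }.maxBy(_._1)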
>
> on the algorithm side, i would love to see something with 2-d inputs, cnns
> or something, for image processing.
>
>
>
>
> On Thu, Jul 21, 2016 at 8:08 AM, Trevor Grant <trevor.d.grant@gmail.com>
> wrote:
>
> > I was thinking so too.  Most ML frameworks are at least loosely based on
> > the sklearn paradigm.  For those not familiar, at a very abstract level:
> >
> > model1 = new Algo  // e.g. K-Means, Random Forest, Neural Net
> >
> > model1.fit( trainingData )
> >
> > // then, depending on the goal of the algorithm, you have either (or both)
> > preds = model1.predict( testData )  // returns a vector of predictions,
> >                                     // one per observation in the test data
> >
> > // or sometimes
> > newVals = model1.transform( testData )  // returns a new dataset-like
> > // object; this makes more sense for things like neural nets, or when
> > // you're not just predicting a single value per observation
> >
> >
> > In addition to the above, pre-processing operations also have a transform
> > method, such as:
> >
> > preprocess1 = new Normalizer
> >
> > preprocess1.fit( trainingData )  // this phase calculates the mean and
> >                                  // variance of the training data set
> >
> > preprocessedTrainingData = preprocess1.transform( trainingData )
> > preprocessedTestingData  = preprocess1.transform( testingData )
> >
> > I think this is a reasonable approach because A) it makes sense and B) it
> > is a standard of sorts across ML libraries (because of A).
> >
> > We have two high-level bucket types, based on what the output is:
> >
> > Predictors and Transformers
> >
> > Predictors: anything that returns a single value per observation; this is
> > classifiers and regressors.
> >
> > Transformers: anything that returns a vector per observation:
> > - Pre-processing operations
> > - Classifiers, in that usually there is a probability vector for each
> > observation as to which class it belongs to; the 'predict' method then
> > just picks the most likely class
> > - Neural nets (though with one small tweak they can be extended to
> > regression or classification)
> > - Any unsupervised learning application (e.g. clustering)
> > - etc.
> >
> > And so really we have something like:
> >
> > trait LearningFunction {
> >   def fit(data: Matrix): Unit
> > }
> >
> > trait Transformer extends LearningFunction {
> >   def transform(data: Matrix): Matrix
> > }
> >
> > trait Predictor extends Transformer {
> >   def predict(data: Matrix): Vector
> > }
> >
> >
> > This paradigm also lends itself nicely to pipelines...
> >
> > pipeline1 = new Pipeline
> >                 .add( transformer1 )
> >                 .add( transformer2 )
> >                 .add( model1 )
> >
> > pipeline1.fit( trainingData )
> > pipeline1.predict( testingData )
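> >
> > (A rough Scala sketch of how such a pipeline could compose, reusing the
> > traits above; illustrative only, not a committed design:)
> >
> > class Pipeline(stages: List[Transformer] = Nil) {
> >   def add(stage: Transformer): Pipeline = new Pipeline(stages :+ stage)
> >
> >   // fit each stage on the output of the previous one
> >   def fit(data: Matrix): Unit = {
> >     stages.foldLeft(data) { (d, stage) => stage.fit(d); stage.transform(d) }
> >   }
> > }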
> >
> > I have to read up on recommenders a bit more to figure out how those play
> > in, or whether we need another class.
> >
> > In addition to that, I think we would have an optimizers section that
> > allows for the various flavors of SGD, but also allows other types of
> > optimizers altogether.
> >
> > Again, just moving the conversation forward a bit here.
> >
> > Excited to get to work on this
> >
> > Best,
> >
> > tg
> >
> >
> >
> >
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Thu, Jul 21, 2016 at 7:13 AM, Sebastian <ssc@apache.org> wrote:
> >
> > > Hi Andrew,
> > >
> > > I think this topic is broader than just defining a few traits. A
> > > popular way of integrating ML algorithms is via the combination of data
> > > frames and pipelines, similar to what scikit-learn and SparkML are
> > > offering at the moment. Maybe it could make sense to integrate with
> > > what they have instead of starting our own efforts?
> > >
> > > Best,
> > > Sebastian
> > >
> > >
> > >
> > > On 21.07.2016 04:35, Andrew Palumbo wrote:
> > >
> > >> Hi All,
> > >>
> > >>
> > >> I'd like to draw your attention to MAHOUT-1856:
> > >> https://issues.apache.org/jira/browse/MAHOUT-1856
> > >>
> > >>
> > >> This is a discussion that has popped up several times over the last
> > >> couple of years. As we move towards building out our algorithm library,
> > >> it would be great to nail this down now.
> > >>
> > >>
> > >> Most importantly, so that we can no longer be criticized as "a loose
> > >> bag of algorithms", as we sometimes have been in the past.
> > >>
> > >>
> > >> The main point being: it would be good to lay out common traits for
> > >> Classification, Clustering, and Optimization algorithms.
> > >>
> > >>
> > >> This is just a start. I created this issue a few months back, and
> > >> intentionally left off Recommender, because I was unsure whether there
> > >> were common traits across them.  By traits, I am referring to both the
> > >> literal meaning and, more specifically, actual Scala traits.
> > >>
> > >>
> > >> @pat, @tdunning, @ssc, could you give your thoughts on this?
> > >>
> > >>
> > >> As well, it would be good to add online flavors of different algorithm
> > >> classes into the mix.
> > >>
> > >>
> > >> @tdunning could you share some thoughts here?
> > >>
> > >>
> > >> Trevor Grant will be heading up this effort, and it would be great if
> > >> we all as a team could come up with abstract design plans for each
> > >> class of algorithm (as well as determine the current "classes of
> > >> algorithms"), as each of us has our own unique blend of
> > >> specializations, and could give our thoughts on this.
> > >>
> > >>
> > >> Currently, this is really just the opening of the conversation.
> > >>
> > >>
> > >> It would be best to post thoughts on:
> > >> https://issues.apache.org/jira/browse/MAHOUT-1856
> > >>
> > >>
> > >> Any feedback is welcomed.
> > >>
> > >>
> > >> Thanks,
> > >>
> > >>
> > >> Andy
> > >>
> > >>
> > >>
> > >>
> >
>
