mahout-dev mailing list archives

From Saikat Kanjilal <sxk1...@hotmail.com>
Subject RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
Date Wed, 22 Jun 2016 04:21:27 GMT
Ok, so for now I am able to get around the issues below by working on code to measure performance times that does not require the notion of a DistributedContext to get this up and running. I have two methods that I am measuring performance times for: ssvd and spca. The GitHub repo is here:
https://github.com/skanjila/mahout/tree/mahout-1869
Please provide feedback, as I will now restructure/reorganize the code to add more methods and start work on a perf harness that spits out a report in csv, and then eventually tie this to zeppelin.
I've kept the JIRA up to date as well.
Thanks in advance.
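
For concreteness, the timing wrapper currently looks roughly like this (a minimal sketch, assuming the in-core ssvd/spca entry points in org.apache.mahout.math.decompositions and the seeded Matrices.symmetricUniformView helper; the matrix size and the k/p/q values are placeholders):

    import org.apache.mahout.math.{Matrices, Matrix}
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import org.apache.mahout.math.decompositions._

    // Wall-clock timer: returns the result plus the elapsed milliseconds.
    def timed[T](body: => T): (T, Long) = {
      val start = System.nanoTime()
      val result = body
      (result, (System.nanoTime() - start) / 1000000L)
    }

    // Seeded uniform view cloned into a materialized matrix, so every run
    // times the decomposition on the exact same input.
    val a: Matrix = Matrices.symmetricUniformView(500, 100, 1234).cloned

    val (_, ssvdMs) = timed { ssvd(a, k = 10, p = 15, q = 1) }
    val (_, spcaMs) = timed { spca(a, k = 10, p = 15, q = 1) }
    println(s"ssvd: $ssvdMs ms, spca: $spcaMs ms")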

> From: sxk1969@hotmail.com
> To: dev@mahout.apache.org
> Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> Date: Mon, 20 Jun 2016 20:37:31 -0700
> 
> AndrewP et al, any chance I can get some pointers on the items below? Would love some direction on this. Thanks
> 
> > From: sxk1969@hotmail.com
> > To: dev@mahout.apache.org
> > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > Date: Sun, 12 Jun 2016 12:40:26 -0700
> > 
> > Hi Folks, I need some input/help here to get me unblocked and moving:
> > 1) I need to reuse/extend the DistributedContext inside the runtime perf measurement module, as all the algorithms inside math-scala need it. I was trying to mimic some of the H2O code and saw that they have their own engine. I am wondering what the best way is to extend DistributedContext and get the benefit of an already existing engine without needing to tie into h2o or flink, or is the only way to add an engine to point to one of those back ends? Ideally I want to build the runtime perf module in a backend-agnostic way, and currently I don't see a way around this. Thoughts?
> > 2) I also tried to reuse some of the logic inside math-scala, but in digging into this code it seems that it is strongly tied to scala test utilities.
> > 
> > Net-Net: I just need access to the DistributedContext without linking in any test utilities or backends.
> > Would love some advice on ways to move forward to maximize reuse. Thanks in advance.
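> >
> > For what it's worth, the shape I have in mind is something like this (a rough sketch; PerfBenchmark and its run method are hypothetical names, only DistributedContext is the real math-scala type):
> >
> >     import org.apache.mahout.math.drm.DistributedContext
> >
> >     // The harness codes against the abstract context only; whichever
> >     // engine (spark, h2o, flink) built the context stays out of this module.
> >     trait PerfBenchmark {
> >       def name: String
> >       // Caller supplies the context, so no backend gets linked in here.
> >       def run(implicit ctx: DistributedContext): Long // elapsed millis
> >     }
> >
> > A thin per-backend launcher (e.g. something like mahoutSparkContext in the spark bindings) would then be the only place an engine is ever named.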
> > 
> > > From: sxk1969@hotmail.com
> > > To: dev@mahout.apache.org
> > > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > Date: Thu, 9 Jun 2016 21:45:13 -0700
> > > 
> > > Andrew et al, over the past few days I've finally gotten a self-contained module compiling that leverages the DistributedContext. For starters I copied the NaiveBayes test code, ripped out the test infrastructure code around it, and then added some timers; next steps will be to dump to csv and eventually to zeppelin. Some questions before I get too far ahead:
> > > 1) I made the design decision to create my own trait and encapsulate the context within that. I am wondering if I should instead leverage the context that is already defined in math-scala; this however brings its own complications in that it pulls in the MahoutSuite, which I'm not sure I really need. Thoughts on this?
> > > 2) I need some infrastructure to run the perf framework. I can use an azure ubuntu vm for now, but is there an AWS instance or some other vm I can eventually use? I would really like to avoid using my mac laptop as a runtime perf testing environment.
> > > 
> > > Thanks, I'll update JIRA as I make more headway.
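> > >
> > > For the csv dump step I'm picturing something this simple (a sketch; PerfResult and dumpCsv are hypothetical names):
> > >
> > >     import java.io.PrintWriter
> > >
> > >     // One row per timed run; columns can grow as more metrics land.
> > >     case class PerfResult(algorithm: String, datasetSize: Long, millis: Long)
> > >
> > >     def dumpCsv(results: Seq[PerfResult], path: String): Unit = {
> > >       val out = new PrintWriter(path)
> > >       try {
> > >         out.println("algorithm,datasetSize,millis")
> > >         results.foreach(r => out.println(s"${r.algorithm},${r.datasetSize},${r.millis}"))
> > >       } finally out.close()
> > >     }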
> > > 
> > > > From: sxk1969@hotmail.com
> > > > To: dev@mahout.apache.org
> > > > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > Date: Mon, 6 Jun 2016 08:58:49 -0700
> > > > 
> > > > Andrew, thanks for the input. I will shift gears a bit and just get some lightweight code going that calls into mahout algorithms and does a csv dump out. Note that I think akka could be a good fit for this, as you could make an async call and get back a notification when the csv dump is finished. Also, I am indeed not focusing on mapreduce algorithms and will be tackling the algorithms in the math-scala library. What do you think of making this a lightweight web-based workbench using spray that committers can run outside of mahout through curl or something? This was my initial vision in using spray, and it's good that I'm getting early feedback.
> > > > 
> > > > On zeppelin, do you think it's worthwhile that I incorporate Trevor's efforts to take that csv and turn it into one or two visualizations? I'm trying to understand how that effort may (or may not) intersect with what I'm trying to accomplish.
> > > > Also point taken on the small data sets.
> > > > Thanks
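> > > >
> > > > To make the async-notification idea concrete without pulling in akka yet, a plain scala Future already gives the callback shape (a sketch; runAndDump is a hypothetical name):
> > > >
> > > >     import scala.concurrent.Future
> > > >     import scala.concurrent.ExecutionContext.Implicits.global
> > > >
> > > >     // Runs off the calling thread; the caller is notified on completion.
> > > >     def runAndDump(algorithm: String, csvPath: String): Future[String] =
> > > >       Future {
> > > >         // ... call into the mahout algorithm, time it, write the csv ...
> > > >         csvPath
> > > >       }
> > > >
> > > >     runAndDump("ssvd", "/tmp/perf-ssvd.csv").onComplete { result =>
> > > >       println(s"csv dump finished: $result")
> > > >     }
> > > >
> > > > Spray would only come in if we decide the curl-able workbench is worth it.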
> > > > 
> > > > > From: ap.dev@outlook.com
> > > > > To: dev@mahout.apache.org
> > > > > Subject: Re: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > > Date: Mon, 6 Jun 2016 15:50:16 +0000
> > > > > 
> > > > > Saikat,
> > > > > 
> > > > > If you're going to pursue this, there are a few things that I would suggest. First, keep it lightweight. We don't want to bring a lot of extra dependencies or data into the distribution. I'm not sure what this means as far as spray/akka, but those seem like overkill in my opinion. This should be able to be kept down to a simple csv dump, I think.
> > > > > 
> > > > > Second, use data that can be either randomly generated with a seeded RNG, generated from a function like Mackey-Glass, or downloaded (probably best), and only use a very small sample in the tests, since they're pretty long currently. The main point being that we don't want to ship any large test datasets with the distro.
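> > > > >
> > > > > Something along these lines would do it (a sketch, assuming the seeded random-matrix view in org.apache.mahout.math.Matrices; sizes and seed are arbitrary):
> > > > >
> > > > >     import org.apache.mahout.math.Matrices
> > > > >
> > > > >     // Regenerated identically on every run from the fixed seed, so no
> > > > >     // dataset file ever needs to ship with the distro.
> > > > >     val syntheticData = Matrices.symmetricUniformView(1000, 50, 42)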
> > > > > 
> > > > > Third, we're not using MapReduce anymore, so focus on algorithms in the math-scala library (e.g. dssvd, thinqr, dals, etc.) as well as matrix algebra operations. That is where I see this being useful, so that we may compare changes and optimizations going forward.
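> > > > >
> > > > > For the matrix algebra side, even a one-liner over the in-core R-like DSL would make a useful baseline (a sketch; sizes are arbitrary):
> > > > >
> > > > >     import org.apache.mahout.math.Matrices
> > > > >     import org.apache.mahout.math.scalabindings._
> > > > >     import RLikeOps._
> > > > >
> > > > >     // Time an in-core A' * A; the same pattern applies to the drm ops.
> > > > >     val a = Matrices.symmetricUniformView(1000, 100, 1234).cloned
> > > > >     val t0 = System.nanoTime()
> > > > >     val ata = a.t %*% a
> > > > >     println(s"A'A took ${(System.nanoTime() - t0) / 1000000L} ms")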
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Andy
> > > > > 
> > > > > ________________________________________
> > > > > From: Saikat Kanjilal <sxk1969@hotmail.com>
> > > > > Sent: Friday, June 3, 2016 12:35:54 AM
> > > > > To: dev@mahout.apache.org
> > > > > Subject: RE: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > > 
> > > > > Hi All, I created a JIRA ticket and have moved the discussion for the runtime performance framework there:
> > > > > https://issues.apache.org/jira/browse/MAHOUT-1869
> > > > > @AndrewP & Trevor, I would like to integrate zeppelin into the runtime performance measurement framework to output some measurement-related data for some of the algorithms.
> > > > > Should I wait till the zeppelin integration is completely working before I incorporate this piece?
> > > > > Also, I would really appreciate some feedback, either on the JIRA ticket or in response to this thread. Regards
> > > > > 
> > > > > > From: sxk1969@hotmail.com
> > > > > > To: dev@mahout.apache.org
> > > > > > Subject: [Discuss--A proposal for building an application in mahout to measure runtime performance of algorithms in mahout]
> > > > > > Date: Thu, 19 May 2016 21:31:05 -0700
> > > > > >
> > > > > > This proposal outlines a runtime performance module used to measure the performance of various algorithms in mahout in three major areas: clustering, regression, and classification. The module will be a spray/scala/akka application which can be run against any current or new algorithm in mahout and will produce a csv file and a set of zeppelin plots outlining the various criteria for performance. The goal for releasing any new build in mahout will be to run a set of tests for each of the algorithms to compare and contrast benchmarks from one release to another.
> > > > > >
> > > > > >
> > > > > > Architecture
> > > > > > The run time performance application will run on top of spray/scala and akka and will make async api calls into the various mahout algorithms to generate a csv file containing data representing the run time performance measurement calculations for each algorithm of interest, as well as a set of zeppelin plots for displaying some of these results. The spray/scala architecture will leverage the zeppelin server to create the visualizations. The discussion below centers around two types of algorithms to be addressed by the application.
> > > > > >
> > > > > >
> > > > > > Clustering
> > > > > > The application will consist of a set of rest APIs to do the following:
> > > > > >
> > > > > >
> > > > > > a) A method to load and execute the run time perf module, which takes as inputs the name of the algorithm (kmeans, fuzzy kmeans), the location of a set of files containing various sizes of data sets, and finally a set of values for the number of clusters to use for each of the different sizes of the datasets:
> > > > > >
> > > > > >
> > > > > > /algorithm=clustering/fileLocation=/path/to/files/of/different/datasets/clusters=12,20,30,40
> > > > > >
> > > > > >
> > > > > > The above API call will return a runId which the client program can then use to monitor the module.
> > > > > >
> > > > > >
> > > > > > b) A method to monitor the application, to ensure that it's making progress towards generating the zeppelin plots:
> > > > > > /monitor/runId=456
> > > > > >
> > > > > >
> > > > > > The above method will execute asynchronously by calling into the mahout kmeans (fuzzy kmeans) clustering implementations and will generate zeppelin plots showing the normalized time on the y axis and the number of clusters on the x axis. The spray/scala akka framework will allow the client application to receive a callback when the run time performance calculations are actually completed. For now the calculations for measuring run time performance will contain: a) the ratio of the number of points clustered correctly to the total number of points, and b) the total time taken for the algorithm to run. These items will be represented in separate zeppelin plots.
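> > > > > >
> > > > > > A rough sketch of how the run and monitor entry points above might look in spray's routing DSL (illustrative only: query parameters are used instead of the path encoding shown above, and the JSON shapes are placeholders):
> > > > > >
> > > > > >     import akka.actor.Actor
> > > > > >     import spray.routing._
> > > > > >
> > > > > >     class PerfServiceActor extends Actor with HttpService {
> > > > > >       def actorRefFactory = context
> > > > > >       def receive = runRoute(perfRoute)
> > > > > >
> > > > > >       val perfRoute =
> > > > > >         path("run") {
> > > > > >           parameters("algorithm", "fileLocation", "clusters") { (algo, loc, k) =>
> > > > > >             get {
> > > > > >               // kick off the async run here, hand back an id to poll with
> > > > > >               complete(s"""{"runId": "${java.util.UUID.randomUUID}"}""")
> > > > > >             }
> > > > > >           }
> > > > > >         } ~
> > > > > >         path("monitor" / Segment) { runId =>
> > > > > >           get { complete(s"""{"runId": "$runId", "status": "RUNNING"}""") }
> > > > > >         }
> > > > > >     }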
> > > > > >
> > > > > >
> > > > > > Regression
> > > > > > a) The runtime performance module will run the likelihood ratio test with a different set of features in every run. We will introduce a rest API to run the likelihood ratio test and return the results; this will once again be an async call through the spray/akka stack.
> > > > > >
> > > > > >
> > > > > > b) The run time performance module will capture the following metrics for every algorithm: 1) cpu usage, 2) memory usage, 3) time taken for the algorithm to converge and run to completion. These metrics will be reported on top of the zeppelin graphs for both the regression and the different clustering algorithms mentioned above.
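> > > > > >
> > > > > > For 1) and 2), the JVM's own management beans should be enough (a sketch; the process cpu load requires the HotSpot-specific com.sun.management cast noted in the comment):
> > > > > >
> > > > > >     import java.lang.management.ManagementFactory
> > > > > >
> > > > > >     // Snapshot of process cpu load (0.0-1.0) and heap bytes in use,
> > > > > >     // sampled before and after each timed run.
> > > > > >     def snapshotMetrics(): (Double, Long) = {
> > > > > >       val os = ManagementFactory.getOperatingSystemMXBean
> > > > > >         .asInstanceOf[com.sun.management.OperatingSystemMXBean]
> > > > > >       val rt = Runtime.getRuntime
> > > > > >       (os.getProcessCpuLoad, rt.totalMemory - rt.freeMemory)
> > > > > >     }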
> > > > > >
> > > > > > How does the application get run
> > > > > > The run time performance measuring application will get invoked from the command line; eventually it would be worthwhile to hook this into some sort of integration test suite to certify the different mahout releases.
> > > > > >
> > > > > >
> > > > > > I will add more thoughts around this and create a JIRA ticket only once there's enough consensus among the committers that this is headed in the right direction. I will also add some more thoughts on measuring run time performance of some of the other algorithms after some more research.
> > > > > > Would love feedback or additional things to consider that I might have missed. If it's more appropriate I can move the discussion to a jira ticket as well, so please let me know. Thanks in advance.