couchdb-user mailing list archives

From Mark Hahn <m...@reevuit.com>
Subject Re: simple math/statistics problem using map reduce
Date Thu, 28 Nov 2013 18:56:55 GMT
Thanks.  That makes sense.

Using map/reduce was as much a curiosity as a practical requirement.
Another way to monitor accuracy is to watch my progress indicator and see
how close it is to the real time.


On Thu, Nov 28, 2013 at 10:50 AM, Nitin Borwankar <nitin@borwankar.com> wrote:

> Hi Mark,
>
> It may not be worth it to have a real time estimate of the coefficients A,
> B if the variance is very small.
>
> In other words, if your collection of past videos covers most of the
> different kinds of videos you are likely to encounter, then estimates of
> A, B are likely to be pretty robust and not change much with future new
> samples.  So using older A, B is not likely to throw your conversion time
> predictions off by much.
>
> So the next sample that comes along is not likely to add much change to
> the values of A, B, and you might as well update much less frequently -
> daily, weekly, whatever - via cron or batch updates.
>
> How does one determine this?  Here's a "back of the envelope", "seat of
> the pants" experiment.
>
> So first, after calculating A, B using all your video conversion times, I
> would do a second series of calculations.
>
> Here I would start with, say, the first 20 videos, or some number equal to
> roughly 30-50% of your videos, and calculate A, B.  Then keep adding the
> next 5% and repeat the calculation of A, B.
>
> Do this until you use up all the samples. But at the last step just add one
> video at a time for the last 10 videos while doing the calcs.
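The incremental experiment described above can be sketched in code. This is a rough illustration, not code from the thread: the sample field names (runLen, fileSize, convTime) and the fitAB helper are assumptions, and the fit is an ordinary least-squares solve of the 2x2 normal equations for convTime ~ A*runLen + B*fileSize.

```javascript
// Least-squares fit of convTime ~ A*runLen + B*fileSize by solving
// the 2x2 normal equations (no intercept term, matching the thread).
function fitAB(samples) {
  var Srr = 0, Srf = 0, Sff = 0, Srt = 0, Sft = 0;
  samples.forEach(function (s) {
    Srr += s.runLen * s.runLen;
    Srf += s.runLen * s.fileSize;
    Sff += s.fileSize * s.fileSize;
    Srt += s.runLen * s.convTime;
    Sft += s.fileSize * s.convTime;
  });
  var det = Srr * Sff - Srf * Srf;
  return { A: (Srt * Sff - Sft * Srf) / det,
           B: (Sft * Srr - Srt * Srf) / det };
}

// Re-fit on growing prefixes: start at roughly 30% of the samples,
// grow by 5% steps, then one sample at a time for the last 10, and
// record each (A, B) estimate so its spread can be inspected.
function growingEstimates(samples) {
  var n = samples.length, out = [];
  var step = Math.max(1, Math.round(n * 0.05));
  var k = Math.max(2, Math.round(n * 0.3));
  while (k < n - 10) { out.push(fitAB(samples.slice(0, k))); k += step; }
  for (; k <= n; k++) out.push(fitAB(samples.slice(0, k)));
  return out;
}
```

If the A and B values in the returned list settle down quickly, that is the "small variance" case where infrequent batch updates are good enough.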
>
> Now look at A, B for each calculation.  Do they settle down to be close to
> a "mean" A and "mean" B? What is the variance around the mean A, B?  If
> this is small or very small, then re-computing every time is "really cool
> and all" but not worth it computationally.
>
> What is meant by "small" here?  Well, take two successive estimates of A,
> B.  Do a prediction using A1, B1 and then A2, B2: how much are you off by
> if you use the older estimate?  If A, B don't vary much then your
> prediction won't vary much, and you could use a stale estimate without
> noticeable impact on your prediction.  Noticeable = say, off by more than
> 10% in prediction accuracy.
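This staleness check can be sketched numerically. The predict/staleness helpers and the example numbers below are made up for illustration, not taken from the thread:

```javascript
// Predict a conversion time from an (A, B) estimate.
function predict(ab, runLen, fileSize) {
  return ab.A * runLen + ab.B * fileSize;
}

// Relative change in the prediction when using an older estimate
// instead of the newer one, for the same video.
function staleness(older, newer, runLen, fileSize) {
  var pOld = predict(older, runLen, fileSize);
  var pNew = predict(newer, runLen, fileSize);
  return Math.abs(pOld - pNew) / pNew;
}

// Example: if successive estimates barely move, the relative error
// from reusing the older one stays well under the 10% threshold.
var drift = staleness({ A: 2.0, B: 0.5 }, { A: 2.1, B: 0.5 }, 10, 100);
// drift here is about 0.014, i.e. ~1.4%
```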
>
> Then just update A,B every day or week.
>
> Bottom line: before you do a "real time" update of parameters, do a "back
> of the envelope" experiment to see if it's worth it for the complexity and
> point-of-failure it adds.
>
> Happy to chat offlist and/or offline if you want - am nborwankar on the
> google email system.
>
> Nitin
>
>
>
> ------------------------------------------------------------------
> Nitin Borwankar
> nborwankar@gmail.com
>
>
> On Wed, Nov 27, 2013 at 1:13 PM, Mark Hahn <mark@reevuit.com> wrote:
>
> > I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
> > problem here.  Consider it a holiday mind exercise while avoiding
> > relatives.
> >
> > I send customer-uploaded videos to Amazon Elastic Transcoder to generate
> > a video for html5 consumption.  It takes a few seconds up to tens of
> > minutes to convert.  I have no way to track progress so I estimate the
> > time to complete and show a fake progress indicator.  I have been using
> > the run-time of the video and this is not working well at all.  High
> > bit-rate (big file) videos fare the worst.
> >
> > I'm guessing there are two main parameters for estimating the conversion
> > time: the file size and the run-time.  The file size is a good estimate
> > of input processing and the run-time is a good estimate of output
> > processing.  Amazon has been pretty consistent in their conversion times
> > in the short run.
> >
> > I have tons of data in my couchdb from previous conversions.  I want to
> > do regression analysis of these past runs to calculate parameters for
> > estimation.  I know the file-size, run-time, and conversion time for
> > each.
> >
> > I will use runLen * A + fileSize * B as the estimation formula.  A and B
> > will be calculated by solving runLen * A + fileSize * B = convTime from
> > the samples.  It would be nice to use a map-reduce to always have the
> > latest estimate of A and B, if possible.
> >
> > My first thought would be to just find the average of each of the three
> > input vars and solve for A and B using those averages.  However, I'm
> > pretty sure this would yield the wrong result because each sample's
> > three values need to be kept together (not sure).
> >
> > So I would like to have each map take one conversion sample and do the
> > regression in the reduce.  Can someone give me pointers on how to do
> > this?
> >
>
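On the closing map/reduce question: the usual trick is to have the map emit the cross-product terms of the least-squares normal equations and let the reduce sum them component-wise (sums are associative, so rereduce is safe), then solve the resulting 2x2 system on the client. The following is a sketch under assumed document field names (runLen, fileSize, convTime), not a tested CouchDB design document:

```javascript
// CouchDB-style map: emit the cross-products needed for the normal
// equations of convTime ~ A*runLen + B*fileSize.
function map(doc) {
  if (doc.runLen && doc.fileSize && doc.convTime) {
    var r = doc.runLen, f = doc.fileSize, t = doc.convTime;
    emit(null, [r * r, r * f, f * f, r * t, f * t]);
  }
}

// CouchDB-style reduce: component-wise sum; the same function works
// for rereduce because addition is associative.
function reduce(keys, values, rereduce) {
  var sums = [0, 0, 0, 0, 0];
  for (var i = 0; i < values.length; i++)
    for (var j = 0; j < 5; j++)
      sums[j] += values[i][j];
  return sums;
}

// Client side: solve the normal equations
//   A*Srr + B*Srf = Srt
//   A*Srf + B*Sff = Sft
function solveAB(sums) {
  var Srr = sums[0], Srf = sums[1], Sff = sums[2],
      Srt = sums[3], Sft = sums[4];
  var det = Srr * Sff - Srf * Srf;
  return { A: (Srt * Sff - Sft * Srf) / det,
           B: (Sft * Srr - Srt * Srf) / det };
}
```

Querying the view with reduce=true then returns the five sums for the whole database, so the latest A and B are always one GET plus one small 2x2 solve away.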
