Thanks. That makes sense.
Using map/reduce was as much a curiosity as a practical requirement.
Another way to monitor accuracy is to watch my progress indicator and see
how close it is to the real time.
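
A minimal sketch of the "back of the envelope" experiment from the quoted
message below, in Python for brevity. It assumes each sample is a
(run_len, file_size, conv_time) tuple and uses the no-intercept model
runLen * A + fileSize * B from the original post. The function names, the
40% starting fraction and the reference video are made up for illustration.

```python
def fit_ab(samples):
    """Least-squares fit of conv_time ~ A*run_len + B*file_size (no intercept)."""
    sxx = sxy = syy = sxt = syt = 0.0
    for run_len, file_size, conv_time in samples:
        sxx += run_len * run_len
        sxy += run_len * file_size
        syy += file_size * file_size
        sxt += run_len * conv_time
        syt += file_size * conv_time
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("run_len and file_size are collinear; cannot fit A, B")
    return (sxt * syy - syt * sxy) / det, (syt * sxx - sxt * sxy) / det


def prefix_estimates(samples, start_fraction=0.4, step_fraction=0.05, tail=10):
    """Refit A, B on growing prefixes: 5% steps, then one video at a time."""
    n = len(samples)
    first = max(2, int(n * start_fraction))
    step = max(1, int(n * step_fraction))
    tail_start = max(first, n - tail)
    sizes = list(range(first, tail_start, step)) + list(range(tail_start, n + 1))
    return [(k, fit_ab(samples[:k])) for k in sizes]


def max_prediction_drift(estimates, run_len, file_size):
    """Largest relative change in predicted time between successive fits."""
    worst = 0.0
    for (_, (a1, b1)), (_, (a2, b2)) in zip(estimates, estimates[1:]):
        old = a1 * run_len + b1 * file_size
        new = a2 * run_len + b2 * file_size
        if new != 0:
            worst = max(worst, abs(new - old) / abs(new))
    return worst
```

If max_prediction_drift stays well under the 10% threshold mentioned below
for a few typical videos, a daily or weekly batch update is probably enough.
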
On Thu, Nov 28, 2013 at 10:50 AM, Nitin Borwankar wrote:
> Hi Mark,
>
> It may not be worth it to have a real-time estimate of the coefficients A,
> B if the variance is very small.
>
> In other words, if your collection of past videos covers most of the
> different kinds of videos you are likely to encounter then estimates of A,
> B are likely to be pretty robust and not change much with future new
> samples. So using older A,B is not likely to throw your conversion time
> predictions off by much.
>
> So the next sample that comes along is not likely to change the values of
> A, B much, and you might as well update much less frequently - daily,
> weekly, whatever - via cron or batch updates.
>
> How does one determine this - here's a "back of the envelope", "seat of the
> pants" experiment.
>
> So first, after calculating A, B using all my video conversion times I
> would do a second series of calculations.
>
> Here I would start with, say, the first 20 videos, or some number around
> 30-50% of your videos, and calculate A, B. Then keep adding the next 5%
> and repeat the calculation of A, B.
>
> Do this until you use up all the samples. But at the last step just add one
> video at a time for the last 10 videos while doing the calcs.
>
> Now look at A, B for each calculation. Do they settle down to be close to
> a "mean" A and "mean" B? What is the variance around the mean A, B? If
> this is small or very small, then re-computing every time is "really cool
> and all" but not worth it computationally.
>
> What is meant by "small" here? Well, take two successive estimates of A,
> B. Do a prediction using A1, B1 and then A2, B2. How much are you off by
> if you use the older estimate? If A, B don't vary much then your
> prediction won't vary much and you could use a stale estimate without
> noticeable impact on your prediction. Noticeable = say, off by more than
> 10% in the prediction.
>
> Then just update A,B every day or week.
>
> Bottom line: before you do a "real time" update of parameters, do a "back
> of the envelope" experiment to see if it's worth the complexity and
> point of failure it adds.
>
> Happy to chat offlist and/or offline if you want - am nborwankar on the
> google email system.
>
> Nitin
>
>
>
> ------------------------------------------------------------------
> Nitin Borwankar
> nborwankar@gmail.com
>
>
> On Wed, Nov 27, 2013 at 1:13 PM, Mark Hahn wrote:
>
> > I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
> > problem here. Consider it a holiday mind exercise while avoiding
> > relatives.
> >
> > I send customer-uploaded videos to Amazon Elastic Transcoder to generate
> > a video for html5 consumption. It takes a few seconds up to tens of
> > minutes to convert. I have no way to track progress so I estimate the
> > time to complete and show a fake progress indicator. I have been using
> > the run-time of the video and this is not working well at all. High
> > bit-rate (big file) videos fare the worst.
> >
> > I'm guessing there are two main parameters to estimate the conversion
> > time, the file size and run-time. The file size is a good estimate of
> > input processing and run-time is a good estimate of output processing.
> > Amazon has been pretty consistent in their conversion times in the
> > short run.
> >
> > I have tons of data in my couchdb from previous conversions. I want to
> > do regression analysis of these past runs to calculate parameters for
> > estimation. I know the file-size, run-time, and conversion time for
> > each.
> >
> > I will use runLen * A + fileSize * B as the estimation formula. A and B
> > will be calculated by solving runLen * A + fileSize * B = convTime from
> > the samples. It would be nice to use a map-reduce to always have the
> > latest estimate of A and B, if possible.
> >
> > My first thought would be to just find the average for each of the
> > three input vars and solve for A and B using these averages. However
> > I'm pretty sure this would yield the wrong result because each set of
> > three values needs to be used independently (not sure).
> >
> > So I would like to have each map take one conversion sample and do the
> > regression in the reduce. Can someone give me pointers on how to do
> > this?
> >
>
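
One possible way to put the regression in the reduce, sketched in Python
rather than as a CouchDB view (a real view would be written in JavaScript).
Averaging the three inputs gives only one equation for the two unknowns, so
the sketch keeps running sums of products instead: the map step emits them
per sample, the reduce adds them element by element, and A, B come from the
2x2 normal equations of the no-intercept model. All function names here are
hypothetical.

```python
from functools import reduce


def map_sample(run_len, file_size, conv_time):
    """Emit the per-sample terms whose totals determine A and B."""
    return (
        run_len * run_len,
        run_len * file_size,
        file_size * file_size,
        run_len * conv_time,
        file_size * conv_time,
        1,  # sample count, handy for sanity checks
    )


def reduce_sums(left, right):
    """Element-wise sum; associative, so partial results can be re-reduced."""
    return tuple(l + r for l, r in zip(left, right))


def solve_ab(sums):
    """Solve the normal equations for conv_time ~ A*run_len + B*file_size."""
    sxx, sxy, syy, sxt, syt, _count = sums
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("run_len and file_size are collinear; cannot fit A, B")
    a = (sxt * syy - syt * sxy) / det
    b = (syt * sxx - sxt * sxy) / det
    return a, b


def fit_from_samples(samples):
    """Run map, then reduce, then solve, over (run_len, file_size, conv_time)."""
    emitted = [map_sample(*s) for s in samples]
    return solve_ab(reduce(reduce_sums, emitted))
```

Because the reduce is a plain sum, reducing chunks separately and then
combining the partial sums gives the same totals as one pass over all the
samples, and only the final solve_ab step needs to run on the client.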