I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
problem here. Consider it a holiday mind exercise while avoiding relatives.
I send customer-uploaded videos to Amazon Elastic Transcoder to generate a
video for HTML5 consumption. Conversion takes anywhere from a few seconds
to tens of minutes. I have no way to track progress, so I estimate the
time to complete and show a fake progress indicator. I have been estimating
from the video's run-time alone, and this is not working well at all; high
bit-rate (big file) videos fare the worst.
I'm guessing there are two main parameters for estimating the conversion
time: the file size and the run-time. The file size is a good predictor of
input processing and the run-time is a good predictor of output processing.
Amazon has been pretty consistent in their conversion times in the short run.
I have tons of data in my CouchDB from previous conversions. I want to do
regression analysis of these past runs to calculate parameters for
estimation. I know the file size, run-time, and conversion time for each.
I will use runLen * A + fileSize * B as the estimation formula. A and B
will be fit by solving runLen * A + fileSize * B = convTime over the
samples (least squares). It would be nice to use a map-reduce to always
have the latest estimates of A and B, if possible.
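To make my question concrete, here's a sketch of the fit I have in mind (variable names are mine, not from any library): a least-squares fit of A and B with no intercept reduces to solving a 2x2 system of "normal equations" built from five running sums, which is exactly the kind of state a reduce can accumulate.

```javascript
// Least-squares fit of convTime ~= A*runLen + B*fileSize (no intercept).
// The five sums below are the only state needed, so they can be
// accumulated incrementally instead of storing every sample.
function fitAB(samples) {
  var s = { rr: 0, ff: 0, rf: 0, rt: 0, ft: 0 };
  samples.forEach(function (x) {
    s.rr += x.runLen * x.runLen;      // sum of runLen^2
    s.ff += x.fileSize * x.fileSize;  // sum of fileSize^2
    s.rf += x.runLen * x.fileSize;    // cross term
    s.rt += x.runLen * x.convTime;
    s.ft += x.fileSize * x.convTime;
  });
  // Normal equations:
  //   A*rr + B*rf = rt
  //   A*rf + B*ff = ft
  // solved by Cramer's rule (det is 0 only if all samples are collinear).
  var det = s.rr * s.ff - s.rf * s.rf;
  return {
    A: (s.rt * s.ff - s.ft * s.rf) / det,
    B: (s.ft * s.rr - s.rt * s.rf) / det
  };
}
```

Is this the right fit to be doing, or am I missing something about needing an intercept term?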
My first thought was to just take the average of each of the three input
variables and solve for A and B from those averages. However, I'm pretty
sure this would give the wrong result, because averaging collapses
everything into a single equation with two unknowns; each
(runLen, fileSize, convTime) sample needs to contribute its own equation
to the fit (not sure).
So I would like each map to take one conversion sample and do the
regression in the reduce. Can someone give me pointers on how to do this?
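In case it helps to see what I've been toying with: one pattern I think might work (field names like doc.runLen are guesses against my own schema) is to have the map emit the five per-document products and have the reduce sum them element-wise, which should be safe under CouchDB's rereduce since sums of sums are still sums. A and B would then come from solving the 2x2 normal equations on the client from the single reduced row.

```javascript
// CouchDB view sketch. The map emits the five products for one sample;
// the reduce sums arrays element-wise. Because addition is associative,
// the same function works whether `values` holds raw map rows
// (rereduce == false) or partial sums (rereduce == true).
function map(doc) {
  if (doc.runLen && doc.fileSize && doc.convTime) {
    emit(null, [
      doc.runLen * doc.runLen,      // rr
      doc.fileSize * doc.fileSize,  // ff
      doc.runLen * doc.fileSize,    // rf
      doc.runLen * doc.convTime,    // rt
      doc.fileSize * doc.convTime   // ft
    ]);
  }
}

function reduce(keys, values, rereduce) {
  var out = [0, 0, 0, 0, 0];
  values.forEach(function (v) {
    for (var i = 0; i < 5; i++) out[i] += v[i];
  });
  return out;
}
```

Would this be the idiomatic way to keep the sums up to date as new conversion docs arrive, or is there a better-known approach?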