I'm not an expert on statistics (and I'm lazy) so I thought I'd pose my
problem here. Consider it a holiday mind exercise while avoiding relatives.
I send customer-uploaded videos to Amazon Elastic Transcoder to generate a
video for HTML5 consumption. Conversion takes anywhere from a few seconds
to tens of minutes. I have no way to track progress, so I estimate the
time to complete and show a fake progress indicator. So far I've based the
estimate on the runtime of the video alone, and this is not working well
at all. High-bitrate (big file) videos fare the worst.
I'm guessing there are two main parameters for estimating the conversion
time: the file size and the runtime. The file size is a good proxy for
input processing and the runtime is a good proxy for output processing.
Amazon has been pretty consistent in its conversion times in the short run.
I have tons of data in my couchdb from previous conversions. I want to do
regression analysis of these past runs to calculate parameters for
estimation. I know the filesize, runtime, and conversion time for each.
I will use runLen * A + fileSize * B as the estimation formula. A and B
will be calculated by fitting runLen * A + fileSize * B = convTime against
the samples. It would be nice to use a map-reduce to always have the
latest estimates of A and B, if possible.
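To make the fit concrete, here is a minimal sketch of the least-squares solution I have in mind (ordinary least squares with no intercept; the field names runLen, fileSize, and convTime are just my labels). It accumulates the five cross-product sums and solves the 2x2 normal equations:

```javascript
// Fit convTime ≈ A*runLen + B*fileSize by least squares (no intercept).
// Sketch only: assumes the samples aren't collinear (nonzero determinant).
function fitConversionModel(samples) {
  var Srr = 0, Sff = 0, Srf = 0, Srt = 0, Sft = 0;
  samples.forEach(function (s) {
    Srr += s.runLen * s.runLen;       // sum of runLen^2
    Sff += s.fileSize * s.fileSize;   // sum of fileSize^2
    Srf += s.runLen * s.fileSize;     // sum of runLen*fileSize
    Srt += s.runLen * s.convTime;     // sum of runLen*convTime
    Sft += s.fileSize * s.convTime;   // sum of fileSize*convTime
  });
  // Normal equations:
  //   A*Srr + B*Srf = Srt
  //   A*Srf + B*Sff = Sft
  var det = Srr * Sff - Srf * Srf;
  return {
    A: (Srt * Sff - Sft * Srf) / det,
    B: (Sft * Srr - Srt * Srf) / det
  };
}
```

The point is that only those five running sums are needed, not the raw samples, which is what makes an incremental map-reduce version possible.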
My first thought was to just find the average of each of the three input
vars and solve for A and B using those averages. However, I'm pretty sure
this would yield the wrong result: averaging collapses all the data into
a single equation in two unknowns, whereas each sample needs to contribute
its own equation to the fit.
So I would like to have each map take one conversion sample and do the
regression in the reduce. Can someone give me pointers on how to do this?
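Something like the following is what I'm imagining for the CouchDB view (a sketch; the doc field names are assumptions from my schema, and emit is the global CouchDB provides to map functions). The reduce only sums per-document cross products, so it should stay correct under rereduce since addition is associative:

```javascript
// Map: emit the per-document regression sums for one conversion sample.
function map(doc) {
  if (doc.runLen && doc.fileSize && doc.convTime) {
    emit(null, [
      doc.runLen * doc.runLen,       // r^2
      doc.fileSize * doc.fileSize,   // f^2
      doc.runLen * doc.fileSize,     // r*f
      doc.runLen * doc.convTime,     // r*t
      doc.fileSize * doc.convTime    // f*t
    ]);
  }
}

// Reduce: element-wise sum of the five-component vectors.
// Works the same whether values are raw emits or partial sums (rereduce).
function reduce(keys, values, rereduce) {
  var sums = [0, 0, 0, 0, 0];
  values.forEach(function (v) {
    for (var i = 0; i < 5; i++) sums[i] += v[i];
  });
  return sums;
}
```

Querying the view with reduce=true would then hand back the five sums, and the 2x2 normal equations could be solved client-side from them, so the latest A and B are always one cheap view query away.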
