hadoop-common-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Reusing jobs
Date Fri, 18 Apr 2008 01:00:08 GMT
Is it possible to execute a job more than once?

I use MapReduce when adding a new instance to a hierarchical cluster 
tree. The job finds the least distant node and inserts the new instance 
as a sibling of that node.

As far as I know, it is in the very nature of this algorithm that one 
inserts one instance at a time; that is how the second dimension is 
created that makes it better than a vector cluster. It would be possible 
to map all permutations of instances and skip the reduction, but that 
would result in many more calculations than iteratively training the 
tree, as the latter only requires testing against the instances already 
inserted into the tree.
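
Locally, the insertion step is roughly the following (just a sketch; 
Node, Instance, distanceTo and the other names stand in for my actual 
classes):

// Hypothetical local insertion step: scan the instances already in the
// tree, find the least distant one, and attach the new instance as a
// sibling of that node.
Node insert(Node root, Instance candidate) {
  Node nearest = null;
  double best = Double.MAX_VALUE;
  for (Node node : root.leaves()) { // instances already inserted
    double d = candidate.distanceTo(node.instance());
    if (d < best) {
      best = d;
      nearest = node;
    }
  }
  // sibling rather than child: this is what adds the second dimension
  return nearest.addSibling(new Node(candidate));
}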

Iteratively training this tree using Hadoop means executing one job per 
instance. The job measures the distance to every instance in a file, and 
I append the new instance to that file once it has been inserted into 
the tree.
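
The map side of that job is essentially this (again a sketch, written 
against the org.apache.hadoop.mapred API; Instance, parse and distanceTo 
are placeholders for my real code):

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (distance, instance) for every instance already in the tree
// file. With a single reducer the keys arrive sorted, so the first
// record of the job output is the least distant node.
public class DistanceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, FloatWritable, Text> {

  private Instance newInstance; // Instance is a placeholder class

  public void configure(JobConf job) {
    // the instance being inserted, passed in via the job configuration
    newInstance = Instance.parse(job.get("insert.instance"));
  }

  public void map(LongWritable key, Text value,
      OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
    Instance existing = Instance.parse(value.toString());
    float distance = (float) newInstance.distanceTo(existing);
    output.collect(new FloatWritable(distance), value);
  }
}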

All of the above is very inefficient, especially with a young tree that 
could be trained in nanoseconds locally. So I train locally until it 
takes 20 seconds to insert an instance, and only then switch to Hadoop.

But really, this is all Hadoop framework overhead. I'm not quite sure of 
everything it does when I execute a job, but it seems like quite a lot. 
And all I'm doing is executing a couple of identical jobs over and over 
again using new data.
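
The driver ends up looking something like this (a sketch under the same 
assumptions as above; only the insert.instance parameter changes between 
runs, yet each JobClient.runJob call pays the full submission cost):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class InsertDriver {
  public static void main(String[] args) throws Exception {
    // one identical job per instance; only "insert.instance" differs
    for (Instance instance : Instance.readAll(args[0])) {
      JobConf conf = new JobConf(DistanceMapper.class);
      conf.setJobName("least-distant-node");
      conf.setMapperClass(DistanceMapper.class);
      conf.setOutputKeyClass(FloatWritable.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(1); // keys come out sorted by distance
      FileInputFormat.setInputPaths(conf, new Path("instances"));
      FileOutputFormat.setOutputPath(conf, new Path("nearest-" + instance.id()));
      conf.set("insert.instance", instance.serialize());
      JobClient.runJob(conf); // blocks; all the job overhead is paid here
      // then: read the first record of the output, insert the instance
      // as a sibling, and append it to the "instances" file
    }
  }
}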

It would be very nice if it just took a few milliseconds to do that.


       karl
