Hi Ted,
> If you are running SGD on a single node, just open the HDFS files directly.
> You won't have significant benefit to locality unless the files are
> relatively small.
Good point. However, whether it applies may depend on the network
topology of the cluster:
Reasonably fast implementations of SGD are bandwidth-bound even when
reading from local disk on typical machines. In some clusters, the
rack-local bandwidth may be an order of magnitude higher than the
bandwidth you get when reading from a node in another rack. So I
believe there is value in data locality for SGD.
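To put rough numbers on this, here is a back-of-the-envelope sketch.
The dataset size and both bandwidth figures are assumptions picked
purely for illustration, not measurements from any cluster, and the
helper name pass_time_seconds is made up for this example. It only
models the case where I/O is the sole bottleneck:

```python
def pass_time_seconds(dataset_gb, bandwidth_mb_per_s):
    """Time to stream the dataset once if I/O is the only bottleneck."""
    return dataset_gb * 1024 / bandwidth_mb_per_s


# Hypothetical figures, chosen only for illustration.
DATASET_GB = 100        # assumed training set size
RACK_LOCAL_MB_S = 100   # assumed rack-local read bandwidth
OFF_RACK_MB_S = 10      # assumed cross-rack bandwidth, an order of magnitude lower

rack_local = pass_time_seconds(DATASET_GB, RACK_LOCAL_MB_S)
off_rack = pass_time_seconds(DATASET_GB, OFF_RACK_MB_S)

# For a bandwidth-bound learner, the slowdown per pass equals the bandwidth ratio.
slowdown = off_rack / rack_local
print(f"rack-local: {rack_local:.0f} s, off-rack: {off_rack:.0f} s, "
      f"slowdown: {slowdown:.0f}x")
```

If the learner is bandwidth-bound, the time per pass scales directly
with the bandwidth ratio, whatever the actual figures turn out to be.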
Your point is of course universally true for sequential algorithms
that are CPU-bound, such as batch learning schemes.
Take care,
Markus