On Oct 18, 2007, at 3:30 PM, Lance Amundsen wrote:
> Thx, I'll give that a try. Seems to me a method to tell Hadoop to
> split a file every "n" key/value pairs would be logical. Or maybe a
> createSplitBoundary when appending key/value records?
The problem is that the split generator doesn't want to read the data
files, so it picks byte ranges as a reasonable proxy. I know of some
applications with custom input formats that use md5 ranges as input
splits and read multiple files for each split. You could equivalently
split on rows, as long as you had an index.
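
To illustrate the row-based approach: the sketch below (my own example,
not Hadoop API) assumes you already have an index mapping each row
number to its starting byte offset. Given that, you can derive splits
that fall exactly on record boundaries every n rows, and a custom
InputFormat's getSplits() could hand those byte ranges back to the
framework just as the default byte-range splitter does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turn a row index (row number -> byte offset)
// into record-aligned byte-range splits of `rowsPerSplit` rows each.
public class RowRangeSplitter {

    // A split expressed as a [start, end) byte range.
    public static final class ByteRange {
        public final long start;
        public final long end;
        ByteRange(long start, long end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + ", " + end + ")"; }
    }

    // index[i] is the byte offset where row i begins; fileLength is
    // the total file size, which closes the final split.
    public static List<ByteRange> split(long[] index, long fileLength, int rowsPerSplit) {
        List<ByteRange> splits = new ArrayList<>();
        for (int i = 0; i < index.length; i += rowsPerSplit) {
            long start = index[i];
            int next = i + rowsPerSplit;
            long end = next < index.length ? index[next] : fileLength;
            splits.add(new ByteRange(start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Toy index: 6 rows starting at these offsets in a 600-byte file.
        long[] index = {0, 100, 210, 330, 440, 520};
        for (ByteRange r : split(index, 600, 2)) {
            System.out.println(r);  // prints [0, 210), [210, 440), [440, 600)
        }
    }
}
```

The point is that each range starts exactly at a row boundary, so a
record reader never has to resynchronize mid-record the way a raw
byte-range split does.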
-- Owen