hama-dev mailing list archives

From "Samuel Guo" <guosi...@gmail.com>
Subject Re: blocking_mapred() speed
Date Thu, 11 Dec 2008 08:30:39 GMT
On Thu, Dec 11, 2008 at 2:36 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:

> If we remove 'reduce phase', I guess we can reduce the disk I/O operations.


> In the map, read { Constants.BLOCK_STARTROW, Constants.BLOCK_ENDROW,
> Constants.BLOCK_STARTCOLUMN, Constants.BLOCK_ENDCOLUMN } instead of {
> Constants.COLUMN }, and write directly blocks.

Two methods to be considered:
1) We need an InputFormat that partitions the matrix table along the
row boundaries of the blocks.
    This must be done carefully so that a single block is never divided
across two or more mappers.

2) As RandomMatrixMap does, we just tell the mappers the row/column
boundaries of the blocks of the matrix table.
    Each mapper then scans its own portion of the table.
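
A minimal sketch of the split computation that method 1) relies on, assuming
rows are numbered from 0 and every block spans a fixed number of rows. The
class BlockAlignedSplits, the RowRange type, and the blocksPerSplit parameter
are hypothetical names for illustration, not part of Hama or Hadoop. In a real
job this logic would presumably sit inside the InputFormat's split
computation, which could also attach the hosts serving each row range.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockAlignedSplits {

    /** Half-open row range [startRow, endRow) handed to one mapper (hypothetical type). */
    public static final class RowRange {
        public final long startRow;
        public final long endRow;

        RowRange(long startRow, long endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }
    }

    /**
     * Cuts totalRows into ranges that each cover whole blocks.
     * Every range starts on a multiple of blockRows, so no block
     * is shared between two mappers.
     */
    public static List<RowRange> computeSplits(long totalRows, long blockRows, int blocksPerSplit) {
        if (totalRows < 0 || blockRows <= 0 || blocksPerSplit <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long rowsPerSplit = blockRows * blocksPerSplit;
        List<RowRange> splits = new ArrayList<>();
        for (long start = 0; start < totalRows; start += rowsPerSplit) {
            // The last range may be shorter when totalRows is not a multiple of rowsPerSplit.
            long end = Math.min(start + rowsPerSplit, totalRows);
            splits.add(new RowRange(start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 10 rows, blocks of 3 rows, 2 blocks per mapper.
        for (RowRange r : computeSplits(10, 3, 2)) {
            System.out.println("[" + r.startRow + ", " + r.endRow + ")");
        }
    }
}
```

With 10 rows, 3 rows per block, and 2 blocks per split, this yields the ranges
[0, 6) and [6, 10); both start on a block boundary.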

I think 1) may be better than 2).
An InputFormat can look up where each row range of the table is stored, so
MR knows how to move the computation close to the data.
In 2), if we do it like RandomMatrixMap, we may lose some of the table's
locality information, so the network transfer overhead may be higher.

These are just my guesses and thoughts.

> What do you think?
> --
> Best Regards, Edward J. Yoon @ NHN, corp.
> edwardyoon@apache.org
> http://blog.udanax.org
