hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dhruba Borthakur <dhr...@gmail.com>
Subject Re: Optimizing Hadoop MR with File Based File Systems
Date Thu, 07 May 2009 01:07:16 GMT
Hi Jonathan,

Exposing block locations is one part of the story. The other part is to
enable hadoop slave software to figure out which location teh tasktracker it
is runnign on. For example, you would have to set
topology.node.switch.mapping.impl to point to a script that descries the
rack/node location of a tasktracker.


On Wed, May 6, 2009 at 10:26 AM, Jonathan Seidman <
jonathan.seidman@opendatagroup.com> wrote:

> We've created an implementation of FileSystem which allows us to use Sector
> (http://sector.sourceforge.net/) as the backing store for Hadoop. This
> implementation is functionally complete, and we can now run Hadoop
> MapReduce
> jobs against data  stored in Sector. We're now looking at how to optimize
> this interface, since the performance suffers considerably compared to MR
> processing run against HDFS. Sector is written in C++, so there's some
> unavoidable overhead from JNI. One big difference between Sector and HDFS
> is
> that Sector is file-based and not block-based - files are stored intact on
> the native file system. We suspect this may have something to do with the
> poor performance, since Hadoop seems to be optimized for a block-based file
> system.
> Based on the assumption that properly supporting data locality will have a
> large impact on performance, we've implemented getFileBlockLocations().
> Since we don't have blocks our implementation basically creates a
> BlockLocation containing an array of nodes hosting the file. The following
> is what our method looks like:
> public BlockLocation[] getFileBlockLocations( final Path path )
>        throws FileNotFoundException, IOException
>    {
>        SNode stat = jniClient.sectorStat( path.toString() );
>        String[] locs = stat.getLocations();
>        if ( locs == null ) {
>            return null;
>        }
>        BlockLocation[] blocs = new BlockLocation[1];
>        blocs[0] = new BlockLocation(null, locs, 0L, stat.getSize() );
>        return blocs;
>    }
> In the code above, we are using file size, stat.getSize(), as the length
> since a block is a file.  This means that the offset is always 0L. This
> method seems to have improved performance somewhat, but we're wondering if
> there's a modification we can make that will better help Hadoop locate
> data.
> If we find a way to index our files to make them appear as blocks to
> Hadoop,
> would that provide a performance benefit? Any suggestions are appreciated.
> We're currently testing with Hadoop 0.18.3.
> Thanks.
> --
> Jonathan Seidman
> Open Data Group

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message