hadoop-mapreduce-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: [jira] Created: (MAPREDUCE-1973) Optimize input split creation
Date Fri, 30 Jul 2010 21:36:11 GMT
On Fri, Jul 30, 2010 at 2:33 PM, Burkhardt, Paul <Paul_Burkhardt@sra.com> wrote:

> I ran "ant test" on CDH3B2 (hadoop-0.20.2+320) and it fails prior to and
> after patching, so I don't think it is the patch. See the build/test
> directory and review the TEST output files. In my environment, the
> TestFileAppend4 test fails.
>

Hi Paul,

Would you mind sending me the TEST output for TFA4 off-list? I spent
significant time working on these tests (and even have a hudson target
internally that runs it 48 times a day) and am surprised it fails in your
environment.

Thanks
-Todd



>
> Paul
>
> -----Original Message-----
> From: Ted Yu [mailto:yuzhihong@gmail.com]
> Sent: Wednesday, July 28, 2010 7:06 PM
> To: mapreduce-dev@hadoop.apache.org
> Subject: Re: [jira] Created: (MAPREDUCE-1973) Optimize input split creation
>
> I applied the patch on cdh3b2.
> ant test gave me:
>   [junit] Running org.apache.hadoop.mrunit.types.TestPair
>   [junit] Tests run: 20, Failures: 0, Errors: 0, Time elapsed: 0.041 sec
>
> BUILD FAILED
> /Users/tyu/hadoop-0.20.2+320/build.xml:839: Tests failed!
>
> How can I find out which tests actually failed?
>
> Thanks
>
> On Tue, Jul 27, 2010 at 4:15 PM, Paul Burkhardt (JIRA) <jira@apache.org> wrote:
>
> > Optimize input split creation
> > -----------------------------
> >
> >                 Key: MAPREDUCE-1973
> >                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
> >             Project: Hadoop Map/Reduce
> >          Issue Type: Improvement
> >    Affects Versions: 0.20.2, 0.20.1
> >         Environment: Intel Nehalem cluster running Red Hat.
> >            Reporter: Paul Burkhardt
> >            Priority: Minor
> >
> >
> > The input split returns the locations that host the file blocks in
> > the split. The locations are determined by the getBlockLocations
> > method of the filesystem client which requires a remote connection
> > to the filesystem (i.e. HDFS). The remote connection is made for
> > each file in the entire input split. For jobs with many input files
> > the network connections dominate the cost of writing the input
> > split file.
> >
> > A job requests a listing of the input files from the remote
> > filesystem and creates a FileStatus object as a handle for each
> > file in the listing. The FileStatus object can be imbued with the
> > necessary host information on the remote end and passed to the
> > client-side in the bulk return of the listing request. A getHosts
> > method of the FileStatus would then return the locations for the
> > blocks comprising that file and eliminate the need for another trip
> > to the remote filesystem.
> >
> > The INodeFile maintains the blocks for a file and is an obvious
> > choice to be the originator for the locations of that file. It is
> > also available to the FSDirectory which first creates the listing
> > of FileStatus objects. We propose that the block locations be
> > generated by the INodeFile to instantiate the FileStatus object
> > during the getListing request.
> >
> > Our tests demonstrated a factor of 2000 speedup for approximately
> > 60,000 input files.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
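The trade-off the JIRA describes (one getBlockLocations round trip per input file versus a single bulk listing that already carries host information) can be sketched as a toy model. Every class and method name below is a hypothetical stand-in, not the real HDFS API; the counter only illustrates why the per-file pattern dominates the cost at the 60,000-file scale quoted above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the RPC pattern in MAPREDUCE-1973. All names are
// hypothetical stand-ins; only the call counts matter.
public class SplitLocationDemo {

    /** Stand-in for the NameNode; counts "remote" calls. */
    static class FakeNameNode {
        int rpcCalls = 0;
        final Map<String, List<String>> blockHosts = new HashMap<>();

        /** Current path: one round trip per input file. */
        List<String> getBlockLocations(String path) {
            rpcCalls++;
            return blockHosts.get(path);
        }

        /** Proposed path: one round trip returns the whole listing
         *  with locations attached (cf. the getHosts idea above). */
        Map<String, List<String>> getListingWithLocations() {
            rpcCalls++;
            return new HashMap<>(blockHosts);
        }
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        int numFiles = 60000; // roughly the job size from the JIRA
        for (int i = 0; i < numFiles; i++) {
            nn.blockHosts.put("part-" + i, Arrays.asList("host" + (i % 40)));
        }

        // Per-file lookups: one RPC for every input file.
        for (String path : new ArrayList<>(nn.blockHosts.keySet())) {
            nn.getBlockLocations(path);
        }
        int perFileCalls = nn.rpcCalls;

        // Bulk listing: a single RPC for the whole job.
        nn.rpcCalls = 0;
        Map<String, List<String>> listing = nn.getListingWithLocations();
        int bulkCalls = nn.rpcCalls;

        System.out.println(perFileCalls + " RPCs vs " + bulkCalls + " RPC");
        assert listing.size() == numFiles;
    }
}
```

In this model the per-file loop issues 60,000 calls while the bulk listing issues one, which is the shape of the speedup the reporter measured; the real patch attaches the locations inside the namenode's getListing response rather than mocking anything.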



-- 
Todd Lipcon
Software Engineer, Cloudera
