hadoop-common-user mailing list archives

From "Parker, Matthew - IS" <Matthew.Par...@exelisinc.com>
Subject Running Terasort on Hadoop and Local File System
Date Wed, 06 Mar 2013 14:33:50 GMT

I'm trying to simulate running Hadoop on Lustre by configuring it to use the local file system
on a single Cloudera VM (CDH3u4).

I can generate the data just fine, but when running the sorting portion of the program, I
get an error about not being able to find the _partition.lst file. It exists in the generated
data directory.

Perusing the TeraSort code, I see that in the run method the Path to _partition.lst is
created with the input directory as its parent (lines marked with >>).

  public int run(String[] args) throws Exception {
      LOG.info("starting");
      JobConf job = (JobConf) getConf();
>>    Path inputDir = new Path(args[0]);
>>    inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
>>    Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
      URI partitionUri = new URI(partitionFile.toString() +
                                 "#" + TeraInputFormat.PARTITION_FILENAME);
      TeraInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setJobName("TeraSort");
      job.setJarByClass(TeraSort.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormat(TeraInputFormat.class);
      job.setOutputFormat(TeraOutputFormat.class);
      job.setPartitionerClass(TotalOrderPartitioner.class);
      TeraInputFormat.writePartitionFile(job, partitionFile);
      DistributedCache.addCacheFile(partitionUri, job);
      DistributedCache.createSymlink(job);
      job.setInt("dfs.replication", 1);
      TeraOutputFormat.setFinalSync(job, true);
      JobClient.runJob(job);
      LOG.info("done");
      return 0;
  }
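
A note on partitionUri in the run method: the text after "#" is a URI fragment, which
DistributedCache uses as the symlink name once createSymlink is called. The following is a
minimal sketch using only java.net.URI (no Hadoop classes) that shows how such a URI splits
into a path and a fragment. The class name, the helper withLinkName, and the /tmp directory
are hypothetical and only serve the illustration; URI.create is used instead of new URI to
avoid the checked exception.

```java
import java.net.URI;

public class PartitionUriSketch {
    // Same literal as the file name mentioned in the message.
    static final String PARTITION_FILENAME = "_partition.lst";

    // Mirrors the string concatenation in run(): "<qualified path>#<link name>".
    // Hypothetical helper, not part of TeraSort.
    static URI withLinkName(String qualifiedPartitionPath) {
        return URI.create(qualifiedPartitionPath + "#" + PARTITION_FILENAME);
    }

    public static void main(String[] args) {
        // Hypothetical input directory, for illustration only.
        URI uri = withLinkName("file:/tmp/terasort-input/" + PARTITION_FILENAME);
        // The path part still points into the input directory ...
        System.out.println("path     = " + uri.getPath());
        // ... while the fragment carries only the bare file name.
        System.out.println("fragment = " + uri.getFragment());
    }
}
```

The fragment is the bare name that a task would expect to find in its own working
directory, assuming the symlink actually gets created there.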

But in the configure method, the Path is created from the bare file name alone, without the
parent directory.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);
>>      Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }
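
To illustrate why the two Path constructions can point at different files, here is a minimal
sketch that uses java.nio.file as a stand-in for Hadoop's Path and local FileSystem. It
assumes that a relative path on the local file system is resolved against the current
working directory. The class name, both helper methods, and the /tmp directories are
hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativePartitionPathSketch {
    static final String PARTITION_FILENAME = "_partition.lst";

    // Analogue of configure(): a bare file name, resolved against the working directory.
    static Path fromWorkingDir(Path workingDir) {
        return workingDir.resolve(PARTITION_FILENAME).normalize();
    }

    // Analogue of run(): the file name resolved against the input directory.
    static Path fromInputDir(Path inputDir) {
        return inputDir.resolve(PARTITION_FILENAME).normalize();
    }

    public static void main(String[] args) {
        Path workingDir = Paths.get("/tmp/task-work-dir");  // hypothetical
        Path inputDir = Paths.get("/tmp/terasort-input");   // hypothetical

        Path read = fromWorkingDir(workingDir);
        Path written = fromInputDir(inputDir);

        System.out.println("configure() would read: " + read);
        System.out.println("run() wrote:            " + written);
        // The two only coincide if the working directory is the input directory
        // or if something (such as a symlink) bridges the gap.
        System.out.println("same file name, same location? " + read.equals(written));
    }
}
```

Unless a link named _partition.lst exists in the working directory, the read side looks in
a different place from where the file was written.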

I modified the code as follows, and now the sorting portion of the TeraSort test works using
the local file system. I think the original configure code is a bug.

    public void configure(JobConf job) {

      try {
        FileSystem fs = FileSystem.getLocal(job);

>>      Path[] inputPaths = TeraInputFormat.getInputPaths(job);
>>      Path partFile = new Path(inputPaths[0], TeraInputFormat.PARTITION_FILENAME);

        splitPoints = readPartitions(fs, partFile, job);
        trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
      } catch (IOException ie) {
        throw new IllegalArgumentException("can't read paritions file", ie);
      }

    }

