hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: reading zip files
Date Thu, 11 May 2006 18:50:54 GMT
Vijay Murthi wrote:
> I have just started looking at the Hadoop source code. How can I use each
> file as a distinct split? My data is already evenly distributed across
> these compressed files.

Implement your own InputFormat.

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html

In particular, your getSplits implementation should return a single 
split per input file, ignoring the numSplits parameter.

You can probably subclass InputFormatBase, and have your getSplits 
method simply call listPaths() and then construct and return a single 
split per path returned.
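A minimal sketch of such a getSplits, written against the mapred API of that era (the exact signatures of listPaths, FileSplit, and FileSystem.getLength are from memory and may differ between versions):

    public class ZipFileInputFormat extends InputFormatBase {

      public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
        throws IOException {
        Path[] files = listPaths(fs, job);           // every input file
        FileSplit[] splits = new FileSplit[files.length];
        for (int i = 0; i < files.length; i++) {
          // one split spanning the whole file; numSplits is ignored,
          // since a compressed file cannot be split mid-stream
          splits[i] = new FileSplit(files[i], 0, fs.getLength(files[i]));
        }
        return splits;
      }
      ...
    }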

Your RecordReader implementation might then look something like:

   public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                       JobConf job, Reporter reporter)
     throws IOException {

     final BufferedReader in =
       new BufferedReader(new InputStreamReader
         (new GZIPInputStream(fs.open(split.getPath()))));

     return new RecordReader() {
         long position;    // uncompressed characters consumed so far

         public synchronized boolean next(Writable key, Writable value)
           throws IOException {
           String line = in.readLine();
           if (line != null) {
             // assumes LongWritable keys, as TextInputFormat uses
             ((LongWritable)key).set(position);
             ((UTF8)value).set(line);
             position += line.length() + 1;  // +1 for the line terminator
             return true;
           }
           return false;
         }

         public synchronized long getPos() throws IOException {
           return position;
         }

         public synchronized void close() throws IOException {
           in.close();
         }

       };
   }
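The decompress-and-read-lines core of that reader can be tried outside Hadoop with just the JDK. This standalone demo class (hypothetical, not part of Hadoop) round-trips a string through gzip and reads it back line by line, exactly as the RecordReader's in.readLine() loop does:

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class GzipLineDemo {

    // Compress a string with gzip, mirroring what a .gz input file holds.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes("UTF-8"));
        }
        return buf.toByteArray();
    }

    // Decompress and collect lines, as next() does with in.readLine().
    public static List<String> readLines(InputStream raw) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(new GZIPInputStream(raw), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip("first line\nsecond line\n");
        System.out.println(readLines(new ByteArrayInputStream(gz)));
    }
}
```

Note that readLine() strips the terminator, which is why the RecordReader above has to add it back when tracking its position.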

Then include your InputFormat's class file in your job's jar file.
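Wiring it into the job might then look like this (ZipFileInputFormat is a hypothetical name for your class; setInputFormat was the JobConf method for this in that era):

    JobConf job = new JobConf(MyJob.class);
    job.setInputFormat(ZipFileInputFormat.class);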

Doug
