hadoop-common-user mailing list archives

From "Vijay Murthi" <murt...@yahoo-inc.com>
Subject RE: reading zip files
Date Fri, 19 May 2006 17:13:53 GMT
Thanks Doug. That worked spectacularly! I was also able to speed things
up a little more by setting a large buffer size for InputStreamReader.
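
For anyone following along, the buffer change could look something like
the sketch below. It is only an illustration: the 1 MB size and the
class and method names are made up, and it assumes the buffer is set on
BufferedReader and GZIPInputStream, since InputStreamReader itself has
no buffer-size constructor.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BigBufferGzipLines {
    // Hypothetical buffer size (1 MB); tune it for your own data.
    static final int BUFFER_SIZE = 1 << 20;

    // Wrap a raw gzip stream with enlarged buffers at both the
    // decompression layer and the character-reading layer.
    static BufferedReader open(InputStream raw) throws IOException {
        return new BufferedReader(
            new InputStreamReader(
                new GZIPInputStream(raw, BUFFER_SIZE), StandardCharsets.UTF_8),
            BUFFER_SIZE);
    }

    // Read every line from the gzip stream and return how many there were.
    static int countLines(InputStream raw) throws IOException {
        try (BufferedReader in = open(raw)) {
            int n = 0;
            while (in.readLine() != null) {
                n++;
            }
            return n;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small gzip payload in memory so the sketch runs standalone.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write("a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(
            countLines(new ByteArrayInputStream(bytes.toByteArray())));
    }
}
```

In the RecordReader, the same wrapping would replace the plain
BufferedReader construction around fs.open(split.getPath()).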

-VJ

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Thursday, May 11, 2006 11:51 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: reading zip files
> 
> Vijay Murthi wrote:
> > I have just started looking at Hadoop source code. How can I use each
> > file as a distinct split? My data is already evenly distributed across
> > these compressed files.
> 
> Implement your own InputFormat.
> 
>
> http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html
> 
> In particular, your getSplits implementation should return a single
> split per input file, ignoring the numSplits parameter.
> 
> You can probably subclass InputFormatBase, and have your getSplits
> method simply call listPaths() and then construct and return a single
> split per path returned.
> 
> Your RecordReader implementation might then look something like:
> 
>    public RecordReader getRecordReader(FileSystem fs, FileSplit split,
>                                        JobConf job, Reporter reporter)
>      throws IOException {
> 
>      final BufferedReader in =
>        new BufferedReader(new InputStreamReader
>          (new GZIPInputStream(fs.open(split.getPath()))));
> 
>      return new RecordReader() {
>          long position;
> 
>          public synchronized boolean next(Writable key, Writable value)
>            throws IOException {
>            String line = in.readLine();
>            if (line != null) {
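>              // approximate: counts decoded characters, not the line
>              // terminator or the compressed bytes actually read
>              position += line.length();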
>              ((UTF8)value).set(line);
>              return true;
>            }
>            return false;
>          }
> 
>          public synchronized long getPos() throws IOException {
>            return position;
>          }
> 
>          public synchronized void close() throws IOException {
>            in.close();
>          }
> 
>        };
>    }
> 
> Then include your InputFormat's class file in your job's jar file.
> 
> Doug
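
The one-split-per-file idea from the quoted reply can be sketched
without any Hadoop classes. This is a rough illustration only:
WholeFileSplit and OneSplitPerFile are hypothetical names, java.nio
stands in for the Hadoop FileSystem, and the directory listing stands in
for listPaths(). The point is just that numSplits is accepted but
ignored.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class OneSplitPerFile {
    // Hypothetical stand-in for a split that covers one whole file.
    static final class WholeFileSplit {
        final Path path;
        final long length;

        WholeFileSplit(Path path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    // Return exactly one split per regular file in inputDir.
    // numSplits is kept to mirror the described signature but is ignored.
    static List<WholeFileSplit> getSplits(Path inputDir, int numSplits)
            throws IOException {
        List<WholeFileSplit> splits = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    splits.add(new WholeFileSplit(file, Files.size(file)));
                }
            }
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        // Create two dummy input files so the sketch runs standalone.
        Path dir = Files.createTempDirectory("splits");
        Files.write(dir.resolve("a.gz"), new byte[10]);
        Files.write(dir.resolve("b.gz"), new byte[20]);
        // Asking for a single split still yields one split per file.
        for (WholeFileSplit s : getSplits(dir, 1)) {
            System.out.println(s.path.getFileName() + " " + s.length);
        }
    }
}
```

Because a gzip stream cannot be read from the middle, treating each
compressed file as a single unit of work is what lets the RecordReader
above open the file from the start.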

