hadoop-common-user mailing list archives

From "Vijay Murthi" <murt...@yahoo-inc.com>
Subject RE: reading zip files
Date Thu, 11 May 2006 18:32:55 GMT
Thanks Doug. I have around 500 directories, each containing around 500
gzip files of about 25 MB each (roughly 140 MB uncompressed). An
uncompressed file has around 170,000 lines, and each line is about
0.85 KB on average.
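For reference, gzipped text like this can be streamed line by line with java.util.zip, without unpacking to disk first. A minimal, self-contained sketch (the sample data is made up for illustration):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLines {
    public static void main(String[] args) throws IOException {
        // Create a small gzip file to stand in for one of the inputs.
        File f = File.createTempFile("sample", ".gz");
        Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(f)), "UTF-8");
        w.write("record one\nrecord two\nrecord three\n");
        w.close();

        // Stream it back uncompressed, one line (= one record) at a time.
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f)), "UTF-8"));
        int lines = 0;
        while (r.readLine() != null) {
            lines++;
        }
        r.close();
        f.delete();

        System.out.println("lines=" + lines); // prints "lines=3"
    }
}
```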

I have just started looking at the Hadoop source code. How can I make
each file a distinct split? My data is already evenly distributed
across these compressed files.

I see that Hadoop abstracts file I/O behind Java classes. Which files
should I change so that, inside the map function of my MapClass,
calling value.toString() returns a whole record?
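One way to get each file as its own split is to subclass TextInputFormat and disable splitting. This is only a sketch against the org.apache.hadoop.mapred API of the time, assuming the isSplitable hook has this signature in your version; it needs the Hadoop jars on the classpath and is not a tested, definitive implementation:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: an InputFormat that never splits a file, so each gzipped
// input file becomes exactly one map task's split.
public class WholeFileTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // one file == one split
    }
}
```

You would then register it on the job configuration (e.g. jobConf.setInputFormat(WholeFileTextInputFormat.class)) so the map function receives lines from exactly one file per task.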


Hope this helps,
VJ



> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Thursday, May 11, 2006 11:09 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: reading zip files
> 
> Vijay Murthi wrote:
> > I am trying to process several gigs of zipped text files from a
> > directory. If I unzip them, the size increases at least 4 times and
> > I could potentially run out of disk space.
> >
> > Has anyone tried to read zipped text files directly from the input
> > directory?
> >
> > or anyone tried implementing a zip version of
> > SequenceFileRecordReader.java and FileSplit?
> 
> SequenceFile currently supports per-record compression.  This is
> effective when your input records are fairly large (> a few kB).
> 
> What format are your zipped input files in?  Are there multiple
> records per file?  If so, how big are the records?  A future goal for
> SequenceFile is to support compression across multiple records, to
> make compression effective with small records.  Until then,
> compression of small records is difficult.  The best approach
> currently is to use an InputFormat that does not split files, but
> makes each file into a distinct split.  Then try to divide your data
> into approximately equal-sized files that are each compressed.
> 
> Doug


