hadoop-mapreduce-user mailing list archives

From Robert Dyer <psyb...@gmail.com>
Subject Re: Uncompressed size of Sequence files
Date Wed, 27 Nov 2013 18:34:28 GMT
I should probably mention my attempt to use the 'hadoop' command for this
task fails (this file is fairly large, about 80GB compressed):

$ HADOOP_HEAPSIZE=3000 hadoop fs -text /path/to/file | wc -c
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:300)
    at java.lang.StringCoding.encode(StringCoding.java:344)
    at java.lang.StringCoding.encode(StringCoding.java:387)
    at java.lang.String.getBytes(String.java:956)
    at org.apache.hadoop.fs.FsShell$TextRecordInputStream.read(FsShell.java:391)
    at java.io.InputStream.read(InputStream.java:179)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
    at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:122)
    at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:50)
    at org.apache.hadoop.fs.FsShell$2.process(FsShell.java:427)
    at org.apache.hadoop.fs.FsShell.text(FsShell.java:421)
    at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1597)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1798)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1916)

On Sat, Nov 23, 2013 at 3:14 PM, Robert Dyer <psybers@gmail.com> wrote:

> Is there an easy way to get the uncompressed size of a sequence file that
> is block compressed?  I am using the Snappy compressor.
> I realize I can obviously just decompress them to temporary files to get
> the size, but I would assume there is an easier way.  Perhaps an existing
> tool that my search did not turn up?
> If not, I will have to run a MR job to load each compressed block and read
> the Snappy header to get the size.  I need to do this for a large number of
> files, so I'd prefer a simple CLI tool (sort of like 'hadoop fs -du').
> - Robert
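
One way to avoid both the temp files and the OOM from 'hadoop fs -text' (which converts every record to a String before writing it out) is to stream the file record by record and sum the serialized sizes directly. Below is a minimal, hypothetical sketch against the Hadoop 1.x API the stack trace above suggests; it assumes the keys and values are Writables, and the class name SeqFileRawSize is made up for illustration. SequenceFile.Reader handles the block decompression transparently, so the heap only ever holds one record at a time.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Sketch: print the total uncompressed (serialized) size of all
 * key/value pairs in a SequenceFile, streaming one record at a time.
 */
public class SeqFileRawSize {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Instantiate key/value objects of whatever types the file declares.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      DataOutputBuffer buf = new DataOutputBuffer();
      long total = 0;
      while (reader.next(key, val)) {
        // Re-serialize each record into a reusable buffer and count the bytes.
        buf.reset();
        key.write(buf);
        val.write(buf);
        total += buf.getLength();
      }
      System.out.println(total);
    } finally {
      reader.close();
    }
  }
}
```

Run it with the Hadoop classpath on hand, e.g. `hadoop SeqFileRawSize /path/to/file` after compiling against the cluster's jars. Note this measures the sum of the serialized records, not the SequenceFile's own sync markers and block headers, so it will differ slightly from a byte count of the fully decompressed file.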
