hadoop-common-user mailing list archives

From "Tom White" <...@cloudera.com>
Subject Re: Concatenating PDF files
Date Mon, 05 Jan 2009 11:47:14 GMT

Hi Richard,

Are you running out of memory after many PDFs have been processed by
one mapper, or during the first? The former would suggest that memory
isn't being released; the latter that the task VM doesn't have enough
memory to start with.
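
One way to tell the two cases apart is to log heap usage from inside the
mapper after each PDF is appended. The sketch below is only an
illustration using the standard java.lang.Runtime API; the class name
HeapProbe and the 1 MB stand-in arrays are made up for the example and
are not part of Richard's code or of Hadoop.

```java
public class HeapProbe {
    /** Bytes of heap currently in use by this JVM (total minus free). */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Upper limit the JVM reports that its heap may grow to. */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Builds one log line, e.g. to emit after each PDF has been appended. */
    public static String report(String label) {
        long mb = 1024L * 1024L;
        return label + ": used=" + (usedHeapBytes() / mb)
                + "MB max=" + (maxHeapBytes() / mb) + "MB";
    }

    public static void main(String[] args) {
        System.err.println(report("start"));
        byte[][] held = new byte[8][];
        for (int i = 0; i < held.length; i++) {
            // Hypothetical stand-in for the bytes of one PDF kept in memory.
            held[i] = new byte[1024 * 1024];
            System.err.println(report("after item " + i));
        }
    }
}
```

If "used" climbs steadily from one PDF to the next and never drops, that
points at memory not being released; if "max" is already small on the
first line, the task VM simply started with too little.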

Are you setting the memory available to map tasks via
mapred.child.java.opts? You can check how much memory the processes
are actually using by logging into a machine while the job is running
and running 'top' or 'ps'.
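
As a complement to 'top' and 'ps', the task itself can compare the heap
limit it was asked to use with the one the JVM reports. The following is
a rough sketch, not Hadoop code: the class ChildHeapCheck and its method
are hypothetical, and the opts string is passed in by hand here rather
than read from the job configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChildHeapCheck {
    // Matches -Xmx followed by digits and an optional k/m/g unit suffix.
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([kKmMgG]?)");

    /** Returns the first -Xmx value in bytes, or -1 if the string sets none. */
    public static long configuredMaxHeapBytes(String javaOpts) {
        Matcher m = XMX.matcher(javaOpts);
        if (!m.find()) {
            return -1L;
        }
        long value = Long.parseLong(m.group(1));
        String unit = m.group(2).toLowerCase();
        if (unit.equals("k")) return value * 1024L;
        if (unit.equals("m")) return value * 1024L * 1024L;
        if (unit.equals("g")) return value * 1024L * 1024L * 1024L;
        return value; // no suffix: the number is already in bytes
    }

    public static void main(String[] args) {
        // Example value taken from the thread; pass another string to try it.
        String opts = args.length > 0 ? args[0] : "-Xmx7000m";
        long configured = configuredMaxHeapBytes(opts);
        long actual = Runtime.getRuntime().maxMemory();
        System.err.println("configured=" + configured + " bytes, JVM reports max=" + actual + " bytes");
    }
}
```

The two numbers need not match exactly, but if the JVM's figure is far
below the configured one, the setting is probably not reaching the
child tasks.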

It won't help the memory problems, but it sounds like you could run
with zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2
XL instances can run more than two tasks per node (they have 4 virtual
cores, see http://aws.amazon.com/ec2/instance-types/). And you should
configure them to take advantage of multiple disks -


On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] <zak_richard@bah.com> wrote:
> All, I have a project that I am working on involving PDF files in HDFS.
> There are X number of directories and each directory contains Y number
> of PDFs, and per directory all the PDFs are to be concatenated.  At the
> moment I am running a test with 5 directories and 15 PDFs in each
> directory.  I am also using iText to handle the PDFs, and I wrote a
> wrapper class to take PDFs and add them to an internal PDF that grows. I
> am running this on Amazon's EC2 using Extra Large instances, which have
> a total of 15 GB RAM.  Each Java process, two per Instance, has 7GB
> maximum (-Xmx7000m).  There is one Master Instance and 4 Slave
> instances.  I am able to confirm that the Slave processes are connected
> to the Master and have been working.  I am using Hadoop 0.19.0.
> The problem is that I run out of memory when the concatenation class
> reads in a PDF.  I have tried both the iText library version 2.1.4 and
> the Faceless PDF library, and both have the error in the middle of
> concatenating the documents.  I looked into Multivalent, but that one
> just uses Strings to determine paths and it opens the files directly,
> while I am using a wrapper class to interact with items in HDFS, so
> Multivalent is out.
> The PDFs aren't enormous (17 MB or less) and each Instance has
> tons of memory, so why am I running out of memory?
> The mapper works like this.  It gets a text file with a list of
> directories, and per directory it reads in the contents and adds them to
> the concatenation class.  The reducer pretty much does nothing.  Is this
> the best way to do this, or is there a better way?
> Thank you!
> Richard J. Zak
