hadoop-common-user mailing list archives

From "Zak, Richard [USA]" <zak_rich...@bah.com>
Subject RE: Concatenating PDF files
Date Tue, 06 Jan 2009 22:31:18 GMT
Thank you very much Tom, that seems to have done the trick!


And I was able to churn through 4 directories each with 100 PDFs.  And yes,
from ps I could see that the processes were using the "-Xmx7000m" option.

Richard J. Zak

-----Original Message-----
From: Tom White [mailto:tom@cloudera.com] 
Sent: Monday, January 05, 2009 06:47
To: core-user@hadoop.apache.org
Subject: Re: Concatenating PDF files

Hi Richard,

Are you running out of memory after many PDFs have been processed by one
mapper, or during the first? The former would suggest that memory isn't
being released; the latter that the task VM doesn't have enough memory to
start with.
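
One way to tell these two cases apart is to log heap usage from inside the
task as it works through the PDFs. The sketch below is only an illustration
and uses nothing but the standard JDK; the class name HeapLog and its methods
are hypothetical and not part of Hadoop or iText. If "used" climbs with every
PDF, memory is probably not being released; if "max" is far below the heap you
asked for, the child JVM is not picking up the -Xmx setting.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper (not part of Hadoop or iText): reports JVM heap figures.
final class HeapLog {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use (reflects -Xmx), in MB. */
    static long maxMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use (allocated minus free), in MB. */
    static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    /** One log line, e.g. to print after each PDF is added. */
    static String snapshot(String label) {
        return label + ": used=" + usedMb() + "MB max=" + maxMb() + "MB";
    }

    public static void main(String[] args) {
        // Flags the JVM was actually started with (look for -Xmx here).
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.err.println("JVM args: " + jvmArgs);

        System.err.println(snapshot("task start"));
        // Stand-in for reading one 17 MB PDF into memory.
        byte[] fakePdf = new byte[17 * 1024 * 1024];
        System.err.println(snapshot("after loading " + fakePdf.length + " bytes"));
    }
}
```

Calling HeapLog.snapshot(...) from the mapper after each PDF is added and
writing the result to stderr should put the figures in the task logs.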

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the processes are
using by logging into a machine when the job is running and running 'top' or

It won't help the memory problems, but it sounds like you could run with
zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2 XL
instances can run more than two tasks per node (they have 4 virtual cores,
see http://aws.amazon.com/ec2/instance-types/). And you should configure
them to take advantage of multiple disks -


On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] <zak_richard@bah.com> wrote:
> All, I have a project that I am working on involving PDF files in HDFS.
> There are X number of directories and each directory contains Y number 
> of PDFs, and per directory all the PDFs are to be concatenated.  At 
> the moment I am running a test with 5 directories and 15 PDFs in each 
> directory.  I am also using iText to handle the PDFs, and I wrote a 
> wrapper class to take PDFs and add them to an internal PDF that grows. 
> I am running this on Amazon's EC2 using Extra Large instances, which 
> have a total of 15 GB RAM.  Each Java process, two per Instance, has 
> 7GB maximum (-Xmx7000m).  There is one Master Instance and 4 Slave 
> instances.  I am able to confirm that the Slave processes are 
> connected to the Master and have been working.  I am using Hadoop 0.19.0.
> The problem is that I run out of memory when the concatenation class 
> reads in a PDF.  I have tried both the iText library version 2.1.4 and 
> the Faceless PDF library, and both have the error in the middle of 
> concatenating the documents.  I looked into Multivalent, but that one 
> just uses Strings to determine paths and it opens the files directly, 
> while I am using a wrapper class to interact with items in HDFS, so 
> Multivalent is out.
> The PDFs aren't enormous (17 MB or less) and each Instance has 
> tons of memory, so why am I running out of memory?
> The mapper works like this.  It gets a text file with a list of 
> directories, and per directory it reads in the contents and adds them 
> to the concatenation class.  The reducer pretty much does nothing.  Is 
> this the best way to do this, or is there a better way?
> Thank you!
> Richard J. Zak
