hadoop-common-user mailing list archives

From "Zak, Richard [USA]" <zak_rich...@bah.com>
Subject Concatenating PDF files
Date Fri, 02 Jan 2009 20:50:36 GMT
All, I have a project that I am working on involving PDF files in HDFS.
There are X directories, and each directory contains Y PDFs; per
directory, all the PDFs are to be concatenated.  At the moment I am
running a test with 5 directories and 15 PDFs in each directory.  I am
using iText to handle the PDFs, and I wrote a wrapper class that takes
PDFs and adds them to an internal PDF that grows.  I am running this on
Amazon EC2 using Extra Large instances, which have 15 GB of RAM.  Each
Java process, two per instance, has a 7 GB maximum heap (-Xmx7000m).
There is one Master instance and 4 Slave instances.  I am able to
confirm that the Slave processes are connected to the Master and have
been working.  I am using Hadoop 0.19.0.
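For reference, in Hadoop 0.19 the heap for each spawned task JVM is set through the `mapred.child.java.opts` property in hadoop-site.xml; a sketch of how a setting like the one above would be expressed (the value shown is illustrative, taken from the figure quoted in this message, not a recommendation):

```xml
<!-- hadoop-site.xml (Hadoop 0.19): JVM options passed to each child
     task process.  -Xmx7000m mirrors the heap size mentioned above. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx7000m</value>
</property>
```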
The problem is that I run out of memory when the concatenation class
reads in a PDF.  I have tried both iText 2.1.4 and the Faceless PDF
library, and both fail with the error partway through concatenating the
documents.  I looked into Multivalent, but it uses plain Strings for
paths and opens the files directly, while I am using a wrapper class to
interact with items in HDFS, so Multivalent is out.
Since the PDFs aren't enormous (17 MB or less) and each instance has
plenty of memory, why am I running out of memory?
The mapper works like this: it gets a text file with a list of
directories, and per directory it reads in the contents and adds them to
the concatenation class.  The reducer does essentially nothing.  Is this
the best way to do this, or is there a better approach?
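To make the per-directory copy pattern concrete, here is a minimal, hypothetical sketch of the I/O discipline involved: each input is pushed through one small fixed buffer instead of being read whole into the heap. Note that raw byte concatenation does not produce a valid merged PDF (a PDF-aware library such as iText is still needed for that); the sketch only illustrates the bounded-memory copy loop, and the class and method names are invented for this example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: streams each input file into one output using a
// fixed 64 KB buffer, so memory use stays constant regardless of file
// size.  This is NOT PDF merging -- it only demonstrates the streaming
// pattern that avoids holding an entire document in the heap at once.
public class StreamConcat {
    public static void concat(List<Path> inputs, Path output) throws IOException {
        byte[] buf = new byte[64 * 1024];  // fixed buffer, reused across files
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(output))) {
            for (Path in : inputs) {
                try (InputStream is = Files.newInputStream(in)) {
                    int n;
                    while ((n = is.read(buf)) != -1) {
                        out.write(buf, 0, n);  // copy one chunk, never the whole file
                    }
                }  // each input stream is closed before the next is opened
            }
        }
    }
}
```

With real PDFs the same shape applies, but the per-file step would hand the stream to a PDF library and release that document's resources before moving to the next one.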
Thank you!
Richard J. Zak
