hadoop-common-user mailing list archives

From Darren Govoni <dar...@ontrenet.com>
Subject RE: Suitable for Hadoop?
Date Wed, 21 Jan 2009 15:58:42 GMT
   Thanks for the suggestion. I am actually building an EC2 architecture
to facilitate this! I tried using a database to warehouse the files, and
then NFS, but the connection load was too heavy. So I thought maybe HDFS
could be used just to mitigate the data access load across all the
instances. I have a parallel processing architecture based on SQS
queues, but will consider a map process.
   I have about 32 processes or so per machine in EC2 reading from SQS
queues for files to process. They could then efficiently get the files
from HDFS, yes? That is, without bottlenecking on access to a database or NFS?
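
A rough sketch of the worker side described above, assuming each queue message body is simply an HDFS path and that the `hadoop` command-line client is installed on every instance. The function names, the `convert` callback, and the injectable `run` parameter are hypothetical and only there for illustration; the SQS polling loop itself is left out.

```python
import posixpath
import subprocess


def hdfs_get_command(hdfs_path, local_dir):
    """Build the `hadoop fs -get` command that copies one HDFS file locally."""
    local_path = posixpath.join(local_dir, posixpath.basename(hdfs_path))
    return ["hadoop", "fs", "-get", hdfs_path, local_path]


def process_message(hdfs_path, local_dir, convert, run=subprocess.run):
    """Fetch one whole file from HDFS, then hand the local copy to convert().

    hdfs_path is assumed to be the body of a queue message. convert is a
    hypothetical callback that performs the actual format conversion.
    run is injectable so the function can be exercised without a cluster.
    """
    command = hdfs_get_command(hdfs_path, local_dir)
    run(command, check=True)  # raises if the copy fails
    return convert(command[-1])
```

Each of the worker processes would call `process_message` once per message it receives, so the file reads are served by the HDFS datanodes rather than by a single database or NFS server.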

Well, I will test this direction and see.

Thank you!

On Wed, 2009-01-21 at 09:41 -0500, Zak, Richard [USA] wrote:
> You can do that.  I did a Map/Reduce job for about 6 GB of PDFs to
> concatenate them, and the New York Times used Hadoop to process a few TB of
> PDFs.
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (don't know
> what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file with the directory(ies) to process, since
> the mapper takes in file contents (and you don't want to read in one line of
> binary)
> - Have the map process all contents for that one given directory from the
> input text file
> - Break down the documents into more directories to go easier on the memory
> - Use Amazon's EC2, and the scripts in <hadoop_dir>/src/contrib/ec2/bin/
> (there is a script which passes environment variables to launched instances,
> modify the script to allow Hadoop to use more memory by setting the
> HADOOP_HEAPSIZE environment variable and having the variable properly
> passed)
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
> uses the HDFS in a beneficial manner, and the distributed part is very
> helpful!
> Richard J. Zak
> -----Original Message-----
> From: Darren Govoni [mailto:darren@ontrenet.com] 
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
> Hi,
>   I have a task to process large quantities of files by converting them into
> other formats. Each file is processed as a whole and converted to a target
> format. Since there are hundreds of GB of data I thought it suitable for
> Hadoop, but the problem is, I don't think the files can be broken apart and
> processed. For example, how would mapreduce work to convert a Word Document
> to PDF if the file is reduced to blocks? I'm not sure that's possible, or is
> it?
> thanks for any advice
> Darren
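
The map-only approach Richard outlines in the quoted message (an input text file that lists directories, no reducers, and each file handled as a whole) might look roughly like the sketch below. It is written as a mapper that could be run under Hadoop Streaming with the number of reduce tasks set to zero. It assumes the listed directories are reachable as local paths on each node, which is a simplification; with HDFS, the listing and reading would go through the Hadoop tools instead. `convert_placeholder` is a hypothetical stand-in for the real conversion (for example, a call into a PDF library), and all function names here are illustrative.

```python
import os
import sys


def list_files(directory):
    """Return the sorted full paths of the regular files directly in directory."""
    paths = (os.path.join(directory, name) for name in os.listdir(directory))
    return sorted(path for path in paths if os.path.isfile(path))


def map_directories(lines, convert):
    """Treat each input line as a directory and convert every file in it whole.

    Yields (path, status) pairs so that one bad document does not abort
    the rest of the directory.
    """
    for line in lines:
        directory = line.strip()
        if not directory:
            continue  # skip blank lines in the input listing
        for path in list_files(directory):
            try:
                convert(path)
                yield path, "ok"
            except Exception as exc:
                yield path, "failed: %s" % exc


def convert_placeholder(path):
    """Hypothetical conversion step: reads the whole file and does nothing else."""
    with open(path, "rb") as handle:
        handle.read()


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming-style entry point: directory paths in, tab-separated status out."""
    for path, status in map_directories(stdin, convert_placeholder):
        stdout.write("%s\t%s\n" % (path, status))
```

To use this as a script, call `main()` at the bottom of the file. Because the input lines are directory names rather than file contents, the framework only splits the short listing file, and each document is always read in one piece by a single map task. Splitting the documents across more, smaller directories gives more input lines and therefore finer-grained units of work.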
