hadoop-common-user mailing list archives

From Jim Twensky <jim.twen...@gmail.com>
Subject Re: Suitable for Hadoop?
Date Wed, 21 Jan 2009 19:46:49 GMT
Ricky,

Hadoop is primarily optimized for large files, usually files larger than one
input split. However, there is an input format called MultiFileInputFormat
which can be used to make Hadoop work efficiently on smaller files. You can
also override the isSplitable method of an input format to return false,
which ensures that a file is not split into pieces but is instead processed
by a single mapper.
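
For example, a minimal sketch of such an input format (written against the old
org.apache.hadoop.mapred API; the class name is just illustrative):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Declares every input file non-splittable, so each file becomes exactly one
// input split and is handled by a single mapper.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}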

Jim

On Wed, Jan 21, 2009 at 9:14 AM, Ricky Ho <rho@adobe.com> wrote:

> Hmmm ...
>
> From a space-efficiency perspective, given that HDFS (with its large block
> size) expects large files, is Hadoop optimized for processing a large number
> of small files?  Does each file take up at least one block, or can multiple
> files sit on the same block?
>
> Rgds,
> Ricky
> -----Original Message-----
> From: Zak, Richard [USA] [mailto:zak_richard@bah.com]
> Sent: Wednesday, January 21, 2009 6:42 AM
> To: core-user@hadoop.apache.org
> Subject: RE: Suitable for Hadoop?
>
> You can do that.  I did a Map/Reduce job on about 6 GB of PDFs to
> concatenate them, and The New York Times used Hadoop to process a few TB of
> PDFs.
>
> What I would do is this:
> - Use the iText library, a Java library for PDF manipulation (I don't know
> what you would use for reading Word docs)
> - Don't use any Reducers
> - Have the input be a text file listing the directory(ies) to process, since
> the mapper normally takes in file contents line by line (and you don't want
> to read binary data one line at a time); see the sketch after this list
> - Have each map task process all of the files in the one directory it is
> given from the input text file
> - Break the documents down into more directories to go easier on memory
> - Use Amazon's EC2 and the scripts in <hadoop_dir>/src/contrib/ec2/bin/
> (there is a script which passes environment variables to launched instances;
> modify it to let Hadoop use more memory by setting the HADOOP_HEAPSIZE
> environment variable and making sure the variable is passed through)
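
A minimal sketch of that path-list mapper (old org.apache.hadoop.mapred API;
the class name and convertToPdf are placeholders for your own conversion code):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input line names an HDFS directory; the mapper converts every plain
// file in that directory as a whole and emits the path of the converted copy.
public class DirectoryConvertMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    Path dir = new Path(value.toString().trim());
    FileSystem fs = dir.getFileSystem(conf);
    FileStatus[] entries = fs.listStatus(dir);
    if (entries == null) {
      return;                                  // directory missing or empty
    }
    for (FileStatus entry : entries) {
      if (entry.isDir()) {
        continue;                              // only convert plain files
      }
      Path src = entry.getPath();
      Path dst = new Path(src.toString() + ".pdf");
      convertToPdf(fs, src, dst);              // placeholder for iText etc.
      output.collect(new Text(dst.toString()), NullWritable.get());
      reporter.progress();                     // avoid timeouts on long files
    }
  }

  private void convertToPdf(FileSystem fs, Path src, Path dst) throws IOException {
    // Hypothetical: read fs.open(src), run the conversion, write fs.create(dst).
  }
}

With zero reducers configured, the mappers write the converted files straight
back to HDFS, and the collected key/value pairs simply record what was produced.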
>
> I realize this isn't the strong point of Map/Reduce or Hadoop, but it still
> uses HDFS in a beneficial manner, and the distributed part is very helpful!
>
>
> Richard J. Zak
>
> -----Original Message-----
> From: Darren Govoni [mailto:darren@ontrenet.com]
> Sent: Wednesday, January 21, 2009 08:08
> To: core-user@hadoop.apache.org
> Subject: Suitable for Hadoop?
>
> Hi,
>  I have a task to process large quantities of files by converting them into
> other formats. Each file is processed as a whole and converted to a target
> format. Since there are hundreds of GB of data, I thought it suitable for
> Hadoop, but the problem is, I don't think the files can be broken apart and
> processed. For example, how would MapReduce work to convert a Word document
> to PDF if the file is split into blocks? I'm not sure that's possible, or
> is it?
>
> thanks for any advice
> Darren
>
>
