hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: How to handle tif image files in hadoop
Date Fri, 09 May 2008 17:03:11 GMT

Your read loop has a bug in it and also allocates far more garbage than is
necessary.  The small buffer size will also slow things down somewhat.

Try this instead:

      byte[] buffer = new byte[100000];
      int readBytes = instream.read(buffer);
      while (readBytes > 0) {
         fimb.write(buffer, 0, readBytes);
         readBytes = instream.read(buffer);
      }

But even if you manage to read the data correctly, why are you doing all the
work of reading the data into a buffer and then reading it into an image?

Why not just replace everything from line 3 on with this:

      BufferedImage picture = ImageIO.read(instream);
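The key point is that ImageIO.read accepts any InputStream, not just a file. Here is a minimal, self-contained sketch of that (class and method names are hypothetical, and a PNG round-trip through an in-memory stream stands in for the HDFS stream, since no cluster is assumed here):

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageIoStreamDemo {
    // Round-trips a tiny in-memory image through ImageIO to show that
    // ImageIO.read works on any InputStream, not only on files.
    static BufferedImage roundTrip() throws IOException {
        BufferedImage original = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ImageIO.write(original, "png", bytes);
        // An FSDataInputStream opened from HDFS could be passed here the
        // same way, provided a TIFF ImageReader plugin is registered.
        return ImageIO.read(new ByteArrayInputStream(bytes.toByteArray()));
    }

    public static void main(String[] args) throws IOException {
        BufferedImage picture = roundTrip();
        System.out.println(picture.getWidth() + "x" + picture.getHeight());
    }
}
```

Whether TIFF works this way depends on having a TIFF-capable ImageReader registered (stock Java SE of that era did not ship one, which is where JAI comes in).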


What you are doing here will work reasonably well if you have an input path
name, possibly because your map input contains file names, but this loses
all locality.

It would be better to copy or reinvent some of the archive code so that you
can put all of your images in a few files.  All you need from your envelope
is a byte count for each image.  Thus, if your input file has a 4-byte
integer containing the size of the following image, you can build a very
simple input format that reads images and passes them to your mapper.
Doing that allows Hadoop to position the compute tasks near your data, which
will improve your performance dramatically.
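The envelope described above can be sketched in plain Java (names here are hypothetical; a real Hadoop RecordReader would additionally track split boundaries and report progress, which this sketch omits):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ImageEnvelopeDemo {
    // Write each image as a 4-byte length prefix followed by its bytes.
    static byte[] pack(List<byte[]> images) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        for (byte[] image : images) {
            data.writeInt(image.length);
            data.write(image);
        }
        return out.toByteArray();
    }

    // Read records back until EOF; a RecordReader built on this idea would
    // hand each byte[] (one whole image) to the mapper as a value.
    static List<byte[]> unpack(byte[] envelope) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(envelope));
        List<byte[]> images = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = in.readInt();
            } catch (EOFException eof) {
                break;  // clean end of the envelope
            }
            byte[] image = new byte[length];
            in.readFully(image);
            images.add(image);
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> images = new ArrayList<>();
        images.add(new byte[] {1, 2, 3});
        images.add(new byte[] {4, 5});
        List<byte[]> roundTrip = unpack(pack(images));
        System.out.println(roundTrip.size());
        System.out.println(roundTrip.get(0).length);
    }
}
```

Because each record is self-delimiting, a few large envelope files replace thousands of small image files, and the framework can schedule map tasks on the nodes holding the blocks.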


On 5/9/08 6:22 AM, "charan@students.iiit.ac.in" <charan@students.iiit.ac.in>
wrote:

> Hi,
> 
>   Thank you, sir, for letting me know one more aspect of Hadoop.
> But I used JAI and processed our files by reading them as bytes from HDFS
> and sending them to the JAI library for TIFF decoding. And it worked :)
> 
> For those who want to work with TIFF files in HDFS, here is a way:
> 
>           Path inFile = new Path(infilename);
>           FSDataInputStream instream = fs.open(inFile);
>           ByteArrayOutputStream fimb = new ByteArrayOutputStream();
>           byte[] buffer = new byte[300];
>           int readBytes = 0;
>           while((readBytes = instream.read(buffer)) > 0)
>           {
>               fimb.write(buffer,0,300);
>               buffer = new byte[300];
>           }
>           byte[] formattedImageBytes;
>           formattedImageBytes = fimb.toByteArray();
>   BufferedImage picture = ImageIO.read ( new ByteArrayInputStream (
> formattedImageBytes ) );
> 
> Once we get a BufferedImage it is easy to process, since it is an object
> 
> Thank you.
> 
> 
>> Hello ,
>> 
>> It's better that you write your own InputFormat for processing the TIFF
>> images. For more information you can look into this:
>> 
>> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
>> 
>> ---
>> Peeyush
>> 
>> On Thu, 2008-05-08 at 13:32 +0530, charan@students.iiit.ac.in wrote:
>> 
>>> Hi,
>>> 
>>>  I want to process the information in TIFF images using Hadoop. For this,
>>> a BufferedImage object has to be created. For JPEG images, ImageIO is used
>>> along with the ByteArrayOutputStream which contains the byte data of the
>>> image. But for TIFF images, this doesn't work. Is there any way to handle
>>> this problem?
>>> 
>>>   Also, can conventional JAI library methods be used to directly access
>>> TIFF files in HDFS?
>>> 
>>> Thank you.