hadoop-common-user mailing list archives

From Zak Stone <zst...@gmail.com>
Subject Re: hadoop streaming binary input / image processing
Date Thu, 14 May 2009 16:55:49 GMT
Hi Qiming,

You might consider using Dumbo, which is a Python wrapper for Hadoop
Streaming. The associated typedbytes module makes it easy for
streaming programs to work with binary data:

http://wiki.github.com/klbostee/dumbo
http://wiki.github.com/klbostee/typedbytes
http://dumbotics.com/2009/03/03/indexing-typed-bytes/
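
To give a feel for why typed bytes suit binary data, here is a rough
sketch of the framing as I understand it from the Hadoop typedbytes
package description: each value is a one-byte type code followed by a
big-endian 32-bit length and the payload, so newlines and NUL bytes in
an image cannot break record boundaries. The helper names (write_pair,
read_value, read_pair) are made up for this illustration, and only two
type codes are handled; in practice you would use the typedbytes module
itself rather than this.

```python
import io
import struct

# Type codes as I understand them from the Hadoop typedbytes package:
# 0 = raw byte sequence, 7 = UTF-8 string. Other codes are not handled here.
TYPE_BYTES = 0
TYPE_STRING = 7


def write_pair(out, name, data):
    """Write one (string key, bytes value) record: code, length, payload."""
    key = name.encode("utf-8")
    out.write(struct.pack(">bi", TYPE_STRING, len(key)) + key)
    out.write(struct.pack(">bi", TYPE_BYTES, len(data)) + data)


def read_value(inp):
    """Read one length-prefixed value and decode it according to its code."""
    code, length = struct.unpack(">bi", inp.read(5))
    payload = inp.read(length)
    if code == TYPE_STRING:
        return payload.decode("utf-8")
    if code == TYPE_BYTES:
        return payload
    raise ValueError("type code %d not handled in this sketch" % code)


def read_pair(inp):
    """Read back one (key, value) record written by write_pair."""
    key = read_value(inp)
    value = read_value(inp)
    return key, value


if __name__ == "__main__":
    buf = io.BytesIO()
    # The payload deliberately contains a newline, a tab and a NUL byte.
    write_pair(buf, "cat.jpg", b"\xff\xd8\xff\n\t\x00")
    buf.seek(0)
    print(read_pair(buf))
```

Because every value carries its own length, the reader never has to
scan for a delimiter, which is what makes plain text streaming awkward
for images.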

If you are using an older version of Hadoop (such as 0.18.3), you will
need to apply the following patches to Hadoop to make typedbytes work:

https://issues.apache.org/jira/browse/HADOOP-1722
https://issues.apache.org/jira/browse/HADOOP-5450

The commands you use to apply the patches might look something like this:

cd <HADOOP_HOME>
patch -p0 < HADOOP-1722-branch-0.18.patch
patch -p0 < HADOOP-5450.patch
ant package
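
Once typedbytes works, a mapper over images can be an ordinary Python
generator that takes a key and a value. The sketch below is only an
illustration under my own assumptions: that each record's key is the
file name and its value is the raw file contents, and that files are
recognised as JPEG by their leading FF D8 bytes. The function body is
plain Python so it can be tried locally; under Dumbo you would hand the
function to Dumbo's runner instead of calling it yourself (see the
Dumbo wiki for the exact invocation).

```python
JPEG_MAGIC = b"\xff\xd8"  # JPEG files start with these two bytes


def mapper(key, value):
    """Emit (file name, size in bytes) for every record that looks like a JPEG.

    Assumed record layout (hypothetical): key = file name, value = raw bytes.
    """
    if value[:2] == JPEG_MAGIC:
        yield key, len(value)


if __name__ == "__main__":
    # Local dry run over in-memory records; no Hadoop involved.
    records = [
        ("a.jpg", b"\xff\xd8\xff\xe0" + b"\x00" * 10),
        ("notes.txt", b"hello"),
    ]
    for name, data in records:
        for out in mapper(name, data):
            print(out)
```

Real image work (decoding, resizing, feature extraction) would go where
the size is computed; the point is only that the mapper sees the bytes
directly instead of a text line.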

The guy who put Dumbo together, Klaas Bosteels, is incredibly helpful,
and he continues to improve this useful project.

Zak


On Thu, May 14, 2009 at 12:39 PM, openresearch
<Qiming.He@openresearchinc.com> wrote:
>
> All,
>
> I have read some recommendations regarding image (binary input) processing
> using Hadoop streaming, which only accepts text out of the box for now.
> http://hadoop.apache.org/core/docs/current/streaming.html
> https://issues.apache.org/jira/browse/HADOOP-1722
> http://markmail.org/message/24woaqie2a6mrboc
>
> However, I have not found a straight answer.
>
> One recommendation is to put image data on HDFS, but then we have to do
> "hdf -get" for each file/dir and process it locally, which is very expensive.
>
> Another recommendation is to "...put them in a centralized place where all
> the hadoop nodes can access them (via .e.g, NFS mount)..." Obviously, IO
> will become the bottleneck, which defeats the purpose of distributed
> processing.
>
> I also notice that an enhancement ticket is open for hadoop-core. Has it
> been committed to any svn (0.21) branch? Can somebody show me an example of
> how to take *.jpg files (from HDFS) and process them in a distributed
> fashion using streaming?
>
> Many thanks
>
> -Qiming
> --
> View this message in context: http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
