hadoop-common-user mailing list archives

From "Richard Whitehead" <richard.whiteh...@ieee.org>
Subject Re: Building a distributed system
Date Tue, 19 Jul 2016 08:48:07 GMT
Thanks Ravi and Marcin,

You are right, what we need is a work queue, a way to start jobs on remote machines, and a
way to move data to and from those remote machines.   The “jobs” are just executables
that process one item of data.  We don’t need to split the data into chunks or to combine
the results from several jobs.
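The shape of the job described above (a short sequence of executables, each consuming one item of data, with independent items fanned out across workers) can be sketched in a few lines. This is purely illustrative: the stage commands below are tiny Python one-liners standing in for the real image-processing executables, which are hypothetical here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in stages: in practice these would be the real executables.
# Each reads its input on stdin and writes its result to stdout.
STAGES = [
    [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"],
    [sys.executable, "-c", "import sys; print(sys.stdin.read()[::-1], end='')"],
]

def process_item(data: str) -> str:
    """Run each stage executable in sequence, piping output to input."""
    for cmd in STAGES:
        result = subprocess.run(cmd, input=data, capture_output=True,
                                text=True, check=True)
        data = result.stdout
    return data

def process_many(items):
    """Items are independent, so they can be processed concurrently."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(process_item, items))

if __name__ == "__main__":
    print(process_many(["abc", "def"]))
```

The distributed-system part of the problem is then "only" placing these `process_item` calls on remote machines and moving the data, which is what the work-queue discussion below is about.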

The feeling amongst the developers seems to be that Java would be preferable to Python (this
is a medical product and people, whether rightly or wrongly, think Java would be easier to

Is there a way to use the Hadoop (or some other) infrastructure in a simple way to prevent
us having to write a scheduler, database schema etc.?  We can do that but it seems to be solving
a problem that has already been solved many times.

Thanks again,


From: Ravi Prakash 
Sent: Monday, July 18, 2016 7:45 PM
To: Marcin Tustin 
Cc: Richard Whitehead ; user@hadoop.apache.org 
Subject: Re: Building a distributed system

Welcome to the community Richard!

I suspect Hadoop can be more useful than just splitting data and stitching it back together. Depending on your use cases, it may come in handy for managing your machines, restarting failed tasks, scheduling work when data becomes available, etc. I wouldn't necessarily count it out. I'm sorry, I am not familiar with celery, so I can't provide a direct comparison. Also, in the not-unlikely event that your input data grows, you wouldn't have to rewrite your infrastructure code if you wrote your Hadoop code properly.



On Mon, Jul 18, 2016 at 9:23 AM, Marcin Tustin <mtustin@handybook.com> wrote:

  I think you're confused as to what these things are.  

  The fundamental question is: do you want to run one job on sub-parts of the data and then stitch the results together (in which case Hive/MapReduce/Spark will be for you), or do you essentially already have the splitting into computer-sized chunks figured out, and you just need a work queue? In the latter case there are a number of alternatives. I happen to like Python, and would recommend celery (potentially wrapped by something like Airflow) for that case.
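Celery provides this ready-made (a broker-backed queue, remote workers, retries). Purely to illustrate the underlying work-queue pattern being recommended here, and emphatically not celery's own API, a minimal standard-library sketch might look like this (the doubling step is a stand-in for launching a real executable):

```python
import queue
import threading

def worker(jobs: queue.Queue, results: list, lock: threading.Lock):
    """Pull items off the shared queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            break
        out = item * 2  # stand-in for running the real executable on the item
        with lock:
            results.append(out)

def run_queue(items, n_workers=3):
    """Distribute independent items across n_workers worker threads."""
    jobs = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so all of them exit
    for t in threads:
        t.join()
    return sorted(results)  # completion order varies; sort for determinism

print(run_queue([1, 2, 3]))  # → [2, 4, 6]
```

A library like celery replaces the in-process `queue.Queue` with a broker (e.g. Redis or RabbitMQ) so the workers can live on other machines, which is exactly the "queue work on remote computers" requirement in this thread.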

  On Mon, Jul 18, 2016 at 12:17 PM, Richard Whitehead <richard.whitehead@ieee.org> wrote:


    I wonder if the community can help me get started.

    I’m trying to design the architecture of a project and I think that using some Apache
Hadoop technologies may make sense, but I am completely new to distributed systems and to
Apache (I am a very experienced developer, but my expertise is image processing on Windows!).

    The task is very simple: call 3 or 4 executables in sequence to process some data.  The
data is just a simple image and the processing takes tens of minutes.

    We are considering a distributed architecture to increase throughput (latency does not
matter).  So we need a way to queue work on remote computers, and a way to move the data around.
 The architecture will have to work on a single server, or on a couple of servers in a rack,
or in the cloud; 2 or 3 computers maximum.

    Being new to all this I would prefer something simple rather than something super-powerful.

    I was considering Hadoop YARN and Hadoop DFS; does this make sense?  I’m assuming MapReduce
would be over the top; is that the case?

    Thanks in advance.


