hadoop-user mailing list archives

From Eli Finkelshteyn <iefin...@gmail.com>
Subject New Production Cluster Criticisms/Advice
Date Wed, 15 Aug 2012 01:07:50 GMT
Hey Folks,
I'm going to be setting up my first production cluster soon, and was
hoping to get some advice and criticism on my plan of action.
Here's what I have so far:

I'm setting this up for a start-up that's not gathering very big data yet,
but will be in the next few months (I hope, anyway). I'd like to use the
cluster for a few things, at least at first:
1. logging stuff it doesn't make sense to write to a normal database (as
well as duplicates of what I am throwing in my database so I can use that
stuff from HDFS later on). Basically, just logging a ton of information I
might want for analytics/model training later.
2. analytics processing.
3. model training (for machine learning). I'll primarily do this through
4. will probably want HBase on there as well for real-time reading of some
data. I'm not married to this, and haven't played around much with HBase
yet, but wanted to leave the possibility open.

*The Plan:*
I'm thinking I'll set this up in Amazon. We have most of the rest of our
hardware there, and I really like the option to be able to spin up a bunch
of extra workers at will to have them train some ML model for me and then
kill them off. For now, just to get things off the ground, I'm going to
set up a small 4-machine cluster (1 NameNode, 1
SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
around with that setup, and will add more to it as needed. Since everything
will be puppetized, adding more machines shouldn't be too bad (I think).
I've been using Cloudera so far, and I haven't seen any good reason to
switch, so I'll use CDH4. For logging, I'll just use Scribe and wind up
storing stuff as LZO files (a good tutorial on the best way to do this would be


