hadoop-user mailing list archives

From Eli Finkelshteyn <iefin...@gmail.com>
Subject Re: New Production Cluster Criticisms/Advice
Date Wed, 15 Aug 2012 01:36:19 GMT
Hey Mohammad,
Thanks for the reply. I've been using Hadoop and Pig for a while, and I've
set up a pseudo-cluster before. I've just never set up anything
production-scale and wanted advice on that.

Cheers,

On Tue, Aug 14, 2012 at 6:20 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Hello Eli,
>
>     If this is your first time with Hadoop, then I would suggest
> configuring a cluster locally just to get yourself familiar with Hadoop (a
> pseudo-distributed setup would do).
>
> For your analytical stuff you can have a look at Pig, another member of
> the Hadoop ecosystem. It's a dataflow language that makes analytics really
> easy.
>
> As a data store, HBase would definitely be a good move.
>
> For data aggregation, you can also have a look at Flume and Chukwa, apart
> from Scribe.
>
> On Wednesday, August 15, 2012, Eli Finkelshteyn <iefinkel@gmail.com>
> wrote:
> > Hey Folks,
> > I'm going to be setting up my first new production cluster soon, and was
> > hoping to get some advice and criticism on my current plan of action.
> > Here's my current plan:
> > Background/Requirements:
> > I'm setting this up for a start-up that's not gathering very big data
> > yet, but will be in the next few months (I hope, anyway). I'd like to use
> > the cluster for a few things, at least at first:
> > 1. Logging stuff it doesn't make sense to write to a normal database (as
> > well as duplicates of what I am throwing in my database so I can use that
> > stuff from HDFS later on). Basically, just logging a ton of information I
> > might want for analytics/model training later.
> > 2. Analytics processing.
> > 3. Model training (for machine learning). I'll primarily do this through
> > Mahout.
> > 4. Will probably want HBase on there as well for real-time reading of
> > some data. I'm not married to this, and haven't played around much with
> > HBase yet, but wanted to leave the possibility open.
> > The Plan:
> > I'm thinking I'll set this up in Amazon. We have most of the rest of our
> > hardware there, and I really like the option to be able to spin up a bunch
> > of extra workers at will to have them train some ML model for me and then
> > kill them off. For now, just to get things off the ground, I'm going to
> > set up a small 4-machine cluster (1 NameNode, 1
> > SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers). I'll start playing
> > around with that setup, and will add more to it as needed. Since everything
> > will be puppetized, adding more machines shouldn't be too bad (I think).
> > I've been using Cloudera so far, and I haven't seen any good reason to
> > switch, so I'll use CDH4. For logging, I'll just use Scribe and wind up
> > storing stuff as LZO files (a good tutorial on the best way to do this
> > would be awesome).
> > Thoughts?
> > Eli
>
> --
> Regards,
>     Mohammad Tariq
>
>
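
For readers following the thread, the four-machine layout from the original message can be written down as a small Python sketch. This is only an illustration, not anything the posters ran: the hostnames (master-1, worker-1, and so on) and the helper functions are hypothetical, and only the role split (1 NameNode, 1 SecondaryNameNode/JobTracker, 2 DataNode/TaskTrackers) comes from the message. The sketch also assumes the stock HDFS default of dfs.replication = 3, which two DataNodes cannot satisfy until more workers are added.

```python
from collections import Counter

# Hypothetical hostnames; the role split mirrors the plan in the thread.
PLANNED_LAYOUT = {
    "master-1": ["NameNode"],
    "master-2": ["SecondaryNameNode", "JobTracker"],
    "worker-1": ["DataNode", "TaskTracker"],
    "worker-2": ["DataNode", "TaskTracker"],
}

WORKER_ROLES = ["DataNode", "TaskTracker"]


def role_counts(layout):
    """Count how many hosts run each daemon."""
    return Counter(role for roles in layout.values() for role in roles)


def add_workers(layout, extra):
    """Return a copy of the layout with `extra` more worker hosts."""
    grown = {host: list(roles) for host, roles in layout.items()}
    existing = sum(1 for roles in layout.values() if "DataNode" in roles)
    for i in range(existing + 1, existing + extra + 1):
        grown[f"worker-{i}"] = list(WORKER_ROLES)
    return grown


def max_replicas(layout, dfs_replication=3):
    """HDFS keeps at most one replica of a block per DataNode."""
    return min(dfs_replication, role_counts(layout)["DataNode"])


if __name__ == "__main__":
    print(role_counts(PLANNED_LAYOUT))
    print("replicas possible now:", max_replicas(PLANNED_LAYOUT))
    bigger = add_workers(PLANNED_LAYOUT, 2)
    print("replicas possible with 2 more workers:", max_replicas(bigger))
```

With only two DataNodes, each block can have at most two replicas, so either dfs.replication would need lowering or blocks would be reported as under-replicated until a third worker joins.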
