Subject: Re: Hadoop/HBase hardware requirement
From: Lior Schachter
To: user@hbase.apache.org
Date: Mon, 22 Nov 2010 15:02:57 +0200

Hi Lars,

I agree with every sentence you wrote (and that's why we chose HBase).
However, from a managerial point of view the question of the initial
investment is very important (especially when considering a new
technology).

Lior

p.s. The price is in USD ....

On Mon, Nov 22, 2010 at 2:43 PM, Lars George wrote:

> Hi Lior,
>
> I can only hope you state this in Shekels! But 20 nodes with Hadoop
> can do quite a lot, and you cannot compare a single Oracle box with a
> 20-node Hadoop cluster, as they serve slightly different use cases.
> You need to make a commitment to what you want to achieve with HBase;
> growth is the most important factor. Scaling Oracle is really
> expensive, while HBase/Hadoop, in comparison, is not: its costs grow
> linearly, whereas Oracle's grow closer to exponentially.
>
> Lars
>
> On Mon, Nov 22, 2010 at 1:27 PM, Lior Schachter wrote:
> > Hi all,
> >
> > Thanks for your input and assistance. From your answers I understand
> > that:
> > 1. More is better, but our configuration might work.
> > 2. There are small tweaks we can do that will improve our
> >    configuration (like having 4x500GB disks).
> > 3. Use monitoring (like Ganglia) to find the bottlenecks (a sample
> >    hookup sketch follows below).
> >
> > For me, the question here is how to balance our current budget
> > against system stability (and performance). I agree that more memory
> > and more disk space would improve our responsiveness, but on the
> > other hand our system is NOT expected to be real-time (it is rather
> > back-office analytics with a few hours of delay).
> >
> > This is a crucial point, since the proposed configurations we found
> > on the web don't distinguish between real-time configurations and
> > back-office configurations.
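For the monitoring item above: a minimal sketch of the Ganglia hookup,
assuming the 0.20-era Hadoop/HBase metrics framework; "gmond-host" is a
placeholder for wherever gmond listens. Each context pushes its metrics
every 10 seconds, and the HBase block also surfaces the region server
compaction queue size that comes up later in this thread.

    # conf/hadoop-metrics.properties (Hadoop and HBase each ship a copy)
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    dfs.period=10
    dfs.servers=gmond-host:8649

    mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    mapred.period=10
    mapred.servers=gmond-host:8649

    jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    jvm.period=10
    jvm.servers=gmond-host:8649

    # in HBase's copy: region server metrics, incl. compactionQueueSize
    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    hbase.period=10
    hbase.servers=gmond-host:8649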
> > To build a real-time cluster with 20 nodes would cost around
> > 200-300K (in Israel), which is similar to the price of a quite
> > strong Oracle cluster... so my boss (the CTO) was partially right
> > when telling me: "but you said it would be cheap!! very cheap :)"
> >
> > I believe that more money will come once we show the viability of
> > the system... I also read that heterogeneous clusters are common.
> >
> > It would help a lot if you could provide your configurations and
> > system characteristics (maybe on a Wiki page). It would also help to
> > get more of the "small tweaks" that you found helpful.
> >
> > Lior Schachter
> >
> > On Mon, Nov 22, 2010 at 1:33 PM, Lars George wrote:
> > > Oleg,
> > >
> > > Do you have Ganglia or some other graphing tool running against
> > > the cluster? It gives you metrics that are crucial here, for
> > > example the load on Hadoop and its DataNodes, as well as insertion
> > > rates etc. on HBase. What is also interesting is the compaction
> > > queue, to see if the cluster is going slow.
> > >
> > > Did you try loading from an empty system to a loaded one? Or was
> > > it already filled and you are trying to add more? Are you
> > > spreading the load across servers, or are you using sequential
> > > keys that tax only one server at a time?
> > >
> > > 16GB should work, but is not ideal. The various daemons simply
> > > need room to breathe. That said, I have personally started with
> > > even 12GB and it worked.
> > >
> > > Lars
> > >
> > > On Mon, Nov 22, 2010 at 12:17 PM, Oleg Ruchovets wrote:
> > > > On Sun, Nov 21, 2010 at 10:39 PM, Krishna Sankar wrote:
> > > > > Oleg & Lior,
> > > > >
> > > > > A couple of questions and a couple of suggestions to ponder:
> > > > > A) When you say 20 name servers, I assume you are talking
> > > > > about 20 task servers?
> > > >
> > > > Yes.
> > > >
> > > > > B) What type are your M/R jobs? Compute-intensive vs.
> > > > > storage-intensive?
> > > >
> > > > Mostly parsing; only 5-10% of the M/R output is stored to HBase.
> > > >
> > > > > C) What is your data growth?
> > > >
> > > > Currently we have 50GB per day; it could grow to ~150GB.
> > > >
> > > > > D) With the current jobs, are you saturating RAM? CPU? Or
> > > > > storage?
> > > >
> > > > The map phase takes 100% CPU, since it is parsing and the input
> > > > files are gzipped. We definitely have memory issues.
> > > >
> > > > > Ganglia/Hadoop metrics should tell.
> > > > > E) Also, are your jobs long-running or short tasks?
> > > >
> > > > Map tasks take from 5 seconds to 2 minutes; the reducer
> > > > (insertion into HBase) takes ~3 hours.
> > > >
> > > > > Suggestions:
> > > > > A) Your name node could be 32GB, 2TB disk. Make sure it is an
> > > > > enterprise-class server, and also back it up to an NFS mount.
> > > > > B) Also have a decent machine as the checkpoint name node. It
> > > > > could be similar to the task nodes.
> > > > > C) I assume by master machine you mean the JobTracker. It
> > > > > could be similar to the TaskTrackers: 16/24GB memory, with
> > > > > 4-8TB disk.
> > > > > D) As Jean-Daniel pointed out, 500GB disks (with more
> > > > > spindles) are what I would also recommend. But it also depends
> > > > > on your primary, intermediate, and final data sizes. 1 or 2TB
> > > > > disks are also fine, because they give you more storage. I
> > > > > assume you have the default replication of 3.
> > > > > E) A 1Gb dedicated network would be good. As there are only
> > > > > ~25 machines, you can hang them off of a good Gb switch.
> > > > > Consider 10Gb in the future if there is too much intermediate
> > > > > data traffic.
> > > > > Cheers
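Two points above have a code-level side: Lars's sequential-key question
and Oleg's ~3-hour reduce-side insert. Below is a minimal, hedged sketch
against the 0.20-era HBase client API; the table name "logs", the family
"data", and the bucket count are made-up placeholders, not anything from
the thread. Disabling auto-flush batches Puts into one RPC per write
buffer, and a hash-derived salt prefix spreads sequential keys across
regions instead of taxing one server at a time.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedLoad {
        // Assumption: roughly one bucket per region server.
        private static final int BUCKETS = 20;

        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "logs"); // hypothetical table
            table.setAutoFlush(false);                  // buffer Puts client-side...
            table.setWriteBufferSize(12 * 1024 * 1024); // ...flushing every ~12MB, not per Put

            for (long seq = 0; seq < 1000000; seq++) {
                // A purely sequential key hits one region at a time;
                // a short hash-derived prefix spreads the write load.
                String key = Long.toString(seq);
                int bucket = (key.hashCode() & 0x7fffffff) % BUCKETS;
                Put put = new Put(Bytes.toBytes(bucket + "-" + key));
                put.add(Bytes.toBytes("data"), Bytes.toBytes("raw"),
                        Bytes.toBytes("payload-" + seq));
                table.put(put);   // no RPC yet; lands in the write buffer
            }
            table.flushCommits(); // push whatever is still buffered
            table.close();
        }
    }

The trade-off of salting is that rows are no longer stored in their
natural key order, so an ordered scan has to fan out over all BUCKETS
prefixes.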
> > > > > On Sun, Nov 21, 2010, Oleg Ruchovets wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > After testing HBase for a few months with a very light
> > > > > > configuration (5 machines, 2TB disk, 8GB RAM), we are now
> > > > > > planning for production.
> > > > > >
> > > > > > Our load:
> > > > > > 1) 50GB of log files to process per day with Map/Reduce jobs.
> > > > > > 2) Insert 4-5GB per day into 3 HBase tables.
> > > > > > 3) Run 10-20 scans per day (scanning about 20 regions in a
> > > > > > table).
> > > > > > All this should run in parallel. Our current configuration
> > > > > > can't cope with this load and we are having many stability
> > > > > > issues.
> > > > > >
> > > > > > This is what we have in mind:
> > > > > > 1. Master machine - 32GB, 4TB, two quad-core CPUs.
> > > > > > 2. Name node - 16GB, 2TB, two quad-core CPUs.
> > > > > > We plan to have up to 20 name servers (starting with 5).
> > > > > >
> > > > > > We already read
> > > > > > http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
> > > > > >
> > > > > > We would appreciate your feedback on our proposed
> > > > > > configuration.
> > > > > >
> > > > > > Regards, Oleg & Lior
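Since the original post budgets for 10-20 daily scans of ~20 regions
each, here is a matching read-side sketch, again hedged: 0.20-era client
API, with the same hypothetical "logs" table and "data" family as in the
write sketch above. For long back-office scans the knob that matters
most is scanner caching, which controls how many rows come back per RPC.

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DailyScan {
        public static void main(String[] args) throws IOException {
            HTable table = new HTable(new HBaseConfiguration(), "logs"); // hypothetical table
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("data")); // fetch only the family the job needs
            scan.setCaching(500);                  // 500 rows per round trip, not the tiny default
            ResultScanner scanner = table.getScanner(scan);
            try {
                long rows = 0;
                for (Result r : scanner) {
                    rows++;  // the back-office aggregation would go here
                }
                System.out.println("scanned " + rows + " rows");
            } finally {
                scanner.close(); // release the server-side scanner lease promptly
            }
            table.close();
        }
    }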