Need to evaluate a cluster
Hi Olivier,


Could you advise, please?

Either you spend your money on servers with more disks, or you spend your money on cooling
and power consumption, and potentially on building a new DC ;).

A typical server from a tier 1 vendor (HP, Dell, IBM, Cisco) should be around 5k euros
(fully loaded with HDDs).

Kind regards,

Thanks for your reply, Mirko,

In my case, I can assume a compression factor of 8, according to the service in charge of

The data I’m dealing with is logs only, but of many types (printing logs, USB logs,
remote access logs, Active Directory logs, database server logs, web server logs,
antivirus logs, etc.).
To be precise, in my case only logs are stored. Sometimes we may have CSV files,
but no videos or images are considered here.

Any advice for that specific type of data?
What are the reasons to consider servers with 12 HDDs (3 TB) per server? Knowing that, I prefer
What could be the price of a low-cost server with 12 HDDs (3 TB)?


I multiply by 1.3, which means I add 30% to the estimated amount to reserve capacity
for intermediate data.
In your case, with approx. 2 TB per day, I think data nodes with 1 to 3 disks are not a good
idea. You should consider servers with more disks and then add one per week. Start with 10
servers and 12 HDDs (3 TB) per server. This allows you to handle approx. 35 TB of raw
uncompressed data. You have to evaluate compression in your specific case. It can be high,
but also not very high if the raw data is already compressed somehow. What data are you
dealing with? Text, messages, logs, or more binary data like images, MP3, or video formats?

What does « 1.3 for overhead » mean in this calculation?


If I follow your numbers, I see one missing fact: what is the number of HDDs per DataNode?
Let's assume you use machines with 6 x 3 TB HDDs per box; you would then need about 60
DataNodes per year (0.75 TB per day x 365 days x 3 for replication x 1.3 for overhead /
(number of HDDs per node x capacity per HDD)).
With 12 HDDs per node you would only need 30 servers per year.
How did you calculate the number of 367 DataNodes?
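
For reference, here is a minimal Python sketch of the sizing formula quoted above. It
assumes the daily volume is accumulated over 365 days, which the "per year" figures imply.
The function name and parameter names are only illustrative and do not come from any
Hadoop tool.

```python
import math

def datanodes_per_year(daily_tb, hdds_per_node, tb_per_hdd,
                       replication=3, overhead=1.3, days=365):
    """Estimate how many DataNodes one year of ingest would fill.

    Assumption: every HDD is fully available to HDFS and the 1.3 factor
    already covers intermediate data.
    """
    raw_tb = daily_tb * days * replication * overhead
    node_capacity_tb = hdds_per_node * tb_per_hdd
    return math.ceil(raw_tb / node_capacity_tb)

print(datanodes_per_year(0.75, hdds_per_node=6, tb_per_hdd=3))   # 60
print(datanodes_per_year(0.75, hdds_per_node=12, tb_per_hdd=3))  # 30
```

Doubling the number of disks per node halves the node count, which is the main lever in
this estimate before compression is taken into account.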


Hello Dear,

I made an estimate of the number of nodes for a cluster that is fed with 720 GB of data per day.
My estimate gave me 367 datanodes in a year. I’m a bit worried by that number of datanodes.
The assumptions I used are the following:

- Daily supply (feed): 720 GB

- HDFS replication factor: 3

- Reserved space on each disk outside HDFS: 30%

- Disk size: 3 TB
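
As a cross-check, the assumptions listed above can be turned into a short Python sketch.
This is only one possible reading of the arithmetic, not a confirmed method: it assumes
1 TB = 1024 GB and 365 days of feed, and it counts disks rather than servers. The function
name is hypothetical.

```python
import math

def disks_per_year(daily_gb, replication=3, reserved_fraction=0.30,
                   disk_tb=3, gb_per_tb=1024, days=365):
    """Count the 3 TB disks needed to hold one year of replicated data."""
    stored_tb = daily_gb * days * replication / gb_per_tb
    # Only the part of each disk that is not reserved is usable by HDFS.
    usable_tb_per_disk = disk_tb * (1 - reserved_fraction)
    return math.ceil(stored_tb / usable_tb_per_disk)

disks = disks_per_year(720)
print(disks)                  # 367
print(math.ceil(disks / 12))  # 31
```

Under this reading, 367 is a number of disks. It only equals a number of datanodes if each
node holds a single disk; with 12 disks per server the same data fits on about 31 servers.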

I have two questions.

First, I would like to know whether my assumptions are sound.
Second, could someone help me evaluate this cluster, so that I can be sure my results
are not excessive?

Standing by for your feedback

Warm regards

