hadoop-hdfs-user mailing list archives

From Vincent Boucher <vin.bouc...@gmail.com>
Subject Re: blocks with a huge size?
Date Tue, 11 Oct 2011 17:12:55 GMT
Hi Will-

On Tuesday 11 October 2011, Will Maier wrote:
> As I described earlier, our ~1PB cluster is very similar: some large
> servers, some small ones. We store data recorded at and simulated for the
> CMS detector at the LHC, as well as the products of our users' physics
> analysis. Like you, much of our data can be redownloaded to our Tier2 from
> the nearest Tier1 (Fermilab, USA).

Indeed, it's a similar setup at a smaller scale: a Tier2 for CMS at UCLouvain, 
Belgium, with data and simulation from the whole collaboration, plus stage-out 
files from the Grid for our local users.

> And still, we've gone with a fairly standard configuration, with 2x
> replication across the board. I strongly suggest trying to design your
> cluster to exploit the good parts of HDFS instead of making it look like
> whatever your previous system was (ours was dCache, a distributed disk
> cache developed in and for the high energy physics community). HDFS doesn't
> map replicas to servers and works best with a replication factor >=2. These
> were new constraints for us when we migrated our storage cluster, but the
> benefits of storing data on commodity servers (our worker nodes) with
> inbuilt replication/resiliency and great throughput more than compensated.

We were working with a homemade FUSE solution built on top of NFS, with a 
metadata server hosting symbolic links to the different mass storage servers. 
The data was spread over the mass storage servers at the file level, not the 
block level.
It achieved great performance and worked very well, as long as everything was 
working well. For example, when an NFS server fails, the clients behave 
strangely (the whole NFS mount hangs). And there are no clean admin tools.

For us, it is difficult to set 2x replication while maintaining the current 
amount of data we serve to the Collaboration.
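For reference, the cluster-wide default replication factor is set in 
hdfs-site.xml; a minimal sketch (standard property name, value shown here is 
just the 2x discussed above):

```xml
<!-- hdfs-site.xml fragment: default replication for newly written files -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```

Note that this only affects files written after the change; existing files 
keep their old factor unless changed explicitly, e.g. with 
`hadoop fs -setrep -R 2 /path`.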

An alternative would be to switch the mass storage servers, which currently 
each expose one large ZFS partition, to as many independent partitions as the 
number of drives they host (typically 70 drives per server), and set 2x 
replication at the Hadoop level. The volume freed by dropping the RAID is not 
enough to compensate for the replication, but that would be a first step.
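The capacity argument can be made concrete with back-of-the-envelope numbers. 
The RAID layout and drive size below are assumptions for illustration (the 
post does not state them); RAID-Z2 on 10-disk vdevs is one plausible layout 
for a 70-drive server:

```python
# Back-of-the-envelope capacity comparison for one 70-drive server.
# Hypothetical numbers: drive size and RAID-Z2 on 10-disk vdevs are
# assumed, not taken from the post.
drives_per_server = 70
drive_tb = 2.0  # assumed drive size, TB

raw_tb = drives_per_server * drive_tb  # 140 TB raw

# Current layout: 7 RAID-Z2 vdevs of 10 drives each
# (2 parity drives per vdev), no HDFS replication.
raidz2_usable = raw_tb * (10 - 2) / 10  # ~20% lost to parity

# Proposed layout: JBOD (one partition per drive), HDFS replication 2.
jbod_usable = raw_tb / 2  # 50% lost to replication

print(f"RAID-Z2, no replication: {raidz2_usable:.0f} TB usable")
print(f"JBOD, 2x replication:    {jbod_usable:.0f} TB usable")
# Dropping RAID parity frees ~20%, but 2x replication costs 50%,
# so usable capacity still shrinks -- matching the point above.
```

At the Hadoop level, the per-drive partitions would simply be listed as a 
comma-separated set of DataNode storage directories (the `dfs.data.dir` 
property in 0.20-era releases), so the DataNode spreads blocks across all 
drives independently.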

