helix-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Using Helix for HDFS serving
Date Mon, 16 Jun 2014 22:55:43 GMT
Hi folks,

We are looking at Helix for building a serving solution on top of HDFS for
data generated by MapReduce jobs. The files will be smaller than the HDFS
block size, so each file will be a single block with 3 replicas, each
replica holding the entire file. The set of files output by an MR job would
be the resource, and each file (or group of X files) would be a partition.
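
To make the file-to-partition mapping concrete, here is a minimal sketch in
Python. It is only an illustration: the function name, the partition naming
scheme, and the sorted ordering are assumptions of ours, not anything Helix
or HDFS prescribes.

```python
from typing import Dict, List


def group_into_partitions(files: List[str], files_per_partition: int) -> Dict[str, List[str]]:
    """Group MR output files into partitions of at most `files_per_partition` files.

    Hypothetical helper: partition names ("partition_<n>") and the sorted
    ordering are illustrative choices only.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    ordered = sorted(files)  # sort so the grouping is deterministic
    return {
        f"partition_{i // files_per_partition}": ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    }
```

With files_per_partition set to 1, each file is its own partition, which is
the simplest case for block affinity since a partition then maps to exactly
one HDFS block.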

We can assume that there is a container that can serve these immutable
files for lookups. Since HDFS already keeps 3 replicas, we were wondering
whether we could use Helix to serve these files from 3 logically equivalent
replicas. We need a few things:

a) In the steady state, when all HDFS blocks are triplicated, the logical
assignment of the 3 replicas should respect block affinity.

b) When a node crashes, some blocks become under-replicated both physically
and logically (from Helix's point of view). In that case, we don't want to
carry out any transitions. Over time (~20 minutes), HDFS will re-replicate
the blocks so that the physical replication factor of 3 is restored. Once
this happens, we want the logical replication to catch up to 3 and also
respect HDFS block placement.
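
Roughly, the behavior described in (a) and (b) could be sketched as follows.
This is plain Python, not Helix API: the function name, the dictionaries, and
the idea of feeding in block locations as reported by HDFS are all
assumptions made for illustration.

```python
from typing import Dict, List, Set

REPLICAS = 3  # assumed HDFS replication factor


def next_assignment(
    block_hosts: Dict[str, Set[str]],  # partition -> nodes HDFS reports as holding it
    current: Dict[str, List[str]],     # partition -> nodes currently serving it
    live_nodes: Set[str],
    replicas: int = REPLICAS,
) -> Dict[str, List[str]]:
    """Compute the logical placement so that it trails the physical one."""
    target: Dict[str, List[str]] = {}
    for partition, hosts in block_hosts.items():
        live_hosts = sorted(hosts & live_nodes)
        if len(live_hosts) >= replicas:
            # (a) Fully replicated: serve from the nodes that hold the block.
            target[partition] = live_hosts[:replicas]
        else:
            # (b) Under-replicated: keep the surviving replicas where they are
            # and start nothing new until HDFS has re-replicated the block.
            target[partition] = [n for n in current.get(partition, []) if n in live_nodes]
    return target
```

For example, if n1 crashes while serving a partition held on n1, n2 and n3,
the sketch leaves that partition on n2 and n3 only. Once HDFS reports a third
copy on, say, n4, the next call returns n2, n3 and n4.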

So there are two aspects. First, we want to retain block locality by doing
the logical assignment so that each logical partition comes up on the same
nodes that host the physical partition. Second, we want the logical
placement to trail the physical placement (as determined by HDFS). So the
cluster could stay in a non-ideal state for a long time, say 20-30 minutes.

Please let us know if these are feasible with Helix and, if so, what the
recommended practices would be.

