hbase-user mailing list archives

From Varun Brahme <...@tradeshift.com>
Subject Using HBase to store a directory structure
Date Wed, 23 Jul 2014 00:39:54 GMT
I am trying to use HBase to model a directory structure. Basically, we have a fixed set of nested
directories that could each store millions of files. The directory structure is accessed
by users, and every user has his/her own set. Something like:

user 1
	- dir 1
		- file 1
		- file 2
		- file 3
	- dir 2
		- file 4
		- file 5
	- dir 3
		- dir 4
			- file 6
			- file 7
		- dir 5
			- file 8
			- file 9

Each user would have a similar structure, but that set would be accessible only to that
user. For example, user 2 and user 3 would have their own directories and files, and user 2 won’t
be able to access the files of user 1. The nesting is not very deep, and the directories and
their nesting are fixed. The files in each directory are not. Each file can only be in one directory,
and a directory won’t contain both files and directories at the same time. Files are of
course unique within a directory but may not be unique across directories.

There would be a million users, each user would have 10 pre-set directories, and there would
be about a million files in each directory meant to store files. How can I best model this
in HBase? A sample schema I thought of is the following:

Schema 1:
Table 1 stores a mapping of user id to directory name using a single column family: user id
is the row key and dir name is the column name. Each cell, identified by user id and column name, stores
a reference id (can be an auto-increment value).
Thus userId -> cf1: dirName : refId

Table 2 would be a mapping between refId from table 1 and filename as
RefId -> cf1: filename : reference_to_actual_location_on_filesystem
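
To make the two-step lookup in Schema 1 concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for the two tables. It is not HBase client code; the names table1, table2, add_dir, add_file and lookup are made up for illustration, and the counter is only an assumed source for the auto-increment refId.

```python
# In-memory stand-in for Schema 1 (plain dicts, NOT the HBase client API).
# All names below are hypothetical and exist only for illustration.
from itertools import count

_next_ref_id = count(1)  # assumed auto-increment source for refId

table1 = {}  # userId -> {dirName: refId}        (cf1 of table 1)
table2 = {}  # refId  -> {filename: location}    (cf1 of table 2)


def add_dir(user_id, dir_name):
    """Register a directory for a user and return its new refId."""
    ref_id = next(_next_ref_id)
    table1.setdefault(user_id, {})[dir_name] = ref_id
    table2[ref_id] = {}
    return ref_id


def add_file(user_id, dir_name, filename, location):
    """Store a file's location under the refId of its directory."""
    ref_id = table1[user_id][dir_name]
    table2[ref_id][filename] = location


def lookup(user_id, dir_name, filename):
    """Resolve a file: one read on table 1, then one read on table 2."""
    ref_id = table1[user_id][dir_name]
    return table2[ref_id][filename]
```

Every file access in this layout needs two reads, first userId -> refId and then refId -> location.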

Schema 2:
This combines the above two tables into one for better consistency:
user id -> 
	cf1 : dirname : timestamp_of_when_file_was_created
	cf2 : filename : reference_to_actual_location_on_fs
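
Schema 2 can be sketched the same way, as one nested dictionary of the form row key -> column family -> column -> value. Again this is plain Python rather than the HBase API, and the names table, add_dir, add_file and lookup are hypothetical.

```python
# In-memory stand-in for Schema 2 (plain dicts, NOT the HBase client API).
# All names below are hypothetical and exist only for illustration.

table = {}  # userId -> {"cf1": {dirName: timestamp}, "cf2": {filename: location}}


def _row(user_id):
    """Return the row for a user, creating both column families if missing."""
    return table.setdefault(user_id, {"cf1": {}, "cf2": {}})


def add_dir(user_id, dir_name, timestamp):
    """cf1 : dirname : timestamp, as in the schema above."""
    _row(user_id)["cf1"][dir_name] = timestamp


def add_file(user_id, filename, location):
    """cf2 : filename : reference to the actual location on the filesystem."""
    _row(user_id)["cf2"][filename] = location


def lookup(user_id, filename):
    """Resolve a file with a single read on the user's row."""
    return table[user_id]["cf2"][filename]
```

In this sketch cf2 is keyed by filename alone, exactly as the schema is written, so two files with the same name in different directories of one user would land in the same column.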

In both cases, I am basically dealing with big fat tables possibly with 10 million rows by
1 billion mappings.

My question is: is HBase good at querying a table of such a huge size, and can it serve the requested
data in, say, a couple of seconds to potentially thousands of users accessing it at once?

If not, then is there a better schema to implement the directory structure? Maybe by splitting
tables in such a way that user access becomes really fast.

The cluster size could be at least about 10 nodes but cannot be more than 100 nodes.

Thank you.
