Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of vbr@tradeshift.com designates
 209.85.220.46 as permitted sender)
From: Varun Brahme <vbr@tradeshift.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_744BB69B-F197-4723-BC59-28867A2DB4D6"
Subject: Using HBase to store a directory structure
Message-Id: <BBD9A7D9-CE70-47B2-BFDB-BBFA33D957A9@tradeshift.com>
Date: Tue, 22 Jul 2014 17:39:54 -0700
To: user@hbase.apache.org
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))

--Apple-Mail=_744BB69B-F197-4723-BC59-28867A2DB4D6
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=windows-1252

I am trying to use HBase to model a directory structure. Basically, we have
a fixed set of nested directories, each of which could store millions of
files. The directory structure is accessed by users, and every user
has his/her own set. Something like:

user 1
	- dir 1
		- file 1
		- file 2
		- file 3
	- dir 2
		- file 4
		- file 5
	- dir 3
		- dir 4
			- file 6
			- file 7
		- dir 5
			- file 8
			- file 9

Each user would have a similar structure, but that set would be
accessible only to that user. For example, user 2 and user 3 would have
their own directories and files, and user 2 won't be able to access the
files of user 1. The nesting is not very deep, and the directories and
their nesting are fixed. The files in each directory are not. Each file
can only be in one directory, and a directory won't contain both
files and directories at the same time. Files are of course unique within a
directory but may not be unique across directories.

There would be a million users, each user would have 10 pre-set
directories, and there would be about a million files in each directory
meant to store files. How can I best model this in HBase? The sample
schemas I thought of are the following:

Schema 1:
Table 1 stores a mapping of user id to directory name using a single
column family; user id is the row key and dir name is the column name. Each cell,
identified by user id and column name, stores a reference id (can be an
auto-increment value).
Thus: userId -> cf1 : dirName : refId

Table 2 would be a mapping between the refId from table 1 and the filename, as:
refId -> cf1 : filename : reference_to_actual_location_on_filesystem
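
To make the read path of Schema 1 concrete, here is a minimal in-memory sketch in Python. It is not HBase client code: plain dicts stand in for the two tables, and the function names (add_directory, add_file, list_files) and sample paths are hypothetical, chosen only for illustration.

```python
from itertools import count

# In-memory stand-ins for the two HBase tables (illustration only).
# table1: row key = userId, qualifier = dirName,  value = refId
# table2: row key = refId,  qualifier = filename, value = location on the filesystem
table1 = {}
table2 = {}
_next_ref_id = count(1)  # stands in for the auto-increment reference id


def add_directory(user_id, dir_name):
    """Create the userId -> dirName -> refId cell and return the refId."""
    row = table1.setdefault(user_id, {})
    if dir_name not in row:
        row[dir_name] = next(_next_ref_id)
    return row[dir_name]


def add_file(user_id, dir_name, filename, location):
    """Resolve the refId in table 1, then write the file cell in table 2."""
    ref_id = table1[user_id][dir_name]
    table2.setdefault(ref_id, {})[filename] = location


def list_files(user_id, dir_name):
    """Two lookups: table 1 for the refId, then table 2 for the file columns."""
    ref_id = table1[user_id][dir_name]
    # Sorting mimics HBase returning qualifiers in sorted order within a row.
    return sorted(table2.get(ref_id, {}))


add_directory("user1", "dir1")
add_directory("user1", "dir2")
add_file("user1", "dir1", "file1", "/data/a")  # hypothetical locations
add_file("user1", "dir2", "file1", "/data/b")  # same filename, different directory
```

Every listing needs two reads (one per table), and because each directory gets its own refId row, the same filename can exist in two directories without clashing.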

Schema 2:
This combines the above two tables into one for better consistency.
Table:
user id ->
	cf1 : dirname : timestamp_of_when_file_was_created
	cf2 : filename : reference_to_actual_location_on_fs
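
Schema 2 can be sketched the same way, again as an in-memory Python model rather than HBase client code. One thing the sketch has to decide: the cf2 qualifier above is just the filename, but filenames may repeat across directories, so the sketch assumes a "dirname/filename" qualifier. That prefix, the function names and the sample paths are assumptions added for illustration, not part of the schema as written.

```python
import time

# One row per user with two column families (illustration only).
# cf1: qualifier = dirname,  value = timestamp
# cf2: qualifier = filename, value = location on the filesystem
table = {}


def _row(user_id):
    """Return the user's row, creating empty column families if needed."""
    return table.setdefault(user_id, {"cf1": {}, "cf2": {}})


def add_directory(user_id, dir_name, created_at=None):
    """Write the cf1 cell for a directory."""
    _row(user_id)["cf1"][dir_name] = created_at if created_at is not None else time.time()


def add_file(user_id, dir_name, filename, location):
    """Write the cf2 cell for a file."""
    # ASSUMPTION: prefix the qualifier with the directory name so that
    # same-named files in different directories do not overwrite each other.
    qualifier = f"{dir_name}/{filename}"
    _row(user_id)["cf2"][qualifier] = location


def list_files(user_id, dir_name):
    """Single-row read, filtered by qualifier prefix."""
    prefix = dir_name + "/"
    cf2 = _row(user_id)["cf2"]
    return sorted(q[len(prefix):] for q in cf2 if q.startswith(prefix))


add_directory("user1", "dir1", created_at=0)
add_directory("user1", "dir2", created_at=0)
add_file("user1", "dir1", "file1", "/data/a")  # hypothetical locations
add_file("user1", "dir1", "file2", "/data/b")
add_file("user1", "dir2", "file1", "/data/c")
```

Here a listing touches only one row, but that row carries every file the user owns, across all directories.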

In both cases, I am basically dealing with big fat tables, possibly with
10 million rows by 1 billion mappings.
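
As a rough cross-check, the stated figures can be multiplied out in a few lines of Python. This is an upper bound only: it assumes all 10 directories of every user hold files, whereas some only hold sub-directories, so real totals would be lower. The variable names are illustrative.

```python
USERS = 1_000_000          # "a million users"
DIRS_PER_USER = 10         # "10 pre-set directories"
FILES_PER_DIR = 1_000_000  # "about a million files in each directory"

# Schema 1: table 1 has one row per user, table 2 has one row per refId
# (one refId per user/directory pair).
schema1_table1_rows = USERS
schema1_table2_rows = USERS * DIRS_PER_USER
schema1_table2_columns_per_row = FILES_PER_DIR

# Schema 2: one row per user, holding all of that user's files in cf2.
schema2_rows = USERS
schema2_cf2_columns_per_row = DIRS_PER_USER * FILES_PER_DIR

# Upper bound on file cells; identical for both schemas.
total_file_cells = USERS * DIRS_PER_USER * FILES_PER_DIR

print(f"Schema 1, table 2: {schema1_table2_rows:,} rows x "
      f"{schema1_table2_columns_per_row:,} columns")
print(f"Schema 2: {schema2_rows:,} rows x "
      f"{schema2_cf2_columns_per_row:,} cf2 columns")
print(f"File cells (upper bound): {total_file_cells:,}")
```

Under that assumption the refId table in Schema 1 has 10 million rows with up to a million columns each, while Schema 2 has a million rows with up to 10 million cf2 columns each.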

My question is: is HBase good at querying a table of such a huge size, and
can it serve the requested data in, say, a couple of seconds to potentially
thousands of users accessing it at once?

If not, is there a better schema to implement the directory
structure? Maybe by splitting the tables in such a way that user access
becomes really fast.

Cluster size could be about 10 nodes at least, but cannot be more than
100 nodes.

Thank you.

Varun

--Apple-Mail=_744BB69B-F197-4723-BC59-28867A2DB4D6--