From: Varun Brahme
To: user@hbase.apache.org
Subject: Using HBase to store a directory structure
Date: Tue, 22 Jul 2014 17:39:54 -0700

I am trying to use HBase to model a directory structure. Basically, we have a fixed set of nested directories, each of which could store millions of files. The directory structure is accessed by users, and every user has his/her own set. Something like:

user 1
  - dir 1
      - file 1
      - file 2
      - file 3
  - dir 2
      - file 4
      - file 5
  - dir 3
      - dir 4
          - file 6
          - file 7
      - dir 5
          - file 8
          - file 9

Each user would have a similar structure, but that set would be accessible only to that user. For example, user 2 and user 3 would have their own directories and files, and user 2 won't be able to access the files of user 1. The nesting is not very deep, and the directories and their nesting are fixed. The files in each directory are not. Each file can be in only one directory, and a directory won't hold both files and directories at the same time. Files are of course unique within a directory but may not be unique across directories.

There would be a million users, each user would have 10 pre-set directories, and there would be about a million files in each directory meant to store files. How can I best model this in HBase? A sample schema I thought of was the following:

Schema 1:

Table 1 stores a mapping of user id to directory name using a single column family; user id is the row key and dir name is the column name.
Each cell, identified by user id and column name, stores a reference id (can be an auto-increment value). Thus:

userId -> cf1 : dirName : refId

Table 2 would be a mapping between refId from table 1 and filename:

refId -> cf1 : filename : reference_to_actual_location_on_filesystem

Schema 2:

This combines the above two tables into one for better consistency:

Table
user id ->
    cf1 : dirname : timestamp_of_when_file_was_created
    cf2 : filename : reference_to_actual_location_on_fs

In both cases, I am basically dealing with big fat tables, possibly with 10 million rows by 1 billion mappings.

My question is: is HBase good at querying a table of this size, and can it serve requested data in, say, a couple of seconds to potentially thousands of users accessing it at once?

If not, is there a better schema to implement the directory structure? Maybe by splitting tables in such a way that user access becomes really fast.

Cluster size could be about 10 nodes at least but cannot be more than 100 nodes.

Thank you.
Varun
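
To make Schema 1 concrete, here is a minimal in-memory sketch in Python. It uses plain dicts as stand-ins for the two tables and is not the HBase client API; the helper names (add_directory, add_file, lookup), the counter used for refId, and the example path are all hypothetical and only illustrate the two-step read that Schema 1 implies. The constants at the end simply multiply out the numbers given in the message, taken as an upper bound where every directory stores files.

```python
from collections import defaultdict
from itertools import count

# In-memory stand-ins for HBase tables: row key -> {"family:qualifier": value}.
# Plain dicts for illustration only, not the HBase client API.
table1 = defaultdict(dict)  # Schema 1, table 1: userId -> cf1:dirName -> refId
table2 = defaultdict(dict)  # Schema 1, table 2: refId -> cf1:filename -> location

_ref_ids = count(1)  # hypothetical auto-increment source for refId


def add_directory(user_id, dir_name):
    """Register a directory for a user and return its refId."""
    ref_id = next(_ref_ids)
    table1[user_id]["cf1:" + dir_name] = ref_id
    return ref_id


def add_file(user_id, dir_name, filename, location):
    """Store a file's location in the row keyed by the directory's refId."""
    ref_id = table1[user_id]["cf1:" + dir_name]
    table2[ref_id]["cf1:" + filename] = location


def lookup(user_id, dir_name, filename):
    """Two-step read: table 1 gives the refId, table 2 gives the location."""
    ref_id = table1[user_id]["cf1:" + dir_name]
    return table2[ref_id]["cf1:" + filename]


# Scale implied by the numbers in the message (upper bounds).
USERS = 1_000_000
DIRS_PER_USER = 10
FILES_PER_DIR = 1_000_000

table2_rows = USERS * DIRS_PER_USER  # at most one table 2 row per (user, directory)
max_columns_per_row = FILES_PER_DIR  # one column per file in that directory

# Example usage with a made-up path.
ref = add_directory("user1", "dir1")
add_file("user1", "dir1", "file1", "/data/user1/dir1/file1")
print(lookup("user1", "dir1", "file1"))
```

Every file lookup in this layout needs two reads (table 1, then table 2), and each table 2 row can grow to about a million columns, which is the "big fat table" shape the question is about.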