hbase-user mailing list archives

From Arun Allamsetty <arun.allamse...@gmail.com>
Subject Re: Using HBase to store a directory structure
Date Wed, 23 Jul 2014 04:13:14 GMT
Hi Varun,

I am still learning HBase here, so the experts can point out the mistakes I
make. Your problem seems to be something that can be easily mapped to an
HBase table structure.

Firstly, never store references in HBase. HBase has no joins, so resolving a
reference costs an extra lookup and just makes your queries slower. Instead,
always denormalize, even if it means redundant data. Disk is cheap, so it's
not that big an issue.

So my approach for your use case would be to use a composite row key
consisting of the user ID and the directory path, with the file names as
column qualifiers in that row. If you want to store a file's content, it can
be the cell value; otherwise the value can be a boolean denoting whether the
file is new or has been modified (or even an arbitrary placeholder). A
timestamp is native to every HBase cell, so you can keep your timestamp there.
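
To make the layout concrete, here is a minimal sketch in plain Python. It
does not use the HBase client at all; it only models a table as a dict of
rows, where each row maps column qualifiers (file names) to cell values. The
NUL separator, the helper names, and the one-byte placeholder value are my
own assumptions for illustration, not anything HBase prescribes.

```python
SEP = b"\x00"  # assumed separator between user ID and directory path


def make_row_key(user_id: str, dir_path: str) -> bytes:
    """Build the composite row key: <user id> SEP <directory path>."""
    return user_id.encode("utf-8") + SEP + dir_path.encode("utf-8")


def put_file(table: dict, user_id: str, dir_path: str,
             file_name: str, value: bytes = b"\x01") -> None:
    """Record a file as one column in the row for (user, directory).

    The value defaults to a one-byte placeholder; it could equally hold
    the file content or a "modified" flag.
    """
    row = table.setdefault(make_row_key(user_id, dir_path), {})
    row[file_name.encode("utf-8")] = value


# In-memory stand-in for the table: row key -> {column qualifier -> value}
table = {}
put_file(table, "user1", "dir1", "file1")
put_file(table, "user1", "dir1", "file2")
put_file(table, "user1", "dir3/dir4", "file6")
```

With this layout, every file in a directory lives in the single row keyed by
that user and path, so listing a directory is one row read.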

Since your directory structure is fixed, and I am assuming you'll know the
user IDs beforehand, you'll easily be able to access all the filenames for
a particular path.
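
As a rough illustration of that access pattern, the sketch below simulates
it in plain Python, again without the HBase client. Row keys in HBase are
kept in sorted byte order, so a sorted list stands in for the table here.
The NUL separator and the function names are assumptions of mine. Fetching
one directory is an exact key lookup; fetching all directories of a user is
a range scan over the key prefix.

```python
from bisect import bisect_left

SEP = b"\x00"  # assumed separator between user ID and directory path


def row_key(user_id: str, dir_path: str) -> bytes:
    return user_id.encode("utf-8") + SEP + dir_path.encode("utf-8")


def prefix_range(prefix: bytes) -> tuple:
    """Return (start, stop) covering every key that begins with prefix.

    start is inclusive, stop is exclusive. This simple version assumes
    the prefix is non-empty and its last byte is not 0xff.
    """
    return prefix, prefix[:-1] + bytes([prefix[-1] + 1])


def scan(sorted_keys: list, start: bytes, stop: bytes) -> list:
    """Return the keys in [start, stop) from a sorted list of keys."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted([
    row_key("user1", "dir1"),
    row_key("user1", "dir2"),
    row_key("user1", "dir3/dir4"),
    row_key("user10", "dir1"),
    row_key("user2", "dir1"),
])

# All rows for user1: the separator in the prefix keeps user10 out.
start, stop = prefix_range(b"user1" + SEP)
user1_rows = scan(keys, start, stop)
```

Including the separator in the scan prefix matters: without it, a scan for
"user1" would also pick up the rows of "user10".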

As far as cluster size is concerned, it really depends on the hardware you
plan to use. If you are planning to use cloud services, my recommendation
would be to buy dedicated instances rather than the virtualized shared ones
(some vendors provide those), because sharing does affect cluster
performance. I've seen it firsthand.

Hope this helps.


Sent from a mobile device. Please don't mind the typos.
On Jul 22, 2014 7:45 PM, "Varun Brahme" <vbr@tradeshift.com> wrote:

> I am trying to use HBase to model a directory structure. Basically we have
> a fixed set of nested directories that could store millions of files
> each. The directory structure is accessed by users and every user has
> his/her own set. Something like
> user 1
>         - dir 1
>                 - file 1
>                 - file 2
>                 - file 3
>         - dir 2
>                 - file 4
>                 - file 5
>         - dir 3
>                 - dir 4
>                         - file 6
>                         - file 7
>                 - dir 5
>                         - file 8
>                         - file 9
> Each user would have a similar structure but that set would be
> accessible only to that user. For e.g. user 2 and user 3 would have their
> own directories and files and user 2 won’t be able to access the files in
> user 1. The nesting is not very deep and the directories and their nesting
> is fixed. The files in each directory are not. Each file can only be in one
> directory and a directory won’t be having both files and directories at the
> same time. Files are of course unique in a directory but may not be unique
> across directories.
> There would be a million users, each user would have 10 pre-set
> directories and there would be about a million files in each directory
> meant to store files. How can I best model this in HBase. A sample schema I
> thought of was the following:
> Schema 1:
> Table 1 stores a mapping of user id to directory name using a single
> column family, user id is row key and dir name is column name. Each cell
> represented by user id and column name stores a reference id (can be an
> auto increment value)
> Thus userId -> cf1: dirName : refId
> Table 2 would be a mapping between refId from table 1 and filename as
> RefId -> cf1: filename : reference_to_actual_location_on_filesystem
> Schema 2:
> This combines above two tables into one for better consistency
> Table
> user id ->
>         cf1 : dirname : timestamp_of_when_file_was_created
>         cf2 : filename : reference_to_actual_location_on_fs
> In both cases, I am basically dealing with big fat tables possibly with 10
> million rows by 1 billion mappings.
> My question is: is HBase good at querying such a huge table size and can it
> serve requested data in say a couple of secs to potentially 1000s of users
> accessing at once?
> If not then is there a better schema to implement the directory structure?
> Maybe by splitting tables in such a way that user access becomes really fast.
> Cluster size could be about 10 nodes at least but cannot be more than a
> 100 nodes.
> Thank you.
> Varun
