cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From daemeon reiydelle <>
Subject Re: Arbitrary nested tree hierarchy data model
Date Sat, 28 Mar 2015 18:02:59 GMT
Fascinating. Both the mysql front and and this delightful inverted search
solution. Your creativity makes me wonder what other delights your query
solutions might expose!!

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +
On Mar 27, 2015 7:08 PM, "Robert Wille" <> wrote:

>  Okay, this is going to be a pretty long post, but I think its an
> interesting data model, and hopefully someone will find it worth going
> through.
>  First, I think it will be easier to understand the modeling choices I
> made if you see the end product. Go to
>|hzUkLqDmI. What you see looks like
> one big tree, but actually is a combination of trees spliced together.
> There is one tree in a relational database that forms what I call the
> top-level browse. The top-level browse is used to navigate through
> categories until you arrive at a publication. When you drill down into a
> publication, you are then viewing data stored in Cassandra. The link
> provided above points to the root of a publication (in this case, maps from
> the Civil War), so to the left is top-level browse coming from MySQL, and
> to the right is the Cassandra browse. Each publication has an independent
> tree in Cassandra, with all trees stored in the same set of tables (I do
> not dynamically create tables for each publication — I personally think
> that’s a bad practice). We currently have 458 publications, and
> collectively they have about half a billion nodes and consume about 400 GB
> (RF=3).
>  My trees are immutable. When there are changes to a publication (e.g.
> adding new documents), it is very difficult to know what changes need to be
> made to the tree to edit it in-place. Also, it would be impossible to
> maintain navigational consistency while a tree is in process of being
> restructured. So, when a publication changes, I create a completely new
> tree. Once the new tree is built, I change a reference to point to the new
> tree. I have a process that nightly pages through the tables and deletes
> records that belong to obsolete trees. This process takes about five hours.
> If it were much longer than that, I would probably run it weekly. My
> database has quite a bit of churn, which is fairly unusual for a
> Cassandra-based application. Most nights build two or three trees,
> generally resulting in a few tens of millions of new records and a slightly
> fewer number of deletions. Size-tiered compaction is a bad choice for
> churn, so I use leveled compaction. Most publications are at most a few
> million nodes, and generally build in less than 20 minutes.
>  Since any modeling exercise requires knowing the queries, I should
> describe that before getting into the model. Here are the features I need
> to support. For browsing the tree, I need to be able to get the children of
> a node (paginated), the siblings of a node (also paginated), and the
> ancestors of a node. The leaves of each tree are images and form a
> filmstrip. You can use the filmstrip to navigate through all the images in
> a publication in the tree’s natural order. If you go to my browse page and
> keep drilling down, you’ll eventually get to an image. The filmstrip
> appears at the bottom of the image viewer.
>  Before I discuss the schema, I should discuss a couple of other
> non-obvious things that are relevant to the data model. One very common
> operation is to retrieve a node and all of its ancestors in order to
> display a path. Denormalization would suggest that I store the data for
> each node, along with that of all of its ancestors. That would mean that in
> my biggest tree, I would store the root node 180 million times. I didn’t
> consider that kind of bloat to be acceptable, so I do not denormalize
> ancestors. I also wanted to retrieve a node and its ancestors in constant
> time, rather than O(n) as would be typical for tree traversal. In order to
> accomplish this, I use a pretty unique idea for a node's primary key. I
> create a hash from information in the node, and then append it to the hash
> of its parent. So, the primary key is really a path. When I need to
> retrieve a node and its ancestors, I tokenize the path and issue queries in
> parallel to get all the nodes in the ancestry at the same time. In keeping
> with this pattern of not denormalizing, my auxiliary tables do not have
> node data in them, but instead provide a means of getting hash paths, which
> I then tokenize and make parallel requests with. Most requests that use an
> auxiliary table can generally just make a query to the auxiliary table to
> get the hash path, and then retrieve the node and its ancestors from the
> node table. Three or fewer trips to Cassandra are sufficient for all my
> API’s.
>  Without further ado, here’s my schema (with commentary):
>  CREATE TABLE tree (
> tree INT,
> pub INT,
> rhpath VARCHAR,
> atime TIMESTAMP,
> ccount INT,
> ncount INT,
> PRIMARY KEY (tree)
> ) WITH gc_grace_seconds = 864000;
>  This table maintains the references to the root nodes for each tree. pub
> is the primary key for the publication table in my relational database.
> There is usually just one record for each publication. When a tree is being
> built (and until the old one is retired), a publication may have more
> than one tree. This table is small (458 records), and I cache it in memory
> and maintain inverted indexes. Not caching this table would result in one
> additional trip to Cassandra for all API’s. Each tree is assigned a random
> tree ID (which is an INT instead of UUID for reasons beyond this
> discussion). All my tables have a tree ID in them so I can know which tree
> each node belongs to, since the hash path is not unique.
>  CREATE TABLE node (
> hpath VARCHAR,
> tree INT,
> node VARCHAR,
> ccount INT,
> PRIMARY KEY (hpath, tree)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>  This is the main table that holds all the text and ordinal information
> for all my nodes. This table contains the lion’s share of the data in my
> cluster. The node column is a JSON document. ccount is the number of child
> nodes.
>  CREATE TABLE path_by_page (
> page BIGINT,
> tree INT,
> hpath VARCHAR,
> pord INT,
> ord INT,
> PRIMARY KEY (page, tree)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>  This table allows me to get the hash path for any image. page is the
> primary key of the page from my relational database. pord is the image’s
> ordinal in the publication filmstrip. ord is the page’s ordinal amongst its
> siblings.
>  CREATE TABLE path_by_pub (
> tree INT,
> bucket INT,
> pord INT,
> ord INT,
> hpath VARCHAR,
> page BIGINT,
> PRIMARY KEY ((tree, bucket), pord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>  This table allows me to do the filmstrip. pord is what I key off of to
> paginate.
>  CREATE TABLE path_by_parent (
> phpath VARCHAR,
> bucket INT,
> tree INT,
> ord INT,
> hpath VARCHAR,
> PRIMARY KEY ((phpath, bucket, tree), ord)
> ) WITH gc_grace_seconds = 0 AND compaction = { 'class' :
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
>  This table allows me to get the children for a node. ord is the node's
> ordinal within its siblings. It is what I key off of to paginate. The link
> I provided above is to a publication that has a fanout of 1 for the leaf
> node’s parent nodes, so isn’t very interesting (although the content is
> very interesting). Here’s a more interesting node that has bigger fanout:
>|hrRUXqv6Rj6GFxKAo. And finally,
> here’s a node with a fanout of 378622:
>  As long as this post is, it probably wasn’t enough to fully understand
> everything I do with my schema. I have dozens of queries. If anyone would
> like me to dig a little deeper, I’d be happy to. Just email me.
>  Robert
>  On Mar 27, 2015, at 5:35 PM, Ben Bromhead <> wrote:
>  +1 would love to see how you do it
> On 27 March 2015 at 07:18, Jonathan Haddad <> wrote:
>> I'd be interested to see that data model. I think the entire list would
>> benefit!
>> On Thu, Mar 26, 2015 at 8:16 PM Robert Wille <> wrote:
>>> I have a cluster which stores tree structures. I keep several hundred
>>> unrelated trees. The largest has about 180 million nodes, and the smallest
>>> has 1 node. The largest fanout is almost 400K. Depth is arbitrary, but in
>>> practice is probably less than 10. I am able to page through children and
>>> siblings. It works really well.
>>> Doesn’t sound like its exactly like what you’re looking for, but if you
>>> want any pointers on how I went about implementing mine, I’d be happy to
>>> share.
>>> On Mar 26, 2015, at 3:05 PM, List <> wrote:
>>> > Not sure if this is the right place to ask, but we are trying to model
>>> a user-generated tree hierarchy in which they create child objects of a
>>> root node, and can create an arbitrary number of children (and children of
>>> children, and on and on).  So far we have looked at storing each tree
>>> structure as a single document in JSON format and reading/writing it out in
>>> it's entirety, doing materialized paths where we store the root id with
>>> every child and the tree structure above the child as a map, and some form
>>> of an adjacency list (which does not appear to be very viable as looking up
>>> the entire tree would be ridiculous).
>>> >
>>> > The hope is to end up with a data model that allows us to display the
>>> entire tree quickly, as well as see the entire path to a leaf when
>>> selecting that leaf.  If anyone has some suggestions/experience on how to
>>> model such a tree heirarchy we would greatly appreciate your input.
>>> >
>  --
> Ben Bromhead
> Instaclustr | | @instaclustr
> <> | (650) 284 9692

View raw message