Subject: Re: Arbitrary nested tree hierarchy data model
From: daemeon reiydelle <daemeonr@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 28 Mar 2015 11:02:59 -0700

Fascinating. Both the mysql front end and this delightful inverted search solution. Your creativity makes me wonder what other delights your query solutions might expose!

sent from my mobile
Daemeon C.M. Reiydelle
USA 415.501.0198
London +44.0.20.8144.9872

On Mar 27, 2015 7:08 PM, "Robert Wille" <rwille@fold3.com> wrote:

Okay, this is going to be a pretty long post, but I think it's an interesting data model, and hopefully someone will find it worth going through.

First, I think it will be easier to understand the modeling choices I made if you see the end product. Go to http://www.fold3.com/browse.php#249|hzUkLqDmI.
What you see looks like one big tree, but it is actually a combination of trees spliced together. There is one tree in a relational database that forms what I call the top-level browse. The top-level browse is used to navigate through categories until you arrive at a publication. When you drill down into a publication, you are then viewing data stored in Cassandra. The link provided above points to the root of a publication (in this case, maps from the Civil War), so to the left is the top-level browse coming from MySQL, and to the right is the Cassandra browse. Each publication has an independent tree in Cassandra, with all trees stored in the same set of tables (I do not dynamically create tables for each publication; I personally think that's a bad practice). We currently have 458 publications; collectively they have about half a billion nodes and consume about 400 GB (RF=3).

My trees are immutable. When there are changes to a publication (e.g. adding new documents), it is very difficult to know what changes would be needed to edit the tree in place. Also, it would be impossible to maintain navigational consistency while a tree is in the process of being restructured. So, when a publication changes, I build a completely new tree. Once the new tree is built, I change a reference to point to it. I have a process that nightly pages through the tables and deletes records that belong to obsolete trees. This process takes about five hours. If it were much longer than that, I would probably run it weekly. My database has quite a bit of churn, which is fairly unusual for a Cassandra-based application. Most nights build two or three trees, generally resulting in a few tens of millions of new records and slightly fewer deletions. Size-tiered compaction is a bad choice for churn, so I use leveled compaction. Most publications have at most a few million nodes, and generally build in less than 20 minutes.

Since any modeling exercise requires knowing the queries, I should describe those before getting into the model. Here are the features I need to support. For browsing the tree, I need to be able to get the children of a node (paginated), the siblings of a node (also paginated), and the ancestors of a node. The leaves of each tree are images and form a filmstrip. You can use the filmstrip to navigate through all the images in a publication in the tree's natural order. If you go to my browse page and keep drilling down, you'll eventually get to an image. The filmstrip appears at the bottom of the image viewer.

Before I discuss the schema, I should discuss a couple of other non-obvious things that are relevant to the data model. One very common operation is to retrieve a node and all of its ancestors in order to display a path. Denormalization would suggest that I store the data for each node along with that of all of its ancestors. That would mean that in my biggest tree, I would store the root node 180 million times. I didn't consider that kind of bloat acceptable, so I do not denormalize ancestors. I also wanted to retrieve a node and its ancestors in constant time, rather than the O(n) round trips typical of tree traversal. To accomplish this, I use a somewhat unusual idea for a node's primary key: I create a hash from information in the node, and then append it to the hash of its parent. So the primary key is really a path. When I need to retrieve a node and its ancestors, I tokenize the path and issue queries in parallel to get all the nodes in the ancestry at the same time. In keeping with this pattern of not denormalizing, my auxiliary tables do not hold node data; instead they provide a means of getting hash paths, which I then tokenize and make parallel requests with. Most requests that use an auxiliary table make one query to the auxiliary table to get the hash path, and then retrieve the node and its ancestors from the node table. Three or fewer trips to Cassandra are sufficient for all my APIs.
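To make the parallel ancestor fetch concrete, here is a minimal sketch using the Python driver. The actual tokenization scheme isn't shown in this post, so the fixed segment width, the keyspace name, and the helper name below are illustrative assumptions, not the production code:

# Minimal sketch of the parallel ancestor fetch; assumes each node
# contributes a fixed-width segment to the hash path. SEGMENT_LEN,
# the keyspace name, and the helper name are illustrative.
from cassandra.cluster import Cluster

SEGMENT_LEN = 8  # assumed width of one node's hash within the path

session = Cluster(['127.0.0.1']).connect('browse')  # keyspace assumed
select_node = session.prepare(
    'SELECT hpath, node, ccount FROM node WHERE hpath = ? AND tree = ?')

def fetch_with_ancestors(hpath, tree):
    # Tokenize the hash path into its ancestor prefixes, then issue all
    # of the queries at once; the whole ancestry comes back in one
    # round-trip's worth of latency instead of one trip per level.
    prefixes = [hpath[:i]
                for i in range(SEGMENT_LEN, len(hpath) + 1, SEGMENT_LEN)]
    futures = [session.execute_async(select_node, (p, tree))
               for p in prefixes]
    return [f.result().one() for f in futures]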
Without further ado, here's my schema (with commentary):

CREATE TABLE tree (
    tree INT,
    pub INT,
    rhpath VARCHAR,
    atime TIMESTAMP,
    ccount INT,
    ncount INT,
    PRIMARY KEY (tree)
) WITH gc_grace_seconds = 864000;

This table maintains the references to the root nodes for each tree. pub is the primary key for the publication table in my relational database. There is usually just one record for each publication. While a tree is being built (and until the old one is retired), a publication may have more than one tree. This table is small (458 records), and I cache it in memory and maintain inverted indexes. Not caching this table would cost one additional trip to Cassandra for every API. Each tree is assigned a random tree ID (which is an INT instead of a UUID for reasons beyond this discussion). All my tables have a tree ID in them so I can know which tree each node belongs to, since the hash path is not unique.

CREATE TABLE node (
    hpath VARCHAR,
    tree INT,
    node VARCHAR,
    ccount INT,
    PRIMARY KEY (hpath, tree)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This is the main table that holds all the text and ordinal information for all my nodes, and it contains the lion's share of the data in my cluster. The node column is a JSON document. ccount is the number of child nodes.

CREATE TABLE path_by_page (
    page BIGINT,
    tree INT,
    hpath VARCHAR,
    pord INT,
    ord INT,
    PRIMARY KEY (page, tree)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to get the hash path for any image. page is the primary key of the page from my relational database. pord is the image's ordinal in the publication filmstrip. ord is the page's ordinal amongst its siblings.

CREATE TABLE path_by_pub (
    tree INT,
    bucket INT,
    pord INT,
    ord INT,
    hpath VARCHAR,
    page BIGINT,
    PRIMARY KEY ((tree, bucket), pord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to do the filmstrip. pord is what I key off of to paginate.

CREATE TABLE path_by_parent (
    phpath VARCHAR,
    bucket INT,
    tree INT,
    ord INT,
    hpath VARCHAR,
    PRIMARY KEY ((phpath, bucket, tree), ord)
) WITH gc_grace_seconds = 0 AND compaction = { 'class' : 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };

This table allows me to get the children of a node. ord is the node's ordinal within its siblings, and is what I key off of to paginate.
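As an illustration of that pagination, here is a hedged sketch of a children query, reusing the session from the earlier sketch. The page size, the single-bucket simplification, and the helper name are assumptions for illustration:

# Hedged sketch: one page of children from path_by_parent, resuming
# after the last ord seen on the previous page. For very wide fanouts
# the sibling buckets would need to be read as well; bucketing details
# are not described in the post and are omitted here.
get_children = session.prepare(
    'SELECT ord, hpath FROM path_by_parent '
    'WHERE phpath = ? AND bucket = ? AND tree = ? AND ord > ? LIMIT ?')

def children_page(phpath, bucket, tree, after_ord=-1, page_size=50):
    # Clustering on ord makes "next page" a simple range continuation:
    # pass the largest ord from the previous page to resume after it.
    rows = session.execute(get_children,
                           (phpath, bucket, tree, after_ord, page_size))
    return list(rows)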
The link I provided above is to a publication that has a fanout of 1 for the leaf nodes' parent nodes, so it isn't very interesting (although the content is very interesting). Here's a more interesting node with a bigger fanout: http://www.fold3.com/browse.php#246|hrRUXqv6Rj6GFxKAo. And finally, here's a node with a fanout of 378622: http://www.fold3.com/browse.php#1|hhqJwp03TQBwCAyoD.

As long as this post is, it probably isn't enough to fully understand everything I do with my schema. I have dozens of queries. If anyone would like me to dig a little deeper, I'd be happy to. Just email me.

Robert

On Mar 27, 2015, at 5:35 PM, Ben Bromhead <ben@instaclustr.com> wrote:

+1 would love to see how you do it

On 27 March 2015 at 07:18, Jonathan Haddad <jon@jonhaddad.com> wrote:

I'd be interested to see that data model. I think the entire list would benefit!

On Thu, Mar 26, 2015 at 8:16 PM Robert Wille <rwille@fold3.com> wrote:

I have a cluster which stores tree structures. I keep several hundred unrelated trees. The largest has about 180 million nodes, and the smallest has 1 node. The largest fanout is almost 400K. Depth is arbitrary, but in practice is probably less than 10. I am able to page through children and siblings. It works really well.

It doesn't sound like it's exactly what you're looking for, but if you want any pointers on how I went about implementing mine, I'd be happy to share.

On Mar 26, 2015, at 3:05 PM, List <list@airstreamcomm.net> wrote:

Not sure if this is the right place to ask, but we are trying to model a user-generated tree hierarchy in which users create child objects of a root node, and can create an arbitrary number of children (and children of children, and so on). So far we have looked at storing each tree structure as a single document in JSON format and reading/writing it in its entirety; doing materialized paths, where we store the root id with every child and the tree structure above the child as a map; and some form of an adjacency list (which does not appear to be very viable, as looking up the entire tree would be ridiculous).

The hope is to end up with a data model that allows us to display the entire tree quickly, as well as see the entire path to a leaf when selecting that leaf. If anyone has suggestions or experience on how to model such a tree hierarchy, we would greatly appreciate your input.

--
Ben Bromhead
Instaclustr | www.instaclustr.com | @instaclustr | (650) 284 9692