incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wendall Cada <wenda...@83864.com>
Subject Re: Tree like structures in CouchDB
Date Fri, 22 Feb 2013 21:22:45 GMT
One more thing. I think it's a very good idea to review View Collation ( 
http://wiki.apache.org/couchdb/View_collation ) to better understand how 
view indexes are sorted. When starting to use CouchDB, it's not obvious 
that you can use arrays or objects as keys. Understanding how to 
leverage arrays for creating view indexes will change how you use 
CouchDB (for the better).

Wendall

On 02/22/2013 01:13 PM, Wendall Cada wrote:
> I wanted to give my feedback about what I've learned in this area.
>
> First, I don't use the doc _id at all for sorting docs. It solves one 
> single use-case, but fails if you have others, so instead, I do this:
>
> Every doc, whether the parent or child has identifying information. So 
> a child might contain the parent id, thread id, etc. Parent doesn't 
> need to know about it's children so it doesn't matter, as those can be 
> pulled in a single view query.
>
> Say I want to do something as originally stated, I'd create a view 
> where I emit([parent_id, next_level_id, next_level_id], null) with 
> default values for the latter nested levels being 0 by default. When I 
> query the view, I get back a result set that would look like the 
> following.
> [
> {"id":"0f1e244b14452a884f3dfa5b4086f793","key":[1, 0, 
> 0],"value":null}, <- parent
> {"id":"27f4c6bb9bcaad331e68f80629bffa6e","key":[1, 1, 
> 0],"value":null}, <- first level
> {"id":"46c17a23254c2dcce0860b4c398e0009","key":[1, 1, 
> 1],"value":null}, <- first item in first level
> {"id":"95903e4c2e2cbb5e2dfbc934adf6095f","key":[1, 1, 2],"value":null} 
> <- second item in first level
> ]
>
> To pull the entire thread based on the parent query is simply 
> startkey=[1,0,0]&endkey=[1,{}]
>
> The advantage of this approach is simply that say I want to display a 
> list of all posts by user for a specific thread, I can create a view 
> where I emit([parent_id, user_id, comment_id], null)
>
> This gives the ability to pull a specific comment for a user based on 
> user_id and thread_id, or an entire list of comments based on user_id. 
> These sorts of indexes are very cheap and flexible. You never have to 
> mess with creating your own custom id system. Of course, the tradeoff 
> is that you have to do your own conflict resolution for async 
> operations with thread ids if you want them to increment. Better 
> solution here is to use both timestamp and user_id for the actual 
> comment to ensure it is unique and still sorts well.
>
> Wendall
>
>
> On 02/22/2013 11:29 AM, Алекс Zatvornitskiy wrote:
>> @Jim Klo suggested a nice solution. A document centric database implies
>> embedding what you need inside a document. However it still depends from
>> your actual environment/scope/specific
>>
>>
>> On Fri, Feb 22, 2013 at 8:04 PM, Jim Klo <jim.klo@sri.com> wrote:
>>
>>> interesting discussion!
>>>
>>> jim, yes, in my case nesting won't be very deep so i've been thinking
>>> about your suggestion:
>>>
>>> 1 : orig message
>>> 1-2 : reply to 1
>>> 1-2-5 : reply to 2
>>> 1-3 : reply to 1
>>> 1-3-4 : reply to
>>>
>>> you say "1-3-4" is an ID, right?
>>>
>>>
>>> Yes
>>>
>>> so tell me, if the message with the ID "1-3-4" wants to find out who 
>>> his
>>> parent is, then we'd have to parse the ID, remove one digit and 
>>> query the
>>> database for "1-3" to get the parent. do you think this is optimal 
>>> for the
>>> couchdb?
>>>
>>>
>>> You could do it this way; I've only outlined a strategy for ID's that
>>> sorts in a hierarchal fashion, not a way to determine ancestry. I'd
>>> probably suggest just storing the parent id in the child - depends upon
>>> your actual use case.
>>>
>>> Another alternative, but would require the use of a view, is to 
>>> store the
>>> parent ancestry as an array in each document instead of using a 
>>> composite
>>> id ( I believe others have suggested this too):
>>>
>>> { id:1, ancestry: []}
>>> { id:2, ancestry: [1]}
>>> { id:5, ancestry: [1, 2]}
>>> { id:3, ancestry: [1]}
>>> { id:4, ancestry: [1, 3]}
>>> { id:6, ancestry: [1, 2]}
>>> { id:7, ancestry: [1, 3]}
>>>
>>> then a map function:
>>>
>>> function(doc) {
>>> emit(doc.ancestry.concat([doc.id]), null);
>>> }
>>>
>>> and just use the built-in _count reduce if you need to count the 
>>> number of
>>> nodes in a tree (or sub tree)
>>>
>>> This will create a view that sorts in the same way as the id, but 
>>> using an
>>> array instead of a string. This view will look something like, 
>>> noting that
>>> the key is what's sorted:
>>>
>>> {
>>> rows: [
>>> { id: 1, key: [ 1 ], value= null},
>>> { id: 2, key: [ 1, 2 ], value= null},
>>> { id: 5, key: [ 1, 2, 5 ], value= null},
>>> { id: 6, key: [ 1, 2, 6 ], value= null},
>>> { id: 3, key: [ 1, 3 ], value= null},
>>> { id: 4, key: [ 1, 3, 4 ], value= null},
>>> { id: 7, key: [ 1, 3, 7 ], value= null},
>>> ]
>>> }
>>>
>>> you can then query an entire tree with startkey=[1]&endkey=[1,{}] which
>>> will give you all the descendants for the one tree with one query.
>>>   fetching a sub tree works similarly: startkey=[1,2]&endkey=[1,2,{}]
>>>
>>> and you will have equivalent functionality to using a composite id with
>>> possibly a simpler to comprehend mechanism for building queries and
>>>   creating ids at the expense of creating a view...
>>>
>>> you'll probably also want a view that just lists root messages too… 
>>> that
>>> would be:
>>>
>>> function(doc) {
>>> if (doc.ancestry.length == 0)
>>> emit(doc.id, null);
>>> }
>>>
>>> - Jim
>>>
>>>
>>> *
>>> Jim Klo
>>> Senior Software Engineer
>>> Center for Software Engineering
>>> SRI International
>>> *
>>> *
>>> t. @nsomnac
>>> *
>>>
>>>
>>> On 22/02/13 20:57, Jim Klo wrote:
>>>
>>> On Feb 21, 2013, at 11:31 PM, "Elisiano Petrini" <elisiano@gmail.com>
>>> wrote:
>>>
>>> I understand the concept, but on the same level there could be more 
>>> than
>>> one child. Using this as an ID would work only if you put something
>>> unique to the node in composing the ID.
>>>
>>> I don't recall multiple nodes being a constraint in the problem, 
>>> even then
>>> it's solvable using sharded ID's - the reality is there's always finite
>>> number of nodes. The only thing that changes is that tree won't sort
>>> naturally in the order they arrive. Remember, the parent node has to 
>>> exist
>>> everywhere for a collision to occur.
>>>
>>> Generating a sequential ID (per tree level) depending on the order of
>>> arrival might lead to collisions (if messages arrive at the same time).
>>>
>>> Sure, but only if you have a badly designed sequence generators that 
>>> are
>>> not designed to operate in a sharded manner. A simple way is to
>>> pre-allocate a pool of ID's to a process (say 500 ID's per process). No
>>> other process should be able to use another process' allocation. Once
>>> exhausted, the process fetches another pool of IDs. This may not be 
>>> ideal
>>> in a mobile scenario, but would probably still work, you just might 
>>> have to
>>> devise a smarter allocation algorithm based upon usage and connectivity
>>> frequencies. (And now I've really digressed) This is a very common 
>>> solution
>>> done in the RDBMS world to combat ID conflicts.
>>>
>>>
>>> Anyhow, I feel we're digressing here :)
>>>
>>> It is worth discussion as not everyone is experienced in massively
>>> parallel systems. :)
>>>
>>> If the originator has only a single node - they shouldn't have any
>>> conflict issues with the simplest solution.
>>>
>>>
>>>
>>> On Fri, 2013-02-22 at 16:23 +0900, svilen wrote:
>>>
>>> that's the path to root stored as id
>>> so answers to 1-2 will be say 1-2-5 1-2-6 1-2-33
>>>
>>> Hi Kim,
>>>     I don't think using this kind of documents' ids is a viable
>>> solution: what if there is more that one message in response to
>>> another (ie:several people answering to the same message)?
>>>
>>> On Fri, 2013-02-22 at 07:00 +0000, Jim Klo wrote:
>>>
>>> In theory you mention infinite depth, but is that realistic?
>>>
>>> A simple way is to make ID's a composite serial and chain them
>>> together.
>>>
>>> 1 : orig message
>>> 1-2 : reply to 1
>>> 1-2-5 : reply to 2
>>> 1-3 : reply to 1
>>> 1-3-4 : reply to
>>>
>>> The keys will then sort in order, such that can build the tree DFS
>>> and use range queries to get just segments.
>>>
>>> You can obviously do some creative things in compression - using
>>> different number bases to handle deep trees.
>>>
>>>
>>>
>>> On Feb 21, 2013, at 10:10 PM, "Michael Heuberger"
>>> <michael.heuberger@binarykitchen.com> wrote:
>>>
>>> andrey, svilen, many thanks again.
>>>
>>> agreed, storing immediate children in a document is a bad idea. i
>>> think i go for storing only an immediate parent id to get an
>>> infinite tree depth.
>>>
>>> thanks for your inspirations guys
>>>
>>> cheers
>>> michael
>>>
>>> On 22/02/13 18:49, Andrey Kuprianov wrote:
>>>
>>> PS - storing only an immediate parent id would be the solution
>>> for an infinite tree depth.
>>>
>>>
>>> On Fri, Feb 22, 2013 at 1:48 PM, Andrey Kuprianov <
>>> andrey.kouprianov@gmail.com> wrote:
>>>
>>> Storing immediate children in a parent is a bad idea in a sense
>>> that you will never be able to traverse your tree back to the
>>> root, given a child node.
>>>
>>> Storing all in one doc - you will run out of space before you
>>> know it. Documents have a finite length and the larger the
>>> document, the slower it will get to serialize and unserialize
>>> it.
>>>
>>> Storing just an immediate parent in children might result in a
>>> costly multiple read operations, if you want to get root of the
>>> tree given an arbitrary child.
>>>
>>> Therefore the only feasible solution is to store references to
>>> the parent chain - i.e. ancestors - starting from an immediate
>>> parent and ending with a tree root.
>>>
>>>
>>> Here's a simple calculation for ancestors approach:
>>>
>>> Suppose the maximum document size is 1MB. A uuid is 32
>>> characters in length + taking into account commas and brackets,
>>> then the depth of your tree (I am not taking content into
>>> account) could be roughly:
>>>
>>> ((32 + 1)*n  + 2 - 1) = 1024 * 1024  ==> n = 31775 levels
>>>
>>> However, max document size is configurable from the settings
>>> (as of CouchDB 1.2) and could easily be up to 4GB, giving you
>>> over 130mil of levels. Should be enough... :)
>>>
>>>
>>>
>>> On Fri, Feb 22, 2013 at 1:14 PM, Michael Heuberger <
>>> michael.heuberger@binarykitchen.com> wrote:
>>>
>>> thanks andrey and svil
>>>
>>> depth is infinite. in theory a message can have million
>>> replies forth and back. similar like a very long email thread.
>>>
>>> i think storing ancestors ą la andrey would be too much hassle
>>> but works well for few levels.
>>>
>>> svil, what did you mean with dicts and with keeping all
>>> children in one doc? can you tell me more?
>>>
>>> thanks,
>>> michael
>>>
>>>
>>> On 22/02/13 17:27, svilen wrote:
>>>
>>> my rough guess:
>>>
>>> if it's finite depth, then all in one doc:
>>> - list of (item or (list of ...))
>>> - or same with dicts
>>>
>>> else, one doc per message keeping just link-to-parent,
>>> or multiple links-to-grand...grand-parents and/or root.
>>> similar to the strategies in SQL - nested etc.
>>> keeping all chidlren of same node in one doc is also
>>> possible..
>>>
>>> in any case either traversal or storing or both will be
>>> difficult.
>>>
>>> ciao
>>> svil
>>>
>>> On Fri, 22 Feb 2013 17:13:01 +1300
>>> Michael Heuberger
>>> <michael.heuberger@**binarykitchen.com<michael.heuberger@binarykitchen.com

>>>
>>> wrote:
>>>
>>> Hello guys
>>>
>>> I'd like to store messages in a tree like structure.
>>> Whenever you reply to a message, then the original message
>>> becomes the parent message.
>>>
>>> How would you implement something like this in CouchDB? Just
>>> curious and need a little guidance + advice here.
>>>
>>> Thanks,
>>> Michael
>>>
>>> -- 
>>>
>>> Binary Kitchen
>>> Michael Heuberger
>>> 4c Dunbar Road
>>> Mt Eden
>>> Auckland 1024
>>> (New Zealand)
>>>
>>> Mobile (text only) ...  +64 21 261 89 81
>>> Email ................  michael@binarykitchen.com
>>> Website ..............  http://www.binarykitchen.com
>>>
>>> -- 
>>>
>>> Binary Kitchen
>>> Michael Heuberger
>>> 4c Dunbar Road
>>> Mt Eden
>>> Auckland 1024
>>> (New Zealand)
>>>
>>> Mobile (text only) ...  +64 21 261 89 81
>>> Email ................  michael@binarykitchen.com
>>> Website ..............  http://www.binarykitchen.com
>>>
>>> -- 
>>>
>>> Binary Kitchen
>>> Michael Heuberger
>>> 4c Dunbar Road
>>> Mt Eden
>>> Auckland 1024
>>> (New Zealand)
>>>
>>> Mobile (text only) ...  +64 21 261 89 81
>>> Email ................  michael@binarykitchen.com
>>> Website ..............  http://www.binarykitchen.com
>>>
>>>
>>>
>>> -- 
>>>
>>> Binary Kitchen
>>> Michael Heuberger
>>> 4c Dunbar Road
>>> Mt Eden
>>> Auckland 1024
>>> (New Zealand)
>>>
>>> Mobile (text only) ...  +64 21 261 89 81
>>> Email ................  michael@binarykitchen.com
>>> Website ..............  http://www.binarykitchen.com
>>>
>>>
>>>
>


Mime
View raw message