On Sun, Dec 27, 2009 at 11:38 AM, Mark Robson wrote:
> 2009/12/27 August Zajonc
>>
>> Looking at the data model a simple solution is two column families,
>> one containing items as the row-key with tags as columns, and a second
>> with tags as the row-key with items as columns. This gives me fast
>> access at the cost of 2x the writes (cheap) and storage (also cheap).
>> So not bad.
>
> I think this is the normal model.
>
> However, there is no need to put them in separate column-families, you could
> simply use non-overlapping keys.

Got it. One thing I wasn't sure of is whether that buys me a way to
atomically update the index to maintain consistency. I don't think it does.

>
> There is however, a scalability problem when you have a single tag with a
> very large number of items, or vice versa, that you will have a lot of
> columns in a single CF / key. As this needs to be held in the ram of a node
> during a query (and possibly other operations), it will blow the memory
> usage up.

Got it. Part of this depends on the metadata overhead to store a column.
Clearly Name, Value, and Timestamp are part of it, but is there anything
else in terms of storage / memory overhead per column I should be thinking
of when I consider how many columns are reasonable to fit in a single
CF / Key?

Cheers,

- August

> I guess the solution may be to create a number of different keys for the
> same tag.
>
> In any case, querying a very large number of items is problematic - the user
> will not usually want them all, so you'd need to prioritise them somehow
> anyway, so it might be sufficient to only store the "highest priority" items
> against a single tag key (and have other keys for the lower priority ones).
> How you define priority is application-specific.
>
> Mark
>

--
August Consulting
PO Box 410384
San Francisco, CA 94141
415-358-1850 (p)
415-354-8383 (f)
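
For reference, a minimal sketch of the model discussed in this thread, using plain Python dicts as stand-ins for the two column families. This is not Cassandra client code: the class name `TagIndex`, the `bucket_count` parameter, and the `tag:bucket` key format are all hypothetical, chosen only to illustrate the double-write index and the idea of spreading one tag over several keys.

```python
import zlib
from collections import defaultdict


class TagIndex:
    """In-memory stand-in for the two indexes (all names are hypothetical)."""

    def __init__(self, bucket_count=4):
        self.bucket_count = bucket_count
        # Row key: item -> columns: {tag: timestamp}
        self.item_tags = defaultdict(dict)
        # Row key: "tag:bucket" -> columns: {item: timestamp}
        self.tag_items = defaultdict(dict)

    def _bucket_key(self, tag, item):
        # Deterministic bucket so the same item always lands on the same key.
        bucket = zlib.crc32(item.encode("utf-8")) % self.bucket_count
        return f"{tag}:{bucket}"

    def add(self, item, tag, timestamp):
        # Two independent writes; nothing here makes them atomic, so a
        # failure between them would leave the two indexes inconsistent.
        self.item_tags[item][tag] = timestamp
        self.tag_items[self._bucket_key(tag, item)][item] = timestamp

    def tags_for(self, item):
        return sorted(self.item_tags.get(item, {}))

    def items_for(self, tag):
        # Reading a tag means visiting every bucket key for that tag.
        items = []
        for bucket in range(self.bucket_count):
            items.extend(self.tag_items.get(f"{tag}:{bucket}", {}))
        return sorted(items)
```

Splitting a tag across `bucket_count` keys caps the number of columns under any one key at the cost of extra reads per tag lookup; a priority-based split, as Mark suggests, would replace the hash in `_bucket_key` with an application-specific rule.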