accumulo-user mailing list archives

From "Cardon, Tejay E"
Subject RE: EXTERNAL: Re: Table design
Date Wed, 21 Mar 2012 18:35:01 GMT
Thanks Eric.  Just to make sure I understood correctly:
If I have many (say 5+) locality groups, that would be bad for performance, but if I have
2 locality groups with 10+ column families each, that would not be a major issue?


From: Eric Newton
Sent: Wednesday, March 21, 2012 12:16 PM
Subject: EXTERNAL: Re: Table design

In Accumulo, there are no limits on the number or size of column families.  However, if you do
want to group them into separate locality groups, you need to list the column families for
each group.  This list has to be storable in ZooKeeper, so groups should be limited to "dozens"
of column families.  Reading from different groups at the same time will use more resources,
so, as with HBase, you should limit the number of groups you have.
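
As a rough illustration of the grouping described above, here is a minimal Python sketch. It is a toy model, not Accumulo's API: the names `groups_touched`, `build_family_index` and `DEFAULT_GROUP` and the example family names are hypothetical. It assumes each column family is listed in at most one group and that unlisted families fall into a default group, and it counts how many groups a read has to open.

```python
# Toy model of locality groups (hypothetical names, not Accumulo's API).
DEFAULT_GROUP = "<default>"  # assumed home of families not listed in any group


def build_family_index(locality_groups):
    """Map each column family to its group, rejecting families listed twice."""
    index = {}
    for group, families in locality_groups.items():
        for family in families:
            if family in index:
                raise ValueError(
                    f"{family!r} is listed in both {index[family]!r} and {group!r}"
                )
            index[family] = group
    return index


def groups_touched(locality_groups, fetched_families):
    """Return the set of groups a read of the given families must open."""
    index = build_family_index(locality_groups)
    return {index.get(family, DEFAULT_GROUP) for family in fetched_families}


groups = {
    "meta": {"title", "author", "date"},
    "body": {"text", "tokens"},
}

# Families from one group: a single group is read.
print(sorted(groups_touched(groups, ["title", "author"])))
# Families spread across groups, plus an unlisted one: three groups are read.
print(sorted(groups_touched(groups, ["title", "text", "other"])))
```

In this model, a table with two groups of ten or more families each still opens at most three groups per read, whereas many small groups raise that ceiling.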

The RFile format takes advantage of the similarity of data between keys, and does not repeat
elements of the key that are identical from key to key.  If everything has the same visibility,
it will only be listed once.
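
To make the idea concrete, here is a minimal Python sketch of this kind of key encoding. It is a simplification for illustration only, not the actual RFile on-disk format: the functions `encode` and `decode` and the use of `None` as a "same as previous key" marker are assumptions of the sketch.

```python
# Simplified illustration only; this is not the real RFile encoding.
# A key is a tuple of strings: (row, column family, column qualifier, visibility).


def encode(keys):
    """Replace each field that equals the previous key's field with None."""
    encoded, prev = [], None
    for key in keys:
        if prev is None:
            encoded.append(tuple(key))  # the first key is stored in full
        else:
            encoded.append(
                tuple(None if cur == old else cur for cur, old in zip(key, prev))
            )
        prev = key
    return encoded


def decode(encoded):
    """Rebuild full keys by copying omitted fields from the previous key."""
    keys, prev = [], None
    for entry in encoded:
        if prev is None:
            key = tuple(entry)
        else:
            key = tuple(old if cur is None else cur for cur, old in zip(entry, prev))
        keys.append(key)
        prev = key
    return keys


keys = [
    ("r1", "cf", "a", "public"),
    ("r1", "cf", "b", "public"),
    ("r2", "cf", "a", "public"),
]
encoded = encode(keys)
stored_visibilities = sum(1 for entry in encoded if entry[3] is not None)
print(encoded)
print(stored_visibilities)
```

With every key sharing one visibility, only the first encoded entry carries the visibility string, which is why a long label costs little when it repeats from key to key.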

And, when I say there is "no limit"... there is no predefined limit, but rows, cf, cq, visibilities
and values all need to comfortably fit in the physical RAM available, perhaps multiple times,
as they are serialized and deserialized in the various services.

As for table design... it depends a great deal on what you want to do.

Here is a short description of a complex indexing scheme that makes it efficient to do distributed
conjunctive queries on documents:

It makes it possible to do fast searches for queries like 'TITLE matches "f.*bar" and contains
the words "catch" and "22"'.


On Wed, Mar 21, 2012 at 12:43 PM, Cardon, Tejay E wrote:
Thank you ahead of time for the input.

When designing tables in HBase, one is encouraged to use single letter names for column families,
and to only have 2 or 3 families.  The documentation states that this has to do with the underlying
way that the data is stored on disk.  I'm curious if similar considerations need to be made
with Accumulo.

Furthermore, and more specific to Accumulo, what considerations should be made for visibility
labels?  If the visibility string for each cell is stored on disk along with the data in the
cell, I could see where both long role names and large combinations of roles could have a
major impact on disk utilization.

Finally, can anyone recommend a good resource for Accumulo table design (or for key/value
store design in general)?

Tejay Cardon

