accumulo-user mailing list archives

From Adam Fuchs
Subject Re: Accumulo: "BigTable" vs. "Document Model"
Date Fri, 04 Sep 2015 21:07:13 GMT
Sqrrl uses a hybrid approach. For records that are relatively static we use
a compacted form, but for maintaining aggregates and for updating
compacted-form documents we use a more explicit form. This is done
mostly through iterators and a fairly complex type system. The big
trade-off for us was storage footprint. We gain something like 30% more
compression by using the compacted form, and that also translates into
better ingest and query performance. I can tell you it takes a significant
engineering investment to make this work without overspecializing, so make
sure your use case warrants it.


On Fri, Sep 4, 2015 at 11:42 AM, Michael Moss wrote:

> Hello, everyone.
> I'd love to hear folks' input on using the "natural" data model of
> Accumulo ("BigTable" style) vs more of a Document Model. I'll try to
> succinctly describe with a contrived example.
> Let's say I have one domain object I'd like to model, "SensorReadings". A
> single entry might look something like the following with 4 distinct CF, CQ
> pairs.
> RowKey: DeviceID-YYYYMMDD-ReadingID (e.g. 1-20150101-1234)
> CF: "Meta", CQ: "Timestamp", Value: <Some timestamp>
> CF: "Sensor", CQ: "Temperature", Value: 80.4
> CF: "Sensor", CQ: "Humidity", Value: 40.2
> CF: "Sensor", CQ: "Barometer", Value: 29.1
> I might do queries like "get me all SensorReadings for 2015 for DeviceID =
> 1" and if I wanted to operate on each SensorReading as a single unit (and
> not as the 4 'rows' it returns for each one), I'd either have to aggregate
> the 4 CF, CQ pairs for each RowKey client side, or use something like the
> WholeRowIterator.
> In addition, if I wanted to write a query like, "for DeviceID = 1 in 2015,
> return me SensorReadings where Temperature > 90, Humidity < 40, Barometer >
> 31", I'd again have to either use the WholeRowIterator to 'see' each entire
> SensorReading in memory on the server for the compound query, or I could
> take the intersection of the results of 3 parallel, independent queries on
> the client side.
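
The client-side aggregation and compound filter described above can be sketched in plain Python. This is only a model of the idea, not Accumulo client code: the `entries` list stands in for scan results that arrive sorted by row, and the second reading, the timestamp strings, and the helper names `group_rows` and `matches` are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical scan results as (row, column family, column qualifier, value),
# already sorted by row, then family, then qualifier.
entries = [
    ("1-20150101-1234", "Meta", "Timestamp", "2015-01-01T00:00:00Z"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
    ("1-20150102-1235", "Meta", "Timestamp", "2015-01-02T00:00:00Z"),
    ("1-20150102-1235", "Sensor", "Barometer", "31.4"),
    ("1-20150102-1235", "Sensor", "Humidity", "35.0"),
    ("1-20150102-1235", "Sensor", "Temperature", "93.2"),
]


def group_rows(sorted_entries):
    """Collapse consecutive entries that share a row ID into one dict per row."""
    for row, cells in groupby(sorted_entries, key=itemgetter(0)):
        yield row, {(cf, cq): value for _, cf, cq, value in cells}


def matches(reading):
    """Compound predicate: Temperature > 90, Humidity < 40, Barometer > 31."""
    return (
        float(reading[("Sensor", "Temperature")]) > 90
        and float(reading[("Sensor", "Humidity")]) < 40
        and float(reading[("Sensor", "Barometer")]) > 31
    )


# Each SensorReading is evaluated as a single unit once its cells are grouped.
hot_dry = [row for row, reading in group_rows(entries) if matches(reading)]
```

The grouping step is what has to happen somewhere, on the client or inside a server-side iterator, before a predicate can see all fields of one reading together.
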
> Where I am going with this is, what are the thoughts around creating a
> Java, Protobuf, Avro (etc) object with these 4 CF, CQ pairs as fields and
> storing each SensorReading as a single 'Document'?
> RowKey: DeviceID-YYYYMMDD
> CF: ReadingID Value: Protobuf(Timestamp=123, Temperature=80.4,
> Humidity=40.2, Barometer = 29.1)
> This way you avoid having to use the WholeRowIterator and unless you often
> have queries that only look at a tiny subset of your fields (let's say just
> "Temperature"), the serialization costs seem similar since Value is just
> bytes anyway.
> Appreciate folks' experience and wisdom here. Hope this makes sense, happy
> to clarify.
> Best.
> -Mike
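
For comparison, the single-document layout proposed in the quoted message can be modeled the same way. This is a rough sketch under loud assumptions: JSON from the standard library stands in for Protobuf or Avro, a dict keyed by (row, column family) stands in for the table, and the second reading plus the names `encode_reading` and `select` are hypothetical.

```python
import json


def encode_reading(timestamp, temperature, humidity, barometer):
    """Serialize one SensorReading into a single opaque value (JSON as a stand-in)."""
    doc = {
        "Timestamp": timestamp,
        "Temperature": temperature,
        "Humidity": humidity,
        "Barometer": barometer,
    }
    return json.dumps(doc, sort_keys=True).encode("utf-8")


# Row = DeviceID-YYYYMMDD, column family = ReadingID, value = serialized document.
table = {
    ("1-20150101", "1234"): encode_reading(123, 80.4, 40.2, 29.1),
    ("1-20150101", "1235"): encode_reading(456, 93.2, 35.0, 31.4),
}


def select(store, predicate):
    """Decode every document and yield those that satisfy the predicate."""
    for (row, reading_id), value in sorted(store.items()):
        reading = json.loads(value.decode("utf-8"))
        if predicate(reading):
            yield row, reading_id, reading


hits = list(
    select(
        table,
        lambda r: r["Temperature"] > 90 and r["Humidity"] < 40 and r["Barometer"] > 31,
    )
)
```

No grouping step is needed here because each value already holds a whole reading, but every predicate, even one that touches only "Temperature", pays for decoding the full document.
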
