accumulo-user mailing list archives

From dlmar...@comcast.net
Subject Re: Accumulo: "BigTable" vs. "Document Model"
Date Fri, 04 Sep 2015 15:57:57 GMT

Both will work, but I think the answer depends on the amount of data that you will be querying
over and your query latency requirements. I would add the wikisearch[1] storage scheme
to your list as well (k/v table + indices). Then, personally, I would rank them in the following
order as database size grows and the required query latency shrinks:

1. Document Model 
2. K/V Model 
3. K/V Model with indices (wikisearch) 

[1] https://accumulo.apache.org/example/wikisearch.html 

----- Original Message -----

From: "Michael Moss" <michael.moss@gmail.com> 
To: user@accumulo.apache.org 
Sent: Friday, September 4, 2015 11:42:20 AM 
Subject: Accumulo: "BigTable" vs. "Document Model" 

Hello, everyone. 

I'd love to hear folks' input on using the "natural" data model of Accumulo ("BigTable" style)
vs more of a Document Model. I'll try to succinctly describe with a contrived example. 

Let's say I have one domain object I'd like to model, "SensorReadings". A single entry might
look something like the following with 4 distinct CF, CQ pairs. 

RowKey: DeviceID-YYYYMMDD-ReadingID (e.g. 1-20150101-1234) 
CF: "Meta", CQ: "Timestamp", Value: <Some timestamp> 
CF: "Sensor", CQ: "Temperature", Value: 80.4 
CF: "Sensor", CQ: "Humidity", Value: 40.2 
CF: "Sensor", CQ: "Barometer", Value: 29.1 

I might do queries like "get me all SensorReadings for 2015 for DeviceID = 1" and if I wanted
to operate on each SensorReading as a single unit (and not as the 4 key/value entries it returns
for each one), I'd either have to aggregate the 4 CF, CQ pairs for each RowKey client side, or
use something like the WholeRowIterator. 
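
To make the client-side option concrete, here is a minimal sketch in plain Java. It does not use the Accumulo API: the `Entry` class is a hypothetical stand-in for the key/value pairs a scan would return, and `groupByRow` is an illustrative helper name, not an existing library call. It only assumes the entries arrive sorted by RowKey, as a scan would deliver them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowGrouping {
    /** Hypothetical stand-in for one key/value entry returned by a scan. */
    static final class Entry {
        final String row, cf, cq, value;

        Entry(String row, String cf, String cq, String value) {
            this.row = row;
            this.cf = cf;
            this.cq = cq;
            this.value = value;
        }
    }

    /** Groups entries into row -> ("CF:CQ" -> value), preserving arrival order. */
    static Map<String, Map<String, String>> groupByRow(List<Entry> entries) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (Entry e : entries) {
            rows.computeIfAbsent(e.row, r -> new LinkedHashMap<>())
                .put(e.cf + ":" + e.cq, e.value);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Entry> scan = new ArrayList<>();
        scan.add(new Entry("1-20150101-1234", "Meta", "Timestamp", "1420070400"));
        scan.add(new Entry("1-20150101-1234", "Sensor", "Barometer", "29.1"));
        scan.add(new Entry("1-20150101-1234", "Sensor", "Humidity", "40.2"));
        scan.add(new Entry("1-20150101-1234", "Sensor", "Temperature", "80.4"));

        // Each map value is one SensorReading assembled from its 4 entries.
        for (Map.Entry<String, Map<String, String>> reading : groupByRow(scan).entrySet()) {
            System.out.println(reading.getKey() + " -> " + reading.getValue());
        }
    }
}
```
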

In addition, if I wanted to write a query like, "for DeviceID = 1 in 2015, return me SensorReadings
where Temperature > 90, Humidity < 40, Barometer > 31", I'd again have to either
use the WholeRowIterator to 'see' each entire SensorReading in memory on the server for the
compound query, or I could take the intersection of the results of 3 parallel, independent
queries on the client side. 
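
The two options for the compound query can be sketched as follows, again in plain Java without the Accumulo API. `matches` stands for the check you would run once a whole SensorReading is visible in one place, and `intersect` stands for combining the RowKeys returned by three independent queries. Both method names and the `Map<String, Double>` representation of a reading are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CompoundQuery {
    /** Option 1: test all three conditions against one fully assembled reading. */
    static boolean matches(Map<String, Double> reading) {
        Double t = reading.get("Temperature");
        Double h = reading.get("Humidity");
        Double b = reading.get("Barometer");
        // A reading that lacks any of the three fields cannot satisfy the query.
        return t != null && h != null && b != null && t > 90 && h < 40 && b > 31;
    }

    /** Option 2: intersect the RowKeys returned by three independent queries. */
    static Set<String> intersect(Set<String> hot, Set<String> dry, Set<String> highPressure) {
        Set<String> result = new TreeSet<>(hot); // copy so the inputs stay untouched
        result.retainAll(dry);
        result.retainAll(highPressure);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> reading = new HashMap<>();
        reading.put("Temperature", 95.0);
        reading.put("Humidity", 35.5);
        reading.put("Barometer", 31.2);
        System.out.println(matches(reading));

        Set<String> hot = new TreeSet<>(Arrays.asList("1-20150101-1234", "1-20150102-2000"));
        Set<String> dry = new TreeSet<>(Arrays.asList("1-20150102-2000", "1-20150103-3000"));
        Set<String> high = new TreeSet<>(Arrays.asList("1-20150102-2000"));
        System.out.println(intersect(hot, dry, high));
    }
}
```

Option 2 moves three result sets to the client before discarding most of them, which is the cost being weighed against holding each row in memory on the server.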

Where I am going with this is, what are the thoughts around creating a Java, Protobuf, Avro
(etc) object with these 4 CF, CQ pairs as fields and storing each SensorReading as a single
'Document'? 

RowKey: DeviceID-YYYYMMDD 
CF: ReadingID, Value: Protobuf(Timestamp=123, Temperature=80.4, Humidity=40.2, Barometer=29.1) 

This way you avoid having to use the WholeRowIterator, and unless you often have queries that
only look at a tiny subset of your fields (let's say just "Temperature"), the serialization
costs seem similar since Value is just bytes anyway. 
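
As a rough illustration of the 'Document' idea, the sketch below packs the four fields into one byte array that could be stored as a single Value. To stay self-contained it swaps in a fixed-width `java.nio.ByteBuffer` layout instead of Protobuf or Avro, so treat the encoding itself as a placeholder; the class name `SensorReadingDoc` and its methods are hypothetical.

```java
import java.nio.ByteBuffer;

class SensorReadingDoc {
    final long timestamp;
    final double temperature, humidity, barometer;

    SensorReadingDoc(long timestamp, double temperature, double humidity, double barometer) {
        this.timestamp = timestamp;
        this.temperature = temperature;
        this.humidity = humidity;
        this.barometer = barometer;
    }

    /** Encodes all four fields into one value: 8-byte long plus three 8-byte doubles. */
    byte[] toBytes() {
        return ByteBuffer.allocate(Long.BYTES + 3 * Double.BYTES)
            .putLong(timestamp)
            .putDouble(temperature)
            .putDouble(humidity)
            .putDouble(barometer)
            .array();
    }

    /** Decodes a value written by toBytes(); fields are read in the order they were written. */
    static SensorReadingDoc fromBytes(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        return new SensorReadingDoc(in.getLong(), in.getDouble(), in.getDouble(), in.getDouble());
    }

    public static void main(String[] args) {
        byte[] value = new SensorReadingDoc(123L, 80.4, 40.2, 29.1).toBytes();
        SensorReadingDoc copy = SensorReadingDoc.fromBytes(value);
        System.out.println(value.length + " bytes, temperature=" + copy.temperature);
    }
}
```

Note the trade-off this makes visible: a query that only needs Temperature still has to read and decode the whole value.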

Appreciate folks' experience and wisdom here. Hope this makes sense, happy to clarify. 

Best. 

-Mike 





