accumulo-user mailing list archives

From David Medinets <david.medin...@gmail.com>
Subject Re: Accumulo: "BigTable" vs. "Document Model"
Date Fri, 04 Sep 2015 20:16:46 GMT
+1 for Eric's suggestion. I used this technique, and it seemed to work nicely.
When storing Protobuf, JSON, or any other 'document', remember to factor in
the parsing needed during iteration. This affects both CPU and memory
requirements on the tservers.
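
As a rough illustration of that parsing cost, here is a minimal Java sketch of
the compound predicate from the original question (Temperature > 90,
Humidity < 40, Barometer > 31) applied to a single serialized document. It is
a sketch under assumptions, not code from this thread: `SensorReading` and
`SensorDocFilter` are hypothetical names, and a fixed-width `ByteBuffer`
layout stands in for Protobuf so the example needs no external libraries. In a
real deployment this check would presumably live in the `accept(Key, Value)`
method of a class extending Accumulo's `Filter` iterator.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a Protobuf SensorReading message: four fixed-width
// fields packed into a byte[] (assumption made to avoid external dependencies).
final class SensorReading {
    final long timestamp;
    final double temperature;
    final double humidity;
    final double barometer;

    SensorReading(long timestamp, double temperature, double humidity, double barometer) {
        this.timestamp = timestamp;
        this.temperature = temperature;
        this.humidity = humidity;
        this.barometer = barometer;
    }

    // Serialize to the bytes that would be stored as the cell's value.
    byte[] encode() {
        return ByteBuffer.allocate(Long.BYTES + 3 * Double.BYTES)
                .putLong(timestamp)
                .putDouble(temperature)
                .putDouble(humidity)
                .putDouble(barometer)
                .array();
    }

    // Deserialize a stored value; fields are read in the order they were written.
    static SensorReading decode(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        return new SensorReading(buf.getLong(), buf.getDouble(), buf.getDouble(), buf.getDouble());
    }
}

// Hypothetical filter logic: decides whether one stored document matches.
final class SensorDocFilter {
    static boolean accept(byte[] value) {
        // The whole document is parsed before any field can be tested.
        SensorReading r = SensorReading.decode(value);
        return r.temperature > 90 && r.humidity < 40 && r.barometer > 31;
    }
}
```

Note that `decode` runs for every entry the scan touches, whether or not it
matches, which is where the extra CPU and memory on the tservers comes from.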

On Fri, Sep 4, 2015 at 11:53 AM, Eric Newton <eric.newton@gmail.com> wrote:

> You could use a server-side iterator that does the filtering on the
> server, and returns a protobuf value for matching rows.
>
> -Eric
>
>
> On Fri, Sep 4, 2015 at 11:42 AM, Michael Moss <michael.moss@gmail.com>
> wrote:
>
>> Hello, everyone.
>>
>> I'd love to hear folks' input on using the "natural" data model of
>> Accumulo ("BigTable" style) vs more of a Document Model. I'll try to
>> succinctly describe with a contrived example.
>>
>> Let's say I have one domain object I'd like to model, "SensorReadings". A
>> single entry might look something like the following with 4 distinct CF, CQ
>> pairs.
>>
>> RowKey: DeviceID-YYYYMMDD-ReadingID (e.g. - 1-20150101-1234)
>> CF: "Meta", CQ: "Timestamp", Value: <Some timestamp>
>> CF: "Sensor", CQ: "Temperature", Value: 80.4
>> CF: "Sensor", CQ: "Humidity", Value: 40.2
>> CF: "Sensor", CQ: "Barometer", Value: 29.1
>>
>> I might do queries like "get me all SensorReadings for 2015 for DeviceID
>> = 1" and if I wanted to operate on each SensorReading as a single unit (and
>> not as the 4 'rows' it returns for each one), I'd either have to aggregate
>> the 4 CF, CQ pairs for each RowKey client side, or use something like the
>> WholeRowIterator.
>>
>> In addition, if I wanted to write a query like, "for DeviceID = 1 in
>> 2015, return me SensorReadings where Temperature > 90, Humidity < 40,
>> Barometer > 31", I'd again have to either use the WholeRowIterator to 'see'
>> each entire SensorReading in memory on the server for the compound query,
>> or I could take the intersection of the results of 3 parallel, independent
>> queries on the client side.
>>
>> Where I am going with this is, what are the thoughts around creating a
>> Java, Protobuf, Avro (etc) object with these 4 CF, CQ pairs as fields and
>> storing each SensorReading as a single 'Document'?
>>
>> RowKey: DeviceID-YYYYMMDD
>> CF: ReadingID Value: Protobuf(Timestamp=123, Temperature=80.4,
>> Humidity=40.2, Barometer = 29.1)
>>
>> This way you avoid having to use the WholeRowIterator and unless you
>> often have queries that only look at a tiny subset of your fields (let's
>> say just "Temperature"), the serialization costs seem similar since Value
>> is just bytes anyway.
>>
>> Appreciate folks' experience and wisdom here. Hope this makes sense,
>> happy to clarify.
>>
>> Best.
>>
>> -Mike
>
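
For comparison, the client-side aggregation described in the original question
(collapsing the four CF, CQ pairs of one RowKey into a single unit) can be
sketched as follows. This is an illustrative sketch only: `CellEntry` and
`RowGrouper` are hypothetical names, values are kept as strings for brevity,
and the input is a plain list rather than an Accumulo scanner.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical flattened view of one key/value entry returned by a scan.
final class CellEntry {
    final String row;
    final String family;
    final String qualifier;
    final String value;

    CellEntry(String row, String family, String qualifier, String value) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.value = value;
    }
}

final class RowGrouper {
    // Collects entries into one map per RowKey, keyed by "CF:CQ".
    // Insertion order is preserved, so rows come out in scan order.
    static Map<String, Map<String, String>> groupByRow(List<CellEntry> entries) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (CellEntry e : entries) {
            rows.computeIfAbsent(e.row, k -> new LinkedHashMap<>())
                .put(e.family + ":" + e.qualifier, e.value);
        }
        return rows;
    }
}
```

Each resulting inner map holds one complete SensorReading, which is roughly
what the WholeRowIterator would hand back per row on the server side.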
