Date: Fri, 4 Sep 2015 15:57:57 +0000 (UTC)
From: dlmarion@comcast.net
To: user@accumulo.apache.org
Subject: Re: Accumulo: "BigTable" vs. "Document Model"

Both will work, but I think the answer depends on the amount of data that you will be querying over and your query latency requirements. I would include the wikisearch[1] storage scheme in your list as well (k/v table + indices); a very rough sketch of that indexed layout follows the link below. Then, personally, I would rate them in the following order as database size increases and query latency requirements tighten:

 1. Document Model
 2. K/V Model
3. K/V Model with indices (wikisearch)

[1] https://accumulo.apache.org/example/wikisearch.html
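To make option 3 concrete, here is roughly what pairing a k/v entry with an index entry looks like in code. The writers, field names, and index encoding below are placeholders and not the actual wikisearch layout (which shards by partition and encodes numbers so they sort correctly); see the link above for the real scheme.

    import static java.nio.charset.StandardCharsets.UTF_8;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    static void writeIndexed(BatchWriter dataWriter, BatchWriter indexWriter) throws Exception {
      // Data entry, matching the layout in the original message below.
      Mutation data = new Mutation("1-20150101-1234");
      data.put("Sensor", "Temperature", new Value("80.4".getBytes(UTF_8)));
      dataWriter.addMutation(data);

      // Index entry: field/value -> pointer back to the data row, so a query on
      // Temperature can scan the index first and fetch only matching readings.
      Mutation index = new Mutation("Sensor.Temperature");
      index.put("80.4", "1-20150101-1234", new Value(new byte[0]));
      indexWriter.addMutation(index);
    }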


From: "Michael Moss= " <michael.moss@gmail.com>
To: user@accumulo.apache.org
Sent: Friday, September 4, 2015 11:42:20 AM
Subject: Accumulo: "BigTable" vs. "Document Model"

Hello, everyone.

I'd love to hear folks' input on using the "natural" data model of Accumulo ("BigTable" style) vs. more of a Document Model. I'll try to describe it succinctly with a contrived example.

Let's say I have one domain object I'd like to model, "SensorReadings". A single entry might look something like the following, with 4 distinct CF, CQ pairs (a rough write sketch in code follows the layout).

RowKey: DeviceID-YYYYMMDD-ReadingID (e.g., 1-20150101-1234)
CF: "Meta", CQ: "Timestamp", Valu= e: <Some timestamp>
CF: "Sensor", CQ: "Temperature", Value:= 80.4
CF: "Sensor", CQ: "Humidity", Value: 40.2
CF: "Sensor", CQ: "Barometer", Value: 29.1
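As a rough write sketch for that layout (the Connector, table name, and timestamp value here are placeholders, not something prescribed by Accumulo):

    import java.nio.charset.StandardCharsets;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    static void writeReading(Connector conn) throws Exception {
      BatchWriter bw = conn.createBatchWriter("sensor_readings", new BatchWriterConfig());
      Mutation m = new Mutation("1-20150101-1234");      // DeviceID-YYYYMMDD-ReadingID
      m.put("Meta", "Timestamp", utf8("1441382277"));
      m.put("Sensor", "Temperature", utf8("80.4"));
      m.put("Sensor", "Humidity", utf8("40.2"));
      m.put("Sensor", "Barometer", utf8("29.1"));
      bw.addMutation(m);                                 // one reading = four key/value entries
      bw.close();
    }

    static Value utf8(String s) {
      return new Value(s.getBytes(StandardCharsets.UTF_8));
    }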

I might do queries like "get me all SensorReadings for 2015 for DeviceID = 1", and if I wanted to operate on each SensorReading as a single unit (and not as the 4 'rows' it returns for each one), I'd either have to aggregate the 4 CF, CQ pairs for each RowKey client side, or use something like the WholeRowIterator.
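A minimal sketch of the WholeRowIterator route (Connector and table name are placeholders again):

    import java.util.Map;
    import java.util.SortedMap;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.user.WholeRowIterator;
    import org.apache.accumulo.core.security.Authorizations;

    static void scanDevice1For2015(Connector conn) throws Exception {
      Scanner scanner = conn.createScanner("sensor_readings", Authorizations.EMPTY);
      // Rows are DeviceID-YYYYMMDD-ReadingID, so all of 2015 for device 1 is one range.
      scanner.setRange(new Range("1-20150101", "1-20160101"));
      scanner.addScanIterator(new IteratorSetting(50, "wri", WholeRowIterator.class));
      for (Map.Entry<Key, Value> entry : scanner) {
        // Each entry is now one whole row; decode it back into its CF, CQ pairs.
        SortedMap<Key, Value> reading = WholeRowIterator.decodeRow(entry.getKey(), entry.getValue());
        // ... operate on the whole SensorReading as a unit
      }
    }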

In addition, if I wanted to write a query like "for DeviceID = 1 in 2015, return me SensorReadings where Temperature > 90, Humidity < 40, Barometer > 31", I'd again have to either use the WholeRowIterator to 'see' each entire SensorReading in memory on the server for the compound query, or take the intersection of the results of 3 parallel, independent queries on the client side.
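With the WholeRowIterator in place, the compound check itself could run over each decoded reading, something like the following (a sketch only; it assumes the values are stored as decimal strings, as in the layout above):

    import java.util.Map;
    import java.util.SortedMap;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;

    // Compound check over one decoded reading (the SortedMap produced by
    // WholeRowIterator.decodeRow in the sketch above).
    static boolean matches(SortedMap<Key, Value> reading) {
      Double temp = null, hum = null, baro = null;
      for (Map.Entry<Key, Value> e : reading.entrySet()) {
        if (!e.getKey().getColumnFamily().toString().equals("Sensor")) continue;
        double v = Double.parseDouble(e.getValue().toString());
        String cq = e.getKey().getColumnQualifier().toString();
        if (cq.equals("Temperature")) temp = v;
        else if (cq.equals("Humidity")) hum = v;
        else if (cq.equals("Barometer")) baro = v;
      }
      return temp != null && temp > 90
          && hum != null && hum < 40
          && baro != null && baro > 31;
    }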

Where I am going with this is: what are the thoughts around creating a Java, Protobuf, Avro (etc.) object with these 4 CF, CQ pairs as fields and storing each SensorReading as a single 'Document'? Something like this (a rough code sketch follows the layout):
RowKey: DeviceID-YYYYMMDD
CF: ReadingID, Value: Protobuf(Timestamp=123, Temperature=80.4, Humidity=40.2, Barometer=29.1)
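In code, with a hypothetical generated protobuf class (message SensorReading { int64 timestamp = 1; double temperature = 2; double humidity = 3; double barometer = 4; }), the write and read look roughly like:

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    static void writeDocument(BatchWriter bw) throws Exception {
      SensorReading reading = SensorReading.newBuilder()
          .setTimestamp(123L)
          .setTemperature(80.4)
          .setHumidity(40.2)
          .setBarometer(29.1)
          .build();
      Mutation m = new Mutation("1-20150101");              // DeviceID-YYYYMMDD
      m.put("1234", "", new Value(reading.toByteArray()));  // CF = ReadingID, whole doc in one Value
      bw.addMutation(m);
    }

    // Reading it back is an ordinary scan plus one parse per entry, no WholeRowIterator:
    //   SensorReading r = SensorReading.parseFrom(entry.getValue().get());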

This way you avoid having to use the WholeRowIterator, and unless you often have queries that only look at a tiny subset of your fields (let's say just "Temperature"), the serialization costs seem similar since the Value is just bytes anyway.

Appreciate folks' experience and wisdom here. Hope this makes sense; happy to clarify.

Best.

-Mike