From: Dylan Hutchison
Date: Fri, 28 Aug 2015 02:50:49 -0400
Subject: Re: using combiner vs. building stats cache
To: Accumulo Dev List <dev@accumulo.apache.org>

Sounds like you have the idea now, Z.

There are three places an iterator can be applied: scan time, minor
compaction time, and major compaction time. Minor compactions help your
case a lot: when enough entries are written to a tablet server that it
needs to dump them to a new Hadoop RFile, the minor compaction iterators
run on the entries as they stream to the RFile. This means that each RFile
has only one entry for each unique (row, column family, column qualifier)
tuple. Entries with the same (row, column family, column qualifier) in
distinct RFiles will get combined at the next major compaction, or on the
fly during the next scan.
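As a rough, untested sketch of the setup we've been discussing (the table
name "stats", the stats:count column, and the already-constructed Connector
are placeholders, not something from your cluster), attaching a
SummingCombiner at all three scopes looks roughly like this:

  import java.util.Collections;
  import java.util.EnumSet;

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.IteratorSetting;
  import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
  import org.apache.accumulo.core.iterators.LongCombiner;
  import org.apache.accumulo.core.iterators.user.SummingCombiner;

  public class AttachSummingCombiner {
    // Attach a SummingCombiner to the (hypothetical) "stats" table at all
    // three scopes: scan, minor compaction (minc), and major compaction (majc).
    static void attach(Connector connector) throws Exception {
      IteratorSetting setting =
          new IteratorSetting(10, "countSum", SummingCombiner.class);
      // Only combine the stats:count column; other columns pass through untouched.
      SummingCombiner.setColumns(setting,
          Collections.singletonList(new IteratorSetting.Column("stats", "count")));
      // Values are stored as decimal strings, e.g. "1".
      SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
      connector.tableOperations().attachIterator("stats", setting,
          EnumSet.allOf(IteratorScope.class));
    }
  }

With that in place, each minor compaction writes at most one stats:count
entry per (row, column family, column qualifier) into the new RFile, and
scans and major compactions fold together whatever duplicates remain across
RFiles.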
> For example, let's say there are 100 rows of [foo, 1], it will actually
> be 'combined' to a single row [foo, 100]?

Careful: Accumulo's combiners combine on Keys with identical row, column
family, and column qualifier. You'd have to make a fancier iterator if you
want to combine all the entries that share the same row. Let us know if
you need help doing that.

On Thu, Aug 27, 2015 at 3:09 PM, z11373 wrote:

> Thanks again Russ!
>
> "but it might not be in this case if most of the data has already been
> combined"
> Does this mean Accumulo actually combines and persists the combined
> result after the scan/compaction (depending on which op the combiner is
> applied)? For example, let's say there are 100 rows of [foo, 1], it will
> actually be 'combined' to a single row [foo, 100]? If that is the case,
> then the combiner is not expensive.
>
> Wow! That's brilliant using the -1 approach, I didn't even think about
> it before. Yes, this will work for my case because I only need to know
> the count.
>
> Thanks,
> Z
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979p14988.html
> Sent from the Developers mailing list archive at Nabble.com.
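On the -1 approach quoted above: if I'm reading the earlier suggestion
right, the idea is just to write a -1 entry whenever an item goes away and
let the SummingCombiner fold it into the running count. A minimal sketch,
assuming the same hypothetical "stats" table with the combiner from the
earlier snippet attached:

  import java.nio.charset.StandardCharsets;

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;

  public class CountWrites {
    static void writeCounts(Connector connector) throws Exception {
      BatchWriter writer =
          connector.createBatchWriter("stats", new BatchWriterConfig());
      try {
        // Each occurrence of "foo" is written as a +1 in the stats:count column.
        Mutation increment = new Mutation("foo");
        increment.put("stats", "count",
            new Value("1".getBytes(StandardCharsets.UTF_8)));
        writer.addMutation(increment);

        // "Removing" an occurrence just writes a -1; the SummingCombiner
        // subtracts it from the running total the next time entries combine.
        Mutation decrement = new Mutation("foo");
        decrement.put("stats", "count",
            new Value("-1".getBytes(StandardCharsets.UTF_8)));
        writer.addMutation(decrement);
      } finally {
        writer.close();
      }
    }
  }

Scanning stats:count for row foo then returns the net count: 100 entries of
1 plus one entry of -1 come back as a single [foo, 99].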