From: Dylan Hutchison
Date: Fri, 28 Aug 2015 02:50:49 -0400
Subject: Re: using combiner vs. building stats cache
To: Accumulo Dev List <dev@accumulo.apache.org>

Sounds like you have the idea now, Z.

There are three places an iterator can be applied: scan time, minor
compaction time, and major compaction time. Minor compactions help your
case a lot: when enough entries are written to a tablet server that it
needs to dump them to a new Hadoop RFile, the minor compaction iterators
run on the entries as they stream to the RFile. This means that each RFile
has only one entry for each unique (row, column family, column qualifier)
tuple. Entries with the same (row, column family, column qualifier) in
distinct RFiles will get combined at the next major compaction, or on the
fly during the next scan.
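As a rough, untested sketch of the setup we've been discussing (the table
name "stats", the stats:count column, and the already-constructed Connector
are placeholders, not something from your cluster), attaching a
SummingCombiner at all three scopes looks roughly like this:

  import java.util.Collections;
  import java.util.EnumSet;

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.IteratorSetting;
  import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
  import org.apache.accumulo.core.iterators.LongCombiner;
  import org.apache.accumulo.core.iterators.user.SummingCombiner;

  public class AttachSummingCombiner {
    // Attach a SummingCombiner to the (hypothetical) "stats" table at all
    // three scopes: scan, minor compaction (minc), and major compaction (majc).
    static void attach(Connector connector) throws Exception {
      IteratorSetting setting =
          new IteratorSetting(10, "countSum", SummingCombiner.class);
      // Only combine the stats:count column; other columns pass through untouched.
      SummingCombiner.setColumns(setting,
          Collections.singletonList(new IteratorSetting.Column("stats", "count")));
      // Values are stored as decimal strings, e.g. "1".
      SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
      connector.tableOperations().attachIterator("stats", setting,
          EnumSet.allOf(IteratorScope.class));
    }
  }

With that in place, each minor compaction writes at most one stats:count
entry per (row, column family, column qualifier) into the new RFile, and
scans and major compactions fold together whatever duplicates remain across
RFiles.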
> For example, let's say there are 100 rows of [foo, 1], it will actually
> be 'combined' to a single row [foo, 100]?

Careful: Accumulo's combiners combine on Keys with identical row, column
family, and column qualifier. You'd have to make a fancier iterator if you
want to combine all the entries that share the same row. Let us know if
you need help doing that.

On Thu, Aug 27, 2015 at 3:09 PM, z11373 wrote:

> Thanks again Russ!
>
> "but it might not be in this case if most of the data has already been
> combined"
> Does this mean Accumulo actually combines and persists the combined
> result after the scan/compaction (depending on which op the combiner is
> applied)? For example, let's say there are 100 rows of [foo, 1], it will
> actually be 'combined' to a single row [foo, 100]? If that is the case,
> then the combiner is not expensive.
>
> Wow! That's brilliant using the -1 approach, I didn't even think about
> it before. Yes, this will work for my case because I only need to know
> the count.
>
> Thanks,
> Z
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979p14988.html
> Sent from the Developers mailing list archive at Nabble.com.
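On the -1 approach quoted above: if I'm reading the earlier suggestion
right, the idea is just to write a -1 entry whenever an item goes away and
let the SummingCombiner fold it into the running count. A minimal sketch,
assuming the same hypothetical "stats" table with the combiner from the
earlier snippet attached:

  import java.nio.charset.StandardCharsets;

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.BatchWriterConfig;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Value;

  public class CountWrites {
    static void writeCounts(Connector connector) throws Exception {
      BatchWriter writer =
          connector.createBatchWriter("stats", new BatchWriterConfig());
      try {
        // Each occurrence of "foo" is written as a +1 in the stats:count column.
        Mutation increment = new Mutation("foo");
        increment.put("stats", "count",
            new Value("1".getBytes(StandardCharsets.UTF_8)));
        writer.addMutation(increment);

        // "Removing" an occurrence just writes a -1; the SummingCombiner
        // subtracts it from the running total the next time entries combine.
        Mutation decrement = new Mutation("foo");
        decrement.put("stats", "count",
            new Value("-1".getBytes(StandardCharsets.UTF_8)));
        writer.addMutation(decrement);
      } finally {
        writer.close();
      }
    }
  }

Scanning stats:count for row foo then returns the net count: 100 entries of
1 plus one entry of -1 come back as a single [foo, 99].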