accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: micro compaction
Date Wed, 10 Jun 2015 16:28:07 GMT
I think re-using Iterators in the client-write path makes sense 
architecturally and is a logical progression for the reasons pointed out 
by Roman and Russ.

The big concern, as Keith pointed out, is that it's hard to directly apply 
iterators on the client-write side because we're not dealing with sorted 
key-values at that point. I think there could be ways to work around this.
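
As a rough illustration of one possible workaround (a sketch under
stated assumptions, not Accumulo code): buffer writes in memory, combine
values that share a key, and sort each batch before handing it off. The
class name CombiningBuffer, the max_keys threshold, and the flushed list
(a stand-in for a real writer) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical key shape for illustration: (row, column family, qualifier).
Key = Tuple[str, str, str]


class CombiningBuffer:
    """Buffers unsorted writes, combines values per key, emits sorted batches."""

    def __init__(self, combine: Callable[[int, int], int], max_keys: int = 100_000):
        self._combine = combine
        self._max_keys = max_keys
        self._pending: Dict[Key, int] = {}
        # Stand-in for a real writer; a client would send these entries instead.
        self.flushed: List[Tuple[Key, int]] = []

    def put(self, row: str, family: str, qualifier: str, value: int) -> None:
        key = (row, family, qualifier)
        if key in self._pending:
            # Same key seen again: combine in memory instead of emitting twice.
            self._pending[key] = self._combine(self._pending[key], value)
        else:
            self._pending[key] = value
            # Bound memory use by flushing once enough distinct keys are held.
            if len(self._pending) >= self._max_keys:
                self.flush()

    def flush(self) -> None:
        # Sort at flush time so the batch leaves in key order.
        for key in sorted(self._pending):
            self.flushed.append((key, self._pending[key]))
        self._pending.clear()


# Usage: four unsorted writes collapse into two combined entries.
buf = CombiningBuffer(combine=lambda a, b: a + b)
for row, value in [("r2", 1), ("r1", 5), ("r2", 3), ("r1", 2)]:
    buf.put(row, "stats", "count", value)
buf.flush()
```

Note that ordering and combining only hold within a single batch here.
The same key can still show up in two different batches, so a
server-side combiner would still be needed to merge those.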

I'd say if we have people who are interested in pursuing this, let's 
start a new discussion on dev@ where we can start laying some groundwork 
for the scope and implementation of what this solution would look like.

roman.drapeko@baesystems.com wrote:
> My view is that introduction of ingest-time iterators would be quite a
> useful feature. Anyway. :)
>
> Also, could anyone explain exactly why a composite mutation performs
> pretty much the same as a set of individual mutations?
>
> One large composite mutation with 19 qualifiers inside is just 10-30%
> faster than 19 individual mutations.
>
> *From:*Russ Weeks [mailto:rweeks@newbrightidea.com]
> *Sent:* 09 June 2015 20:54
> *To:* accumulo-user
> *Subject:* Re: micro compaction
>
> For consistency and ease of implementation. Say I've written a stack of
> combiners that do statistical aggregation, sampling etc. on my table.
> Rather than port that logic to a Storm topology or to the DStream API
> I'd just like to turn that stack on in my BatchWriter.
>
> On Tue, Jun 9, 2015 at 12:47 PM David Medinets <david.medinets@gmail.com
> <mailto:david.medinets@gmail.com>> wrote:
>
>     Consider using Storm, Pig, Spark, or your own framework to handle
>     the in-memory aggregation before giving the data to the BatchWriter.
>     Why would any part of Accumulo code be responsible for this kind of
>     application-specific data handling?
>
>     On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com
>     <mailto:roman.drapeko@baesystems.com> <roman.drapeko@baesystems.com
>     <mailto:roman.drapeko@baesystems.com>> wrote:
>
>     Just to clarify the origin of my question.
>
>     I had to do some performance tests to compare different storage
>     types of “raw” data against each other.
>
>     Hopefully, picture below is visible in the mailing list. If not, I
>     will put it somewhere else.
>
>     6 million “original” records, 1.3GB data, 233 bytes per record
>
>     Each original record has 40 tab-delimited fields, of which 19 on
>     average are non-null
>
>     BatchWriter, single Java program
>
>     First three bars represent single “heavy” mutation to insert the
>     whole tabular line / serialized object.
>
>     Bars 4, 5, 6, 7 – composite mutation (all qualifiers for the same
>     rowid in one mutation)
>
>     Bars 8, 9, 10, 11 – individual mutations (all qualifiers for the
>     same rowid in separate mutations) - ~19 mutations per original record
>
>     On average, single “heavy” mutations are 7-10 times faster than
>     anything else, composite are 10%-35% faster than individual.
>
>     I am not an expert on how Accumulo is implemented internally;
>     however, it looks like a composite mutation is treated more or less
>     the same way as a set of individual mutations. Probably the largest
>     overhead is added by the WAL.
>
>     Data utilization before and after manual compaction of test table
>     and all system tables:
>
>     It’s not clear why “accumulo du” shows half as much data used
>     compared to “hdfs du”.
>
>     All these tests made us think that we can improve performance by
>     doing some calculations in-memory (and our use-case fits very well)
>     and reducing number of mutations. Now I am trying to understand
>     whether there is a relatively easy way to do this with Accumulo or
>     whether it’s time to look closer into something like Spark.
>
>     Thanks
>
>     Roman
>
>     *From:*Adam Fuchs [mailto:afuchs@apache.org <mailto:afuchs@apache.org>]
>     *Sent:* 09 June 2015 19:08
>     *To:* user@accumulo.apache.org <mailto:user@accumulo.apache.org>
>     *Subject:* Re: micro compaction
>
>     I think this might be the same concept as in-mapper combining, but
>     applied to data being sent to a BatchWriter rather than an
>     OutputCollector. See [1], section 3.1.1. A similar performance
>     analysis and probably a lot of the same code should apply here.
>
>     Cheers,
>
>     Adam
>
>     [1]
>     http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>     On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rweeks@newbrightidea.com
>     <mailto:rweeks@newbrightidea.com>> wrote:
>
>     Having a combiner stack (more generally an iterator stack) run on
>     the client-side seems to be the second most popular request on this
>     list. The most popular being, "How do I write to Accumulo from
>     inside an iterator?"
>
>     Such a thing would be very useful for me, too. I have some cycles to
>     help out, if somebody can give me an idea of where to get started
>     and where the potential land-mines are.
>
>     -Russ
>
>     On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com
>     <mailto:roman.drapeko@baesystems.com> <roman.drapeko@baesystems.com
>     <mailto:roman.drapeko@baesystems.com>> wrote:
>
>         Aggregated output is tiny, so if I do the same calculations in
>         memory (instead of sending mutations to Accumulo), I can reduce
>         the overall number of mutations by 1000x or so
>
>
>
>         -----Original Message-----
>         From: Josh Elser [mailto:josh.elser@gmail.com
>         <mailto:josh.elser@gmail.com>]
>         Sent: 09 June 2015 16:54
>         To: user@accumulo.apache.org <mailto:user@accumulo.apache.org>
>         Subject: Re: micro compaction
>
>         Well, you win the prize for new terminology. I haven't ever
>         heard the term "micro compaction" before.
>
>         Can you clarify, though? You say hundreds of millions of
>         mutations that result in megabytes of data. Is that an increase
>         or decrease in size?
>         Comparing apples to oranges :)
>
>         roman.drapeko@baesystems.com
>         <mailto:roman.drapeko@baesystems.com> wrote:
>          > Hi guys,
>          >
>          > While doing pre-analytics we generate hundreds of millions of
>          > mutations that result in 1-100 megabytes of useful data after
>          > major compaction. We ingest into Accumulo using MR from a Mapper
>          > job. We found that performance really degrades as the number
>          > of mutations increases.
>          >
>          > The obvious improvement is to do some calculations in-memory
>          > before sending mutations to Accumulo.
>          >
>          > Of course, at the same time we are looking for a solution that
>          > minimizes development effort.
>          >
>          > I guess I am asking about micro compaction/ingest-time
>          > iterators on the client side (before data is sent to Accumulo).
>          >
>          > To my understanding, Accumulo does not support them; is that
>          > correct? And if so, are there any plans to support this
>          > functionality in the future?
>          >
>          > Thanks
>          >
>          > Roman
>          >
>          > Please consider the environment before printing this email. This
>          > message should be regarded as confidential. If you have
>          > received this email in error please notify the sender and
>          > destroy it immediately. Statements of intent shall only become
>          > binding when confirmed in hard copy by an authorised signatory.
>          > The contents of this email may relate to dealings with other
>          > companies under the control of BAE Systems Applied Intelligence
>          > Limited, details of which can be found at
>          > http://www.baesystems.com/Businesses/index.htm.
