accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: micro compaction
Date Wed, 10 Jun 2015 16:39:24 GMT
Good point! Another reason why we need to start writing this stuff down 
now :)

John Vines wrote:
> Don't forget that the client may not have the same iterators in memory
> as the server JVM so that would have to be worked around.
>
> On Wed, Jun 10, 2015 at 12:28 PM Josh Elser <josh.elser@gmail.com> wrote:
>
>     I think re-using Iterators in the client-write path makes sense
>     architecturally and is a logical progression for the reasons pointed out
>     by Roman and Russ.
>
>     The big concern, as Keith pointed out, is that it's hard to directly
>     apply iterators on the client-write side because we're not dealing in
>     sorted key-values at this point. I think there could be ways to work
>     around this.
>
>     I'd say if we have people who are interested in pursuing this, let's
>     start a new discussion on dev@ where we can start laying some groundwork
>     for the scope and implementation of what this solution would look like.
>
>     roman.drapeko@baesystems.com wrote:
>      > My view is that the introduction of ingest-time iterators would be
>      > quite a useful feature. Anyway. :)
>      >
>      > Also, could anyone explain exactly why a composite mutation performs
>      > pretty much the same as a set of individual mutations?
>      >
>      > One large composite mutation with 19 qualifiers inside is just 10-30%
>      > faster than 19 individual mutations.
>      >
>      > *From:* Russ Weeks [mailto:rweeks@newbrightidea.com]
>      > *Sent:* 09 June 2015 20:54
>      > *To:* accumulo-user
>      > *Subject:* Re: micro compaction
>      >
>      > For consistency and ease of implementation. Say I've written a stack
>      > of combiners that do statistical aggregation, sampling etc. on my
>      > table. Rather than port that logic to a Storm topology or to the
>      > DStream API I'd just like to turn that stack on in my BatchWriter.
>      >
>      > On Tue, Jun 9, 2015 at 12:47 PM David Medinets
>      > <david.medinets@gmail.com> wrote:
>      >
>      >     Consider using Storm, Pig, Spark, or your own framework to handle
>      >     the in-memory aggregation before giving the data to the
>      >     BatchWriter. Why would any part of Accumulo code be responsible
>      >     for this kind of application-specific data handling?
>      >
>      >     On Tue, Jun 9, 2015 at 3:17 PM, roman.drapeko@baesystems.com
>      >     wrote:
>      >
>      >     Just to clarify the origin of my question.
>      >
>      >     I had to do some performance tests to compare different storage
>      >     types of “raw” data against each other.
>      >
>      >     Hopefully, the picture below is visible in the mailing list. If
>      >     not, I will put it somewhere else.
>      >
>      >     6 million “original” records, 1.3GB data, 233 bytes per record
>      >
>      >     Each original record is 40 tab-delimited fields, of which 19 on
>      >     average are not null
>      >
>      >     BatchWriter, single Java program
>      >
>      >     Bars 1-3 represent a single “heavy” mutation to insert the
>      >     whole tabular line / serialized object.
>      >
>      >     Bars 4-7: composite mutation (all qualifiers for the same rowid
>      >     in one mutation)
>      >
>      >     Bars 8-11: individual mutations (all qualifiers for the same
>      >     rowid in separate mutations), ~19 mutations per original record
>      >
>      >     On average, single “heavy” mutations are 7-10 times faster than
>      >     anything else; composite mutations are 10%-35% faster than
>      >     individual ones.
>      >
>      >     I am not an expert on how Accumulo is implemented internally;
>      >     however, it looks like a composite mutation is treated more or
>      >     less in the same way as a set of individual mutations. Probably
>      >     the largest overhead is added by the WAL.
>      >
>      >     Data utilization before and after manual compaction of test table
>      >     and all system tables:
>      >
>      >     It’s not clear why “accumulo du” shows half as much data used
>      >     as “hdfs du”.
>      >
>      >     All these tests made us think that we can improve performance by
>      >     doing some calculations in-memory (our use-case fits very well)
>      >     and reducing the number of mutations. Now I am trying to
>      >     understand whether there is a relatively easy way to do this with
>      >     Accumulo or whether it’s time to look closer into something like
>      >     Spark.
>      >
>      >     Thanks
>      >
>      >     Roman
>      >
>      >     *From:* Adam Fuchs [mailto:afuchs@apache.org]
>      >     *Sent:* 09 June 2015 19:08
>      >     *To:* user@accumulo.apache.org
>      >     *Subject:* Re: micro compaction
>      >
>      >     I think this might be the same concept as in-mapper combining,
>      >     but applied to data being sent to a BatchWriter rather than an
>      >     OutputCollector. See [1], section 3.1.1. A similar performance
>      >     analysis and probably a lot of the same code should apply here.
>      >
>      >     Cheers,
>      >
>      >     Adam
>      >
>      >     [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>      >
>      >     On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks
>      >     <rweeks@newbrightidea.com> wrote:
>      >
>      >     Having a combiner stack (more generally an iterator stack) run on
>      >     the client-side seems to be the second most popular request on
>      >     this list. The most popular being, "How do I write to Accumulo
>      >     from inside an iterator?"
>      >
>      >     Such a thing would be very useful for me, too. I have some cycles
>      >     to help out, if somebody can give me an idea of where to get
>      >     started and where the potential land-mines are.
>      >
>      >     -Russ
>      >
>      >     On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com
>      >     wrote:
>      >
>      >         The aggregated output is tiny, so if I do the same calculations
>      >         in memory (instead of sending mutations to Accumulo), I can
>      >         reduce the overall number of mutations by 1000x or so
>      >
>      >
>      >
>      >         -----Original Message-----
>      >         From: Josh Elser [mailto:josh.elser@gmail.com]
>      >         Sent: 09 June 2015 16:54
>      >         To: user@accumulo.apache.org
>      >         Subject: Re: micro compaction
>      >
>      >         Well, you win the prize for new terminology. I haven't ever
>      >         heard the term "micro compaction" before.
>      >
>      >         Can you clarify, though: you say hundreds of millions of
>      >         mutations that result in megabytes of data. Is that an
>      >         increase or a decrease in size?
>      >         Comparing apples to oranges :)
>      >
>      > roman.drapeko@baesystems.com wrote:
>      > > Hi guys,
>      > >
>      > > While doing pre-analytics we generate hundreds of millions of
>      > > mutations that result in 1-100 megabytes of useful data after
>      > > major compaction. We ingest into Accumulo using MR from a Mapper
>      > > job. We identified that performance really degrades as the number
>      > > of mutations increases.
>      > >
>      > > The obvious improvement is to do some calculations in-memory
>      > > before sending mutations to Accumulo.
>      > >
>      > > Of course, at the same time we are looking for a solution to
>      > > minimize development effort.
>      > >
>      > > I guess I am asking about micro compaction/ingest-time iterators
>      > > on the client side (before data is sent to Accumulo).
>      > >
>      > > To my understanding, Accumulo does not support them; is that
>      > > correct? And if so, are there any plans to support this
>      > > functionality in the future?
>      > >
>      > > Thanks
>      > >
>      > > Roman
>      > >
>      > > Please consider the environment before printing this email. This
>      > > message should be regarded as confidential. If you have received
>      > > this email in error please notify the sender and destroy it
>      > > immediately. Statements of intent shall only become binding when
>      > > confirmed in hard copy by an authorised signatory. The contents of
>      > > this email may relate to dealings with other companies under the
>      > > control of BAE Systems Applied Intelligence Limited, details of
>      > > which can be found at
>      > > http://www.baesystems.com/Businesses/index.htm.
>      >
>      >
>
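
For anyone following the thread, the client-side pre-aggregation discussed above (Adam's in-mapper combining idea applied in front of a BatchWriter) could be sketched roughly as follows. This is only an illustration under stated assumptions, not an Accumulo feature: `CombiningBuffer`, its `maxKeys` threshold and the `sink` callback are hypothetical names, the sink stands in for whatever code builds a Mutation and hands it to a BatchWriter, and summation is assumed as the combine function. Because a buffer is flushed whenever it fills up, the same key can still be sent more than once, so a server-side combiner (for example SummingCombiner) would still be needed to merge the partial sums.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Hypothetical client-side buffer: sums values per key in memory and only
 * hands the combined result to a sink (a stand-in for a BatchWriter).
 */
public class CombiningBuffer {
    private final Map<String, Long> pending = new HashMap<>();
    private final int maxKeys;
    private final BiConsumer<String, Long> sink;

    public CombiningBuffer(int maxKeys, BiConsumer<String, Long> sink) {
        if (maxKeys < 1) {
            throw new IllegalArgumentException("maxKeys must be at least 1");
        }
        this.maxKeys = maxKeys;
        this.sink = sink;
    }

    /** Combine the value into the pending sum; flush once the buffer holds maxKeys distinct keys. */
    public void add(String key, long value) {
        pending.merge(key, value, Long::sum);
        if (pending.size() >= maxKeys) {
            flush();
        }
    }

    /** Send every pending partial sum to the sink and empty the buffer. */
    public void flush() {
        pending.forEach(sink);
        pending.clear();
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        // In real code the sink would build a Mutation and call BatchWriter.addMutation.
        CombiningBuffer buffer = new CombiningBuffer(1000, (k, v) -> sent.add(k + "=" + v));
        for (int i = 0; i < 6; i++) {
            buffer.add("row" + (i % 2) + ":count", 1L);
        }
        buffer.flush(); // always flush before closing the writer
        System.out.println(sent.size() + " updates sent for 6 inputs");
    }
}
```

Note that this only works for combine functions that are associative and commutative (sums, counts, min/max); it sidesteps the sorted key-value concern raised by Keith by not using the Iterator API at all.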
