accumulo-user mailing list archives

From Adam Fuchs <afu...@apache.org>
Subject Re: micro compaction
Date Tue, 09 Jun 2015 18:07:55 GMT
I think this might be the same concept as in-mapper combining, but applied
to data being sent to a BatchWriter rather than an OutputCollector. See
[1], section 3.1.1. A similar performance analysis and probably a lot of
the same code should apply here.

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
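
A minimal sketch of the in-mapper combining idea described above, moved to the ingest client: sum values per key in memory and hand only the aggregates to the writer. Everything here is an assumption made for illustration. The class name `ClientSideCombiner`, the `maxEntries` threshold, and the `sink` callback are hypothetical; `sink` stands in for code that would build a Mutation and pass it to a BatchWriter. It uses only the Java standard library and is not an Accumulo API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Hypothetical client-side summing buffer (not part of Accumulo).
 * Values for the same key are summed in memory; the sink sees one
 * aggregated value per key per flush instead of one write per input.
 */
final class ClientSideCombiner {
    private final Map<String, Long> buffer = new HashMap<>();
    private final int maxEntries;
    // Stand-in for "build a Mutation and hand it to a BatchWriter".
    private final BiConsumer<String, Long> sink;

    ClientSideCombiner(int maxEntries, BiConsumer<String, Long> sink) {
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    /** Adds one value, merging it with any buffered value for the same key. */
    void add(String key, long value) {
        buffer.merge(key, value, Long::sum);
        // Bound memory use: flush once the number of distinct keys hits the limit.
        if (buffer.size() >= maxEntries) {
            flush();
        }
    }

    /** Emits every buffered aggregate to the sink and empties the buffer. */
    void flush() {
        buffer.forEach(sink);
        buffer.clear();
    }
}
```

In a Mapper, `add` would be called from `map()` and `flush` from `cleanup()`. Because each flush emits only partial sums, a combiner on the table would presumably still be needed to merge the partials written by different flushes and different mappers.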

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rweeks@newbrightidea.com> wrote:

> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
> -Russ
>
> On Tue, Jun 9, 2015 at 9:08 AM roman.drapeko@baesystems.com <
> roman.drapeko@baesystems.com> wrote:
>
>> Aggregated output is tiny, so if I do the same calculations in memory
>> (instead of sending mutations to Accumulo), I can reduce the overall number
>> of mutations by 1000x or so.
>>
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: 09 June 2015 16:54
>> To: user@accumulo.apache.org
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the
>> term "micro compaction" before.
>>
>> Can you clarify, though: you say hundreds of millions of mutations
>> result in megabytes of data. Is that an increase or a decrease in size?
>> Comparing apples to oranges :)
>>
>> roman.drapeko@baesystems.com wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo from the Mapper of a MapReduce job.
>> > We found that performance degrades significantly as the number of
>> > mutations increases.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them. Is that correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>> > Please consider the environment before printing this email. This
>> > message should be regarded as confidential. If you have received this
>> > email in error please notify the sender and destroy it immediately.
>> > Statements of intent shall only become binding when confirmed in hard
>> > copy by an authorised signatory. The contents of this email may relate
>> > to dealings with other companies under the control of BAE Systems
>> > Applied Intelligence Limited, details of which can be found at
>> > http://www.baesystems.com/Businesses/index.htm.
>>
>
