accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3246) More insight into native map utilization/usage
Date Tue, 21 Oct 2014 19:54:37 GMT


Eric Newton commented on ACCUMULO-3246:

A bigger IMM will still be used.  It just doesn't help for long-running ingest (which is the
world I live in).

Let's say you have 10G to ingest, 1G / unit time, and a 1G IMM.

At .5 G, the IMM starts minor compacting.  It can write out that .5G at about the same speed
as the WAL can accept the next .5G.

So, by the time the first .5G is done writing, we can start writing the next .5G.

Doubling the IMM just moves the bar from .5G chunks to 1G chunks.  Both of these are large
enough to take advantage of compression and write buffer sizes.
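
The arithmetic above can be sketched as a toy model. This is only an illustration, not Accumulo code: the function name `simulate_ingest` and its parameters are hypothetical. It assumes, as in the example, that a minor compaction starts when the IMM is half full and that a chunk flushes in the same time it takes to ingest the next one.

```python
def simulate_ingest(total_gb, rate_gb_per_unit, imm_gb):
    """Toy model of the ingest/flush pipeline described above (hypothetical helper)."""
    # Assumption from the example: the IMM starts minor compacting at half its size.
    chunk_gb = imm_gb / 2
    # Number of minor compactions needed to flush all of the data.
    flushes = round(total_gb / chunk_gb)
    # Ingest is limited by the incoming rate, not by the IMM size.
    ingest_time = total_gb / rate_gb_per_unit
    # Assumption: a flush overlaps with ingest of the next chunk,
    # so only the final flush adds time after ingest ends.
    finish_time = ingest_time + chunk_gb / rate_gb_per_unit
    return {
        "chunk_gb": chunk_gb,
        "flushes": flushes,
        "ingest_time": ingest_time,
        "finish_time": finish_time,
    }


small = simulate_ingest(total_gb=10, rate_gb_per_unit=1, imm_gb=1)
large = simulate_ingest(total_gb=10, rate_gb_per_unit=1, imm_gb=2)
print(small)
print(large)
```

Under these assumptions both runs take 10 units of time to ingest the 10G. The 2G IMM halves the number of flushes (10 instead of 20) but does not change the sustained ingest rate.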

You can argue that you will do fewer major compactions, and that's true.  But these also occur
in the background, and don't affect query/ingest except that they consume resources, create
disk contention and invalidate blocks/buffers.  Bigger flushes will require longer major compactions
when they finally happen, so there's no win.

So, the IMM for each actively ingesting tablet should be ~ HDFS block size.  More IMM will
be used, and will give you some big numbers on initial ingest, but sustained ingest will not
improve.

Because aggregation/combiners run only at compaction time, a larger IMM may actually hurt.

> More insight into native map utilization/usage
> ----------------------------------------------
>                 Key: ACCUMULO-3246
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Josh Elser
>             Fix For: 1.7.0
> I often find that I choose a value for the size of the native map out of the air without
> really having a good understanding of why I chose it (aside from considerations of table.compaction.minor.logs.threshold
> and tserver.walog.max.size).
> We don't have any insight into some basic metrics on the native maps. It would be nice
> to be able to answer questions like
> * What is the utilization (space) of the native maps for a server
> * How much time is the server spending writing data as opposed to allocating new blocks
> I'm sure there are some other questions too.
