accumulo-user mailing list archives

From Keith Turner
Subject Re: Comparing Schemes in Accumulo
Date Mon, 07 Nov 2016 12:04:09 GMT
On Mon, Nov 7, 2016 at 6:54 AM, Oliver Swoboda wrote:
> Hello,
> I've stored weather data in two tables with different schemes. Scheme1 is
> using the month and station ID for the row key (e.g. 201601_GME00102292) and
> the days of the month (1-31) in the version column. Scheme2 is using the
> year and station ID for the row key (e.g. 2016_GME00102292) and the days of
> the year (1-366) in the version column. Of course, the version iterator has
> been removed from the tables. Because I have different metrics, like minimum
> temperature and maximum temperature of one day, I'm using locality groups,
> one group for each metric. (e.g. setgroups TMIN=TMIN, TMAX=TMAX).
> Additionally, I've pre-split the tables by year (e.g. 2014, 2015, 2016, ...).
> Now to my question: If I do a full table scan with a batch scanner, Scheme2
> is always faster than Scheme1 (with 2.5 billion entries Scheme1's scan took
> 24 minutes and Scheme2's scan took 21 minutes). Why is that? Is it because
> there are fewer seeks made when using Scheme2? Would be nice if someone can
> help me to understand what's happening here.

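One possible reason is the relative key encoding used in Accumulo.  When
two consecutive keys have the same row, the row is not stored again for
the second key; the key just refers back to the previous key's row.  This
makes row comparisons faster.  Also, when data is transferred over the
network from server to client, repeated rows are not transferred.  For
the same data, Scheme2 has roughly one twelfth as many distinct rows as
Scheme1 (one per year and station instead of one per month and station),
so more of its consecutive keys share a row, which may account for some
of the difference.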
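
To make the row-sharing argument concrete, here is a minimal Python sketch.
It is not Accumulo's actual encoding code; it only models the idea that a
row equal to the previous key's row is skipped.  The function names are
made up for illustration, and it assumes one key per day for a single
station and a single metric.

```python
import calendar


def scheme1_rows(year, station):
    """Rows for Scheme1: one row per month and station, one key per day."""
    rows = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        rows.extend([f"{year}{month:02d}_{station}"] * days_in_month)
    return rows


def scheme2_rows(year, station):
    """Rows for Scheme2: one row per year and station, one key per day."""
    days_in_year = 366 if calendar.isleap(year) else 365
    return [f"{year}_{station}"] * days_in_year


def rows_stored_with_elision(rows):
    """Return (rows_stored, row_bytes_stored) when a row that equals the
    previous key's row is skipped instead of being stored again."""
    stored = 0
    stored_bytes = 0
    previous = None
    for row in rows:
        if row != previous:
            stored += 1
            stored_bytes += len(row.encode("utf-8"))
        previous = row
    return stored, stored_bytes


scheme1 = rows_stored_with_elision(scheme1_rows(2016, "GME00102292"))
scheme2 = rows_stored_with_elision(scheme2_rows(2016, "GME00102292"))
print("Scheme1 (rows stored, row bytes):", scheme1)
print("Scheme2 (rows stored, row bytes):", scheme2)
```

For 2016 and the station ID from the question, both schemes produce 366
keys, but this model stores 12 rows under Scheme1 and only 1 under Scheme2.
How much of the measured three-minute difference this explains is not
something the sketch can show.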

> Yours faithfully,
> Oliver Swoboda
