From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4669) RFile can create very large blocks when key statistics are not uniform
Date Tue, 15 Aug 2017 19:52:00 GMT
```
[ https://issues.apache.org/jira/browse/ACCUMULO-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127769#comment-16127769
]

Keith Turner commented on ACCUMULO-4669:
----------------------------------------

[~afuchs] I was experimenting with the code snippet you provided in the initial comment.
I simplified it to exercise just the stats and found that len quickly reaches
max int.  Below is my attempt to simplify the code.  At the 123rd iteration of the loop
len is max int.   Did I miss something in the simplification?

{code:java}
SummaryStatistics keyLenStats = new SummaryStatistics();

int len = 100;
for (int i = 0; i < 1000; i++) {
  // record the current length; without this the stats stay empty and getMean() is NaN
  keyLenStats.addValue(len);
  len = Math.max(len, (int) Math.ceil(keyLenStats.getMean() + keyLenStats.getStandardDeviation() * 4 + 0.0001));

  System.out.printf("%3d %d\n", i, len);
}
{code}

I wrote the following test that reproduces this problem. It eventually classifies everything
as giant and prints stats about when this happened.

{code:java}
int numRuns = 100;

// number of keys with the same len
int runSize = 10000;

Map<Integer,Long> giant = new TreeMap<>();

SummaryStatistics keyLenStats = new SummaryStatistics();

int len = 100;

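for (int i = 0; i < numRuns; i++) {
  for (int j = 0; j < runSize; j++) {
    // record this key's length (assumed placement); without it the stats stay empty and the check below is never true
    keyLenStats.addValue(len);
    if (len > keyLenStats.getMean() + keyLenStats.getStandardDeviation() * 3) {
      giant.compute(len, (k, v) -> v == null ? 1 : v + 1);
    }
  }

  // each run uses keys 10% longer than the previous run
  len = (int) (len * 1.1);
}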

giant.forEach((l, n) -> System.out.printf("keylen: %,10d  giants: %,8d %3.0f%s\n",
    l, n, n / (double) runSize * 100, "%"));
long totalGiants = giant.values().stream().mapToLong(l -> l).sum();
System.out.printf("totalGiants : %,8d  %3.0f%s\n", totalGiants,
    totalGiants / (double) (numRuns * runSize) * 100, "%");
{code}

> The following code produces arbitrarily large RFile blocks:
> {code}
>   FileSKVWriter writer = RFileOperations.getInstance().openWriter(filename, fs, conf, acuconf);
>   writer.startDefaultLocalityGroup();
>   SummaryStatistics keyLenStats = new SummaryStatistics();
>   Random r = new Random();
>   byte [] buffer = new byte[minRowSize];
>   for(int i = 0; i < 100000; i++) {
>     byte [] valBytes = new byte[valLength];
>     r.nextBytes(valBytes);
>     r.nextBytes(buffer);
>     ByteBuffer.wrap(buffer).putInt(i);
>     Key k = new Key(buffer, 0, buffer.length, emptyBytes, 0, 0, emptyBytes, 0, 0, emptyBytes, 0, 0, 0);
>     Value v = new Value(valBytes);
>     writer.append(k, v);
>     int newBufferSize = Math.max(buffer.length, (int) Math.ceil(keyLenStats.getMean() + keyLenStats.getStandardDeviation() * 4 + 0.0001));
>     buffer = new byte[newBufferSize];
>     if(keyLenStats.getSum() > targetSize)
>       break;
>   }
>   writer.close();
> {code}
with message "Requested array size exceeds VM limit". This will only happen if the block cache
size is big enough to hold the expected raw block size, 2GB in our case. This message is rare,
and really only happens when allocating an array of size Integer.MAX_VALUE or Integer.MAX_VALUE-1
on the hotspot JVM. Integer.MAX_VALUE happens in this case due to some strange handling of
raw block sizes in the BCFile code. Most OutOfMemoryExceptions have different messages.


```
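
The feedback loop discussed in the comment above (each new length is derived from the mean plus four standard deviations of the lengths seen so far) can be sketched without the commons-math dependency. This is only a sketch under stated assumptions, not the RFile code: `FeedbackGrowthSketch`, `RunningStats`, and `grow` are hypothetical names, Welford's running mean and sample standard deviation stand in for `SummaryStatistics`, and each length is assumed to be recorded in the statistics before the next length is derived.

```java
import java.util.ArrayList;
import java.util.List;

class FeedbackGrowthSketch {

  /** Hypothetical stand-in for SummaryStatistics: running mean and sample std dev (Welford's method). */
  static final class RunningStats {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the running mean

    void add(double x) {
      n++;
      double delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean);
    }

    double mean() {
      return n == 0 ? Double.NaN : mean;
    }

    double stdDev() {
      if (n == 0) {
        return Double.NaN;
      }
      // a single sample is treated as having zero spread
      return n == 1 ? 0.0 : Math.sqrt(m2 / (n - 1));
    }
  }

  /**
   * Feeds each length back into the statistics that produce the next length.
   * Stops once len saturates at Integer.MAX_VALUE or after maxIterations.
   */
  static List<Integer> grow(int start, double sigmas, int maxIterations) {
    RunningStats stats = new RunningStats();
    List<Integer> lens = new ArrayList<>();
    int len = start;
    for (int i = 0; i < maxIterations && len < Integer.MAX_VALUE; i++) {
      stats.add(len); // assumed step: record the current length first
      // casting a double above the int range to int saturates at Integer.MAX_VALUE
      len = Math.max(len, (int) Math.ceil(stats.mean() + stats.stdDev() * sigmas + 0.0001));
      lens.add(len);
    }
    return lens;
  }

  public static void main(String[] args) {
    List<Integer> lens = grow(100, 4.0, 1000);
    System.out.printf("iterations: %d, final len: %d%n", lens.size(), lens.get(lens.size() - 1));
  }
}
```

Because every new length sits above the mean by several standard deviations and is then added to the same statistics, the threshold never catches up with the values it is supposed to bound. The `+ 0.0001` with `Math.ceil` guarantees growth even when the standard deviation is still zero.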
