Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@apache.org
Date: Sat, 19 Mar 2016 19:15:33 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: notifications@accumulo.apache.org
Message-ID: <JIRA.12949982.1457978093000.68299.1458414933500@Atlassian.JIRA>
In-Reply-To: <JIRA.12949982.1457978093000@Atlassian.JIRA>
References: <JIRA.12949982.1457978093000@Atlassian.JIRA>
 <JIRA.12949982.1457978093888@arcas>
Subject: [jira] [Commented] (ACCUMULO-4164) Avoid copy of RFile Index blocks
 when in cache
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/ACCUMULO-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202926#comment-15202926 ] 

ASF GitHub Bot commented on ACCUMULO-4164:
------------------------------------------

Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/80#discussion_r56753672
  
    --- Diff: core/src/main/java/org/apache/accumulo/core/file/blockfile/impl/SeekableByteArrayInputStream.java ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.file.blockfile.impl;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +
    +/**
    + * This class is like byte array input stream with two differences. It supports seeking and avoids synchronization.
    + */
    +public class SeekableByteArrayInputStream extends InputStream {
    +
    +  // make this volatile to ensure data set by one thread can be seen by another
    +  private volatile byte buffer[];
    +  private int cur;
    +  private int max;
    +
    +  @Override
    +  public int read() {
    +    if (cur < max) {
    +      return buffer[cur++] & 0xff;
    +    } else {
    +      return -1;
    +    }
    +  }
    +
    +  @Override
    +  public int read(byte b[], int offset, int length) {
    +    if (b == null) {
    +      throw new NullPointerException();
    +    }
    +
    +    if (length < 0 || offset < 0 || length > b.length - offset) {
    +      throw new IndexOutOfBoundsException();
    +    }
    +
    +    if (length == 0) {
    +      return 0;
    +    }
    +
    +    int avail = max - cur;
    +
    +    if (avail <= 0) {
    +      return -1;
    +    }
    +
    +    if (length > avail) {
    +      length = avail;
    +    }
    +
    +    System.arraycopy(buffer, cur, b, offset, length);
    +    cur += length;
    +    return length;
    +  }
    +
    +  @Override
    +  public long skip(long requestedSkip) {
    +    long actualSkip = max - cur;
    +    if (requestedSkip < actualSkip)
    +      if (requestedSkip < 0)
    +        actualSkip = 0;
    +      else
    +        actualSkip = requestedSkip;
    +
    +    cur += actualSkip;
    +    return actualSkip;
    +  }
    +
    +  @Override
    +  public int available() {
    +    return max - cur;
    +  }
    +
    +  @Override
    +  public boolean markSupported() {
    +    return false;
    +  }
    +
    +  @Override
    +  public void mark(int readAheadLimit) {
    +    throw new UnsupportedOperationException();
    +  }
    +
    +  @Override
    +  public void reset() {
    +    throw new UnsupportedOperationException();
    +  }
    +
    +  @Override
    +  public void close() throws IOException {}
    +
    +  public SeekableByteArrayInputStream(byte[] buf) {
    +    this.buffer = buf;
    +    this.cur = 0;
    +    this.max = buf.length;
    +  }
    +
    +  public SeekableByteArrayInputStream(byte[] buf, int maxOffset) {
    +    this.buffer = buf;
    +    this.cur = 0;
    +    this.max = maxOffset;
    +  }
    +
    +  public void seek(int position) {
    +    if (position < 0 || position >= max)
    +      throw new IllegalArgumentException("position = " + position + " maxOffset = " + max);
    +    this.cur = position;
    +  }
    +
    +  public int getPosition() {
    +    return this.cur;
    +  }
    +
    +  public byte[] getBuffer() {
    --- End diff --
    
    Javadoc to be explicit that this is the actual array (changes to it will be reflected in the InputStream)? Do we want this to be `public`?
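
To make the aliasing concern concrete, here is a minimal sketch. `SharedBufferStream` is a hypothetical stand-in, not the class from the diff, and it assumes `getBuffer()` simply returns the field, which the truncated diff does not show. It illustrates why the Javadoc should say the array is shared: a write through the returned reference is visible to later reads.

```java
import java.io.InputStream;

final class BufferAliasingDemo {

  // Hypothetical stand-in for the stream under review; only what the demo needs.
  static final class SharedBufferStream extends InputStream {
    private final byte[] buffer;
    private int cur = 0;

    SharedBufferStream(byte[] buf) {
      this.buffer = buf; // keeps a reference to the caller's array, no copy
    }

    @Override
    public int read() {
      return cur < buffer.length ? buffer[cur++] & 0xff : -1;
    }

    /**
     * Returns the backing array itself, not a copy. Changes made to the
     * returned array are visible to subsequent reads from this stream.
     * (Assumed behavior; the getter body is cut off in the diff.)
     */
    public byte[] getBuffer() {
      return buffer;
    }
  }

  // Mutates the array obtained from getBuffer(), then reads the first byte.
  public static int firstByteAfterMutation() {
    SharedBufferStream in = new SharedBufferStream(new byte[] {1, 2, 3});
    in.getBuffer()[0] = 42; // write through the returned reference
    return in.read();
  }

  public static void main(String[] args) {
    System.out.println(firstByteAfterMutation());
  }
}
```

If callers outside the package never need the raw array, reducing the getter's visibility would narrow who can mutate the shared data.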


> Avoid copy of RFile Index blocks when in cache
> ----------------------------------------------
>
>                 Key: ACCUMULO-4164
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4164
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.6.6, 1.7.2, 1.8.0
>
>
> I have been doing performance experiments with RFile.  During the course of these experiments I noticed that RFile is not as fast as it should be in the case where index blocks are in cache and the RFile is not already open.  The reason is that the RFile code copies and deserializes the index data even though it's already in memory.
> I made the following change to RFile in a branch.
>  * Avoid copy of index data when it's in cache
>  * Deserialize offsets lazily (instead of upfront) during binary search
>  * Stopped calling lots of synchronized methods during deserialization of index info.  The existing code uses ByteArrayInputStream, which results in lots of fine-grained synchronization.  Switching to an input stream that offers the same functionality without synchronization showed a measurable performance difference.
> These changes lead to better performance in the following two situations:
>  * When an RFile's data is in cache, but it's not open on the tserver.
>  * For RFiles with multilevel indexes with index data in cache.   Currently an open RFile only keeps the root node in memory.   Lower level index nodes are always read from the cache or DFS.   The changes I made would always avoid the copy and deserialization of lower level index nodes when in cache.
> I have seen significant performance improvements testing with the two cases above.  My tests are currently based on a new API I am creating for RFile, so I cannot easily share them until I get that pushed.
> For the case where a tserver has all files frequently in use already open and those files have a single level index, these changes should not make a significant performance difference.
> These changes should result in less memory use when opening the same RFile multiple times for different scans (when data is in cache).  In this case all of the RFiles would share the same byte array holding the serialized index data.
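
The idea of deserializing offsets lazily during binary search can be sketched as follows. This is an illustration only: the byte layout, the class `LazyIndexSearch`, and its methods are assumptions made for the example and are not the RFile index format or Accumulo API. The search works directly on the shared serialized array and decodes only the offsets and keys it actually probes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class LazyIndexSearch {

  // Hypothetical layout (NOT the RFile format):
  //   [int n][n x int offset][entries...], each entry = [int len][len key bytes].
  // Keys must be sorted in unsigned byte order (plain ASCII strings sorted normally qualify).
  public static byte[] serialize(String... sortedKeys) {
    int header = 4 + 4 * sortedKeys.length;
    int size = header;
    byte[][] raw = new byte[sortedKeys.length][];
    for (int i = 0; i < sortedKeys.length; i++) {
      raw[i] = sortedKeys[i].getBytes(StandardCharsets.UTF_8);
      size += 4 + raw[i].length;
    }
    ByteBuffer out = ByteBuffer.allocate(size);
    out.putInt(sortedKeys.length);
    int offset = header;
    for (byte[] k : raw) {
      out.putInt(offset); // absolute position of this entry
      offset += 4 + k.length;
    }
    for (byte[] k : raw) {
      out.putInt(k.length).put(k);
    }
    return out.array();
  }

  // Compares the key stored at index[pos, pos + len) with 'key' without copying it.
  private static int compareAt(byte[] index, int pos, int len, byte[] key) {
    int n = Math.min(len, key.length);
    for (int i = 0; i < n; i++) {
      int d = (index[pos + i] & 0xff) - (key[i] & 0xff);
      if (d != 0) {
        return d;
      }
    }
    return len - key.length;
  }

  // Binary search over the serialized bytes. Returns the entry number if found,
  // otherwise -(insertionPoint + 1), following the java.util.Arrays convention.
  public static int search(byte[] index, String key) {
    byte[] target = key.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.wrap(index); // wraps the array, does not copy it
    int lo = 0;
    int hi = buf.getInt(0) - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int entry = buf.getInt(4 + 4 * mid); // only this offset is decoded
      int len = buf.getInt(entry);
      int cmp = compareAt(index, entry + 4, len, target);
      if (cmp == 0) {
        return mid;
      } else if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return -(lo + 1);
  }
}
```

Because `search` never writes to the array and keeps no per-stream state, several readers could in principle share one cached `byte[]`, which matches the memory-sharing point in the description.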

