Date: Fri, 19 May 2017 20:42:04 +0000 (UTC)
From: "Paul Rogers (JIRA)"
To: dev@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Created] (DRILL-5529) Repeated vectors missing "fill empties" logic

Paul Rogers created DRILL-5529:
----------------------------------

             Summary: Repeated vectors missing "fill empties" logic
                 Key: DRILL-5529
                 URL: https://issues.apache.org/jira/browse/DRILL-5529
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.11.0

Consider the Drill {{OptionalVarCharVector}} type. This vector is composed of three buffers (also called vectors):

* Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
* Data vector; effectively a byte array in which each value is packed one after another.
* Offset vector, in which the entry for each row points to the first byte of the value in the data vector.

Suppose we have the values "foo", null, "bar". Then, the vectors contain:

{code}
Is-Set:  [1 0 1]
Offsets: [0 3 3 6]
Data:    [f o o b a r]
{code}

(Yes, there is one more offset entry than rows.)

Suppose that the code creating the vector writes values for rows 1 and 3, but omits 2 (it is null, which is the default). How do we get that required value of 3 in the entry for row 2?
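To see why that entry matters, here is a minimal sketch that models the three buffers with plain Java arrays. It is not Drill's API: the class name {{NullableVarCharLayout}} and the method {{read}} are hypothetical, and rows are indexed from 0 here (so the null row is index 1). It reads the "foo", null, "bar" example once with the correct offsets and once with the null row's offset left at 0.

```java
import java.nio.charset.StandardCharsets;

// Simplified model of a nullable VarChar vector using plain arrays.
// NOT Drill's API; all names here are hypothetical. Rows are 0-based.
class NullableVarCharLayout {

    // Returns the value stored at `row`, or null if its is-set entry is 0.
    static String read(byte[] isSet, int[] offsets, byte[] data, int row) {
        if (isSet[row] == 0) {
            return null;
        }
        int start = offsets[row];               // first byte of this value
        int length = offsets[row + 1] - start;  // next offset marks the end
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] isSet = {1, 0, 1};
        byte[] data = "foobar".getBytes(StandardCharsets.UTF_8);

        int[] filled = {0, 3, 3, 6};    // offsets as shown in the issue
        int[] unfilled = {0, 3, 0, 6};  // entry after the null row never written

        for (int row = 0; row < isSet.length; row++) {
            System.out.println(row + ": filled=" + read(isSet, filled, data, row)
                + " unfilled=" + read(isSet, unfilled, data, row));
        }
    }
}
```

With the unfilled offsets, the last row starts at byte 0 instead of byte 3, so it reads "foobar" rather than "bar". The offset entry after a null row is what positions the next non-null value.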
The answer is that the logic for setting a value keeps track of the last write position and "backfills" missing offset values:

{code}
public void setSafe(int index, ByteBuffer value, int start, int length) {
  if (index > lastSet + 1) {
    fillEmpties(index);
  }
  ...
{code}

So, when we write the value for row 3 ("bar") we back-fill the missing offset for row 2. So far so good.

We can now generalize. We must do the same trick any time that we use a vector that uses an offset vector. There are three other cases:

* Required variable-width vectors (where a missing value is the same as an empty string).
* A repeated fixed-width vector.
* A repeated variable-width vector (which has *two* offset vectors).

The problem is, none of these actually provide the required code. The caller must implement its own back-fill logic, or else the offset vectors become corrupted. Consider the required {{VarCharVector}}:

{code}
protected void set(int index, byte[] bytes, int start, int length) {
  assert index >= 0;
  final int currentOffset = offsetVector.getAccessor().get(index);
  offsetVector.getMutator().set(index + 1, currentOffset + length);
  data.setBytes(currentOffset, bytes, start, length);
}
{code}

As a result of this omission, any client which skips null or empty values will corrupt offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing record 2 (the empty string, counting from 0).

Desired result:

{code}
Data:    [t r y f o o b a r]
Offsets: [0 3 6 6 9]
{code}

Actual result:

{code}
Data:    [t r y f o o b a r]
Offsets: [0 3 6 0 9]
{code}

With these offsets, we compute the width of field 2 (the empty string) as 0 - 6 = -6, not 0, and the width of the following field ("bar") as 9 - 0 = 9, not 3.

A similar issue arises with repeated vectors.
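Before turning to those, here is a minimal sketch of what back-filling could look like for a required variable-width vector. It uses plain arrays, not Drill's vector classes; the class name {{BackfillSketch}} and its fields are hypothetical, and only {{lastSet}} and {{fillEmpties}} echo names from the nullable vector code quoted above. It writes "try", "foo", skips record 2, then writes "bar".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified stand-in for a required variable-width vector.
// NOT Drill code; a sketch of the back-fill idea only.
class BackfillSketch {
    final int[] offsets;  // offsets[i] = start of value i, offsets[i + 1] = its end
    final byte[] data;
    int lastSet = -1;     // index of the most recently written value

    BackfillSketch(int rowCapacity, int dataCapacity) {
        offsets = new int[rowCapacity + 1];
        data = new byte[dataCapacity];
    }

    // Carry the current end offset forward across rows that were skipped,
    // so each skipped row becomes a zero-width (empty) value.
    void fillEmpties(int index) {
        for (int i = lastSet + 1; i < index; i++) {
            offsets[i + 1] = offsets[i];
        }
    }

    void set(int index, byte[] bytes) {
        if (index > lastSet + 1) {
            fillEmpties(index);
        }
        int start = offsets[index];
        System.arraycopy(bytes, 0, data, start, bytes.length);
        offsets[index + 1] = start + bytes.length;
        lastSet = index;
    }

    public static void main(String[] args) {
        BackfillSketch v = new BackfillSketch(4, 16);
        v.set(0, "try".getBytes(StandardCharsets.US_ASCII));
        v.set(1, "foo".getBytes(StandardCharsets.US_ASCII));
        // Record 2 (the empty string) is deliberately not written.
        v.set(3, "bar".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Arrays.toString(v.offsets)); // [0, 3, 6, 6, 9]
    }
}
```

The sketch assumes values are written in increasing index order, as the {{lastSet}} check in the nullable vector does. A skipped row at the very end of a batch would still need a separate final fill.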
Consider {{RepeatedVarCharVector}}:

{code}
public void addSafe(int index, byte[] bytes, int start, int length) {
  final int nextOffset = offsets.getAccessor().get(index+1);
  values.getMutator().setSafe(nextOffset, bytes, start, length);
  offsets.getMutator().setSafe(index+1, nextOffset+1);
}
{code}

Consider this example: (\["a", "b"], \[ ], \["d", "e"]).

Expected:

{code}
Array Offset: [0 2 2 4]
Value Offset: [0 1 2 3 4]
Data:         [a b d e]
{code}

Actual:

{code}
Array Offset: [0 2 0 4]
Value Offset: [0 1 2 3 4]
Data:         [a b d e]
{code}

The entry for the (unwritten) position 2 is missing.

This bug may be the root cause of several other issues found recently. (Potentially DRILL-5470 -- need to verify.)

Two resolutions are possible:

* Require that client code write all values, backfilling empty or null values as needed.
* Generalize the mutators to back-fill in all cases, not just a nullable var char.

A related issue occurs when a reader fails to do a "final fill" at the end of a batch (DRILL-5487).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)