Return-Path: X-Original-To: apmail-arrow-dev-archive@minotaur.apache.org Delivered-To: apmail-arrow-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D910A189BD for ; Thu, 10 Mar 2016 07:31:57 +0000 (UTC) Received: (qmail 65976 invoked by uid 500); 10 Mar 2016 07:31:57 -0000 Delivered-To: apmail-arrow-dev-archive@arrow.apache.org Received: (qmail 65915 invoked by uid 500); 10 Mar 2016 07:31:57 -0000 Mailing-List: contact dev-help@arrow.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@arrow.apache.org Delivered-To: mailing list dev@arrow.apache.org Received: (qmail 65903 invoked by uid 99); 10 Mar 2016 07:31:57 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 10 Mar 2016 07:31:57 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 00E73C03E9 for ; Thu, 10 Mar 2016 07:31:57 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.429 X-Spam-Level: * X-Spam-Status: No, score=1.429 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 54EExeuU0mnO for ; Thu, 10 Mar 2016 07:31:55 +0000 (UTC) Received: from mail-vk0-f50.google.com (mail-vk0-f50.google.com [209.85.213.50]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id C67965F640 for ; Thu, 10 Mar 2016 07:31:54 +0000 (UTC) Received: by mail-vk0-f50.google.com with SMTP id k1so85738499vkb.0 for ; Wed, 09 Mar 2016 23:31:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to; bh=2v8YTpQFg3Ic/hGG/m654HKGN1XqvP/xzyTou7OPfFo=; b=F3Jajh9LQeN/mO0eAK/tY2+OAHNBM0+4XkliGG2+O88mFPbPW6vEj45m++UYcJLfvf GVedCheQmoopdCIRa8dljXy7jSvZShml86EnF9izr2GTN/rRXcUByXH0KCivgBsB4Uzr QdjkU5SdEuO/rbpH93CKy0TCK4B7NJSPrZAQxXvWUhgKx3nh8HOp6Kak45lo7tTURcQP SwluXVEUZ+Xp84wNI7Y96Y2j6O54ebixdOmhk66uifTACvzuhqPM725+5VpJJbgQX/ZW SWZMSjwHob+ykeHAWXMitzGi0cXoBQomzkft02AP8KLYRE+vfMZrll/lSGa4yK8rlTge 0P2A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to; bh=2v8YTpQFg3Ic/hGG/m654HKGN1XqvP/xzyTou7OPfFo=; b=SwoELltNrom/Nk2eSeIdj7zMrI42LJATDTJsyKh3WRPRJdEVLl/YFw2stnfEiLct6J JTXZeNUCltumcfQIRPilt+N8FiGbSInHWJlcAw6cn/3RfyTI7BCHFl0MvL6AHEs4jU8m sxD8hNUiTEDYqsw19zFGUSOci+kJ+poyIRcb0GalEHnXFccSkVbZCj0mLWXAa2xASFxS 751CrJGAbGh5wPkRjlHaNju/RgoNA3iTZxjGetoI/lXWSR8Cj3ZVyfkTNOP0NYNbJvDL duMlgysbHxYHPX+2hNDDn3f73sNkuWv6xA5puIILYt5H3gFR25bhEo5P8DAaj2NI5w8v Y7kw== X-Gm-Message-State: AD7BkJKv/RWpCEo6/G+C2i58zEHU+NU60ce68W6DW+Wib2jo8SZtEzlrlOBwu5fhzQZx/VN4b8PT3+M9pjy0aA== MIME-Version: 1.0 X-Received: by 10.31.6.17 with SMTP id 17mr2080594vkg.75.1457595113587; Wed, 09 Mar 2016 23:31:53 -0800 (PST) Received: by 10.176.4.164 with HTTP; Wed, 9 Mar 2016 23:31:53 -0800 (PST) In-Reply-To: References: Date: Thu, 10 Mar 2016 02:31:53 -0500 Message-ID: Subject: "Empty" slots From: Daniel Robinson To: "dev@arrow.apache.org" Content-Type: multipart/alternative; boundary=001a1143f24af8c7c0052daccec7 --001a1143f24af8c7c0052daccec7 Content-Type: text/plain; charset=UTF-8 Thanks, Wes. For hashing, my thought was that you could just hash the null bitmask concatenated with the values array (this should be doable in constant space for ordinary one-pass hash functions). If nulls are zeroed out, this will guarantee that equal primitive arrays hash to the same value. But if the empty slots in the value array are undefined and could have any value, this method would give you false negatives, and you would need to do something more complex and slower. I was also curious about how this is treated in Drill ValueVectors. The example illustration in the introductory explanation at https://drill.apache.org/docs/value-vectors/ uses 0 for nulled values, whatever that's worth. This is a tangent, but I noticed Drill (and thus for the moment Java Arrow: https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L369) uses 0 in the null bitmask to represent null values, while you use the NumPy MaskedArray convention of having 1 represent null values. Is there an advantage? On Wednesday, March 9, 2016, Wes McKinney wrote: > hey Dan, > > You bring up a good point. It's unclear whether this would make > hashing much less complex (if you ignore the null bits when computing > the hash function, then it probably would, especially on repeated > fields. We'd need to do some experiments to see if there are cases > where skipping the null bits would yield bad hashes. I'm not sure) > > For the memory comparison question, I'd like to hear from the Drill > team on their experience with ValueVectors and how they handle the > "null" space in arrays. In the case of list / repeated values, you > *do* need to initialize the null offset slot to the previous value so > that offset computations always work. For example, the list > array > > [[0, 1, 2], null, null, [4, 5, 6]] > > would need to have > > nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0 > offsets: 0 3 3 3 6 > values 0 1 2 4 5 6 > > - Wes > > On Mon, Mar 7, 2016 at 4:08 PM, Daniel Robinson > wrote: > > Am I correct that under the current draft spec, the values in meaningless > > slots (such as slots corresponding to a 1 in an associated null bitmask, > or > > unused slots in a sparse union) are undefined? > > > > If so, might it be worth considering requiring that they be zeroed out? > > This would add some time overhead to array building (for sparse unions, > > memory could be zeroed out when it is allocated; for nullable arrays, it > > may be more efficient to do so when nulls are added), but would remove > most > > of the complexity from hash calculations and equality comparisons. (See > > ARROW-32 and ARROW-38.) For example, comparison of nullable primitive > > arrays could be done with 2 calls to memcmp. As a bonus, it would speed > up > > operations for which null and 0 are equivalent or near-equivalent (such > as > > sum, sort on unsigned integers, ). > > > > Other than the overhead, one potential downside is that it would make it > > impossible to just add a null_bitmask over an existing array without > > copying all of the memory. Perhaps this would best be done with a > > separately defined masking vector, which would not require that all > masked > > values be set to 0 (and would thus require more complex hashing > algorithms). > --001a1143f24af8c7c0052daccec7--