Subject: Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding
To: dev@arrow.apache.org
From: Antoine Pitrou
Date: Mon, 29 Apr 2019 20:53:26 +0200
Message-ID: <9746c6d1-ba81-6287-209a-803c65c98e32@python.org>

Hi Wes,

On 29/04/2019 at 20:10, Wes McKinney wrote:
>
> * Receiving a record batch schema without the dictionaries attached
> (e.g.
> in Arrow Flight), see also experimental patch [2]

Note that this was finally done in a separate PR, and only required
changes in the IPC implementation.

> Here is my proposal to reconcile these issues in C++
>
> * Add a new "synthetic" data type called "variable dictionary" to be
> used alongside the existing "static dictionary" type. An instance of
> VariableDictionaryType (name TBD) will not know what the dictionary
> is, only the data type of the dictionary (e.g. utf8()) and the index
> type (e.g. int32())

Interesting idea. I'm curious to see a PR.

> * Define common abstract API for instances of static vs variable
> dictionary arrays. Mainly this means making
> DictionaryArray::dictionary [3] virtual

I'm not sure this is required, especially if the following is implemented:

> * The _actual_ dictionary values for a particular Array must be stored
> somewhere and lifetime managed. I propose to put these as a single
> entry in ArrayData::child_data [4]. An alternative to this would be to
> modify ArrayData to have a dictionary field that would be unused
> except for encoded datasets

`child_data` is supposed to mirror more or less the order of buffers in
an IPC stream, right? Therefore I would favour a dedicated dictionary
field (also makes fetching the dictionary trivial).

> This proposal does create some ongoing implementation and maintenance
> burden, but to that I would make these points:
>
> * Many algorithms will dispatch from one type to the other (probably
> static dispatching to the variable path), so there will not be a need
> to implement multiple times in most cases

Sounds believable indeed.

Regards

Antoine.
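
To make the two storage options discussed above concrete, here is a minimal
sketch. It uses a simplified stand-in struct, not the real arrow::ArrayData,
and every name in it (SketchArrayData, DictionaryFromChild,
DictionaryFromField, MakeUtf8Dictionary) is hypothetical and chosen only for
illustration. It assumes the "single entry in child_data" convention from the
quoted proposal for option A, and a dedicated member for option B.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for arrow::ArrayData (assumption: NOT the real struct).
struct SketchArrayData {
  std::string type_name;                                     // e.g. "int32" or "utf8"
  std::vector<std::string> values;                           // stand-in for the buffers
  std::vector<std::shared_ptr<SketchArrayData>> child_data;  // nested children
  std::shared_ptr<SketchArrayData> dictionary;               // hypothetical dedicated field
};

// Option A: the dictionary is stored as the single entry of child_data.
// Callers have to know the convention that child 0 is the dictionary.
std::shared_ptr<SketchArrayData> DictionaryFromChild(const SketchArrayData& indices) {
  assert(indices.child_data.size() == 1);
  return indices.child_data[0];
}

// Option B: the dictionary lives in a dedicated field, which is simply
// null for arrays that are not dictionary-encoded.
std::shared_ptr<SketchArrayData> DictionaryFromField(const SketchArrayData& indices) {
  return indices.dictionary;
}

// Hypothetical helper that builds a utf8 dictionary for the sketch.
std::shared_ptr<SketchArrayData> MakeUtf8Dictionary(std::vector<std::string> values) {
  auto dict = std::make_shared<SketchArrayData>();
  dict->type_name = "utf8";
  dict->values = std::move(values);
  return dict;
}
```

With the dedicated field, fetching the dictionary is a plain member access and
child_data stays reserved for genuinely nested children; with option A, every
consumer has to special-case the meaning of child 0 for encoded arrays.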