From: 周宇睿(闻拙)
To: dev@orc.apache.org
Date: Wed, 13 Feb 2019 16:37:59 +0800
Subject: Propose to add EncodedStringVectorBatch to expose string dictionary

Hi All,

Currently the ORC Reader's StringVectorBatch carries no information about how its data is encoded, even though the string dictionary can bring substantial benefits in several situations. I would like to add an EncodedStringVectorBatch to the ORC Reader that exposes the string dictionary (if available) to external consumers.

Exposing the string dictionary would bring the following benefits:

- Enable computation over encoded data. By exposing the dictionary in the vector batch, we would be able to implement the filter operator more efficiently. In our POC, enabling filtering on encoded data gave an 8% end-to-end performance improvement on TPC-H Q1.

- Make data serialization more efficient. Currently, when serializing an ORC vector batch, we have to copy all the strings in a vector even though the string data is already dictionary encoded. Exposing the string dictionary would let the vector batch serializer skip the unnecessary string memcpy, which will greatly improve serialization efficiency.

I opened a Jira at https://jira.apache.org/jira/browse/ORC-469

Any thoughts?

Sincerely
Yurui
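
A minimal sketch of the encoded-data filter idea from the first benefit, assuming a hypothetical batch layout: a dictionary of distinct strings plus one integer index per row. The names SketchEncodedBatch and filterEquals are invented for illustration; they are not ORC's types or API, and the proposal may end up with a different layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded string batch.
// NOT the ORC library's type; it only illustrates the idea.
struct SketchEncodedBatch {
  std::vector<std::string> dictionary;  // each distinct value stored once
  std::vector<int64_t> index;           // per-row position in `dictionary`
};

// Return the row numbers whose value equals `target`.
// String comparison happens once per dictionary entry; the per-row work
// is reduced to an integer lookup.
std::vector<std::size_t> filterEquals(const SketchEncodedBatch& batch,
                                      const std::string& target) {
  // Step 1: evaluate the predicate on every dictionary entry.
  std::vector<char> matches(batch.dictionary.size(), 0);
  for (std::size_t d = 0; d < batch.dictionary.size(); ++d) {
    matches[d] = (batch.dictionary[d] == target) ? 1 : 0;
  }

  // Step 2: select rows by looking up their dictionary index.
  std::vector<std::size_t> selected;
  for (std::size_t row = 0; row < batch.index.size(); ++row) {
    const int64_t entry = batch.index[row];
    assert(entry >= 0 &&
           static_cast<std::size_t>(entry) < batch.dictionary.size());
    if (matches[static_cast<std::size_t>(entry)]) {
      selected.push_back(row);
    }
  }
  return selected;
}
```

With a dictionary of a few distinct values and many rows, the number of string comparisons depends on the dictionary size rather than the row count, which is the kind of saving the filter use case relies on. A serializer could similarly write the dictionary once and then only the indices, instead of copying every row's string.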