Subject: Re: Issue bloom filters with orc?
From: Prasanth J
Date: Wed, 17 Aug 2016 15:13:46 -0700
To: dev@orc.apache.org

I can confirm that ORC-54 fixes the issue.

I ran the test case initially provided by Aaron, and I am getting the expected test results.

Total Batches Added [977]
Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = leaf-0].
Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), leaf-1 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-1)].
Total Batches Read [90] with columnNames [[a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a1 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
Total Batches Read [90] with columnNames [[a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
Total Batches Read [90] with columnNames [[a1, a2]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].
Total Batches Read [90] with columnNames [[a2, a1]] for sarg [leaf-0 = (EQUALS a2 91760645-5296-83b7-1fcd-955395a8db38), expr = (and leaf-0 leaf-0)].

Thanks
Prasanth

> On Aug 17, 2016, at 3:09 PM, Owen O'Malley wrote:
>
> This issue might have been fixed as part of ORC-54, which got committed
> this morning. Do you have a testcase already?
>
> .. Owen
>
> On Mon, Aug 15, 2016 at 1:08 PM, Aaron McCurry wrote:
>
>> I have been writing some test code that creates a simple orc writer and
>> reader with bloom filters enabled. The issue I have is that when the
>> SearchArgument matches the first column name provided in the Options
>> searchArgument method (
>> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
>> the bloom filter doesn't seem to get applied.
>>
>> The test program creates an orc file with 2 string columns. Then it
>> populates the orc file with 1 million records with the same UUID in both
>> columns, but different values for each row. Then it performs a series of
>> reads on the file, counts the number of batches read, and displays the
>> output.
>>
>> Test program:
>> https://gist.github.com/amccurry/a25a9dad1e657da5f4a1d8aec5e49118
>>
>> NOTE: I'm assuming the column names passed to the searchArgument method (
>> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/Reader.java#L197)
>> tell the orc reader which indexes it should read to perform the search
>> operations.
>>
>> High Level Output:
>>
>> where a1 == literal
>> colNames : ["a1"] reads 977 batches
>> colNames : ["a1", "a2"] reads 977 batches
>> colNames : ["a2", "a1"] reads 90 batches
>>
>> where a2 == literal
>> colNames : ["a2"] reads 977 batches
>> colNames : ["a1", "a2"] reads 90 batches
>> colNames : ["a2", "a1"] reads 977 batches
>>
>> where a1 == literal AND where a2 == literal
>> colNames : ["a1", "a2"] reads 90 batches
>> colNames : ["a2", "a1"] reads 90 batches
>>
>> where a1 == literal AND where a1 == literal
>> colNames : ["a1"] reads 977 batches
>> colNames : ["a1", "a2"] reads 977 batches
>> colNames : ["a2", "a1"] reads 90 batches
>>
>> where a2 == literal AND where a2 == literal
>> colNames : ["a2"] reads 977 batches
>> colNames : ["a1", "a2"] reads 90 batches
>> colNames : ["a2", "a1"] reads 977 batches
>>
>> Given that every row has the same value in both columns a1 and a2, I would
>> assume that every one of these test runs would yield the same number of
>> batches read, which should be 90.
>>
>> Raw Output:
>> https://gist.github.com/amccurry/962744f35b19bd013ec48c9bcbfb15e4
>>
>> I think the issue comes from the mapSargColumnsToOrcInternalColIdx method,
>> where the rootColumn value is hard coded to '0':
>> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
>>
>> The mapSargColumnsToOrcInternalColIdx method checks each provided column
>> against the columns in the orc schema.
>> During this it calls findColumns (
>> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L104)
>> where, if the column name matches one of the values in the columnNames
>> array, the sum of the index and rootColumn is returned.
>>
>> Then, when mapSargColumnsToOrcInternalColIdx returns, the caller checks each
>> value in the filterColumns array to make sure its value is greater than
>> '0'. If the column is the first column and the rootColumn is '0',
>> then the returned value is '0' and the logical column filter gets omitted.
>>
>> I think the rootColumn literal should be '1' instead of '0' (
>> https://github.com/apache/orc/blob/rel/release-1.1.2/java/core/src/java/org/apache/orc/impl/RecordReaderImpl.java#L713
>> ).
>>
>> Thoughts?
>>
>> Thanks,
>>
>> Aaron
>>
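
For readers of the archive, the index arithmetic Aaron describes can be modelled in a few lines. This is only a simplified sketch based on the description in the thread: it is not the ORC source and not the ORC-54 change. The class name SargColumnSketch and the methods findColumn and markSargColumns are hypothetical, and the assumption that slot 0 stands for the root struct is made here for illustration.

```java
import java.util.Arrays;

// Hypothetical model of the lookup described in the thread; not ORC code.
public class SargColumnSketch {

    // Returns the position of `name` in `columnNames` plus `rootColumn`,
    // or -1 when the name is not present.
    static int findColumn(String[] columnNames, String name, int rootColumn) {
        for (int i = 0; i < columnNames.length; i++) {
            if (columnNames[i].equals(name)) {
                return i + rootColumn;
            }
        }
        return -1;
    }

    // Marks the column ids that would get index/bloom filter handling.
    // Ids that are not greater than 0 are skipped, as described in the thread.
    // Slot 0 is assumed to stand for the root struct.
    static boolean[] markSargColumns(String[] columnNames,
                                     String[] sargColumns,
                                     int rootColumn) {
        boolean[] marked = new boolean[columnNames.length + 1];
        for (String name : sargColumns) {
            int id = findColumn(columnNames, name, rootColumn);
            if (id > 0) {
                marked[id] = true;
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        String[] names = {"a1", "a2"};
        String[] sarg = {"a1"};
        // rootColumn = 0: "a1" maps to 0 and is skipped by the "> 0" check.
        System.out.println(Arrays.toString(markSargColumns(names, sarg, 0)));
        // rootColumn = 1: "a1" maps to 1 and is kept.
        System.out.println(Arrays.toString(markSargColumns(names, sarg, 1)));
    }
}
```

In this model, a predicate on the first listed column is dropped when rootColumn is 0 and kept when it is 1, which is consistent with the 977 versus 90 batch counts reported above for the first column name.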