From: Prasanth J
To: user@orc.apache.org
Subject: Re: ORC Indexing
Date: Thu, 16 Jul 2015 09:16:13 -0700

A Bloom filter index was recently added to ORC. It is much more accurate at row group elimination than the min/max-based index.
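
To illustrate why a Bloom filter can eliminate row groups that a min/max check cannot, here is a minimal Python sketch. It is not ORC's implementation: the `BloomFilter` class, its sizes, the SHA-256 hashing, and the helper `min_max_may_match` are all assumptions made up for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter (hypothetical; not ORC's on-disk format or hash function)."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def min_max_may_match(minimum, maximum, target):
    """Min/max check: the row group can be skipped only if target is out of range."""
    return minimum <= target <= maximum


# A row group holding only two distinct values that are far apart.
row_group = [1, 1000]
bloom = BloomFilter()
for v in row_group:
    bloom.add(v)

target = 500
range_says_read = min_max_may_match(min(row_group), max(row_group), target)
bloom_says_read = bloom.might_contain(target)
print("min/max says read:", range_says_read)
print("bloom filter says read:", bloom_says_read)
```

The min/max check must keep this row group for an equality lookup of 500 because 500 lies inside the range [1, 1000], whereas the Bloom filter only answers "possibly present" for values whose bits were all set. A Bloom filter can still report false positives, so it never skips a row group that contains the value; it only reduces unnecessary reads.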

Thanks
Prasanth

On Jul 16, 2015, at 9:07 AM, Thomas Abeler <thomas@sensenetworks.com> wrote:

Hey,

 

I have a question about how indexing in ORC works.

 

The way I understand ORC indexing is that ORC keeps statistics (min, max, sum) for every 10,000 rows (by default), and when I query the data it looks at those statistics to figure out whether it needs to read the row chunk or not.
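
The mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ORC's reader code: `RowGroupStats`, `build_stats`, and `groups_to_read` are hypothetical names, and the column is modeled as a plain list of integers.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # the default stride mentioned above


@dataclass
class RowGroupStats:
    start: int    # index of the first row in the group
    count: int    # number of rows in the group
    minimum: int
    maximum: int
    total: int    # sum of the values in the group


def build_stats(values: List[int], stride: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Compute min, max, and sum for each consecutive chunk of `stride` rows."""
    stats = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        stats.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk), sum(chunk)))
    return stats


def groups_to_read(stats: List[RowGroupStats], target: int) -> List[int]:
    """Return start offsets of row groups whose [min, max] range could hold target."""
    return [s.start for s in stats if s.minimum <= target <= s.maximum]


values = list(range(35_000))          # a sorted column, the best case for min/max
stats = build_stats(values)
selected = groups_to_read(stats, 12_345)
print(f"{len(selected)} of {len(stats)} row groups must be read: {selected}")
```

With sorted data only one row group overlaps the lookup value. If the same values were shuffled, most groups would span nearly the full range and very little could be skipped, which is why min/max statistics alone help less for ad-hoc queries on unsorted columns.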

 

If that's true, is it possible to build an index on an ORC file that is more similar to a database index? By that I mean another sorted data structure that holds the field value and a pointer to the record it relates to.
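
For reference, the kind of structure described above (sorted values, each with a pointer back to its record) could look like the following Python sketch. It is an in-memory illustration only and says nothing about what ORC supports; `build_index` and `lookup` are hypothetical names, and the "pointer" is simply the row number.

```python
import bisect


def build_index(column):
    """Return a list of (value, row_id) pairs sorted by value, then by row_id."""
    return sorted((value, row_id) for row_id, value in enumerate(column))


def lookup(index, target):
    """Binary-search the sorted index and return the row ids holding target."""
    lo = bisect.bisect_left(index, (target, -1))
    hi = bisect.bisect_right(index, (target, float("inf")))
    return [row_id for _, row_id in index[lo:hi]]


column = ["b", "a", "c", "a"]
index = build_index(column)
rows = lookup(index, "a")
print(rows)
```

Each lookup costs a binary search rather than a scan, but the index stores one entry per row per indexed column, so indexing every column of a very large dataset multiplies the storage and maintenance cost accordingly.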

 

The problem I have is that I have a huge dataset: >300 TB and 69 columns. There is no 'key' column that gets frequently queried, and I would like to perform ad-hoc queries on nearly every one of these columns. I think building an index on every column would be a good approach to get this ability.

 

Regards,

Thomas

