Mailing-List: contact user-help@orc.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@orc.apache.org
From: Prasanth J <j.prasanth.j@gmail.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_0037954B-56B8-44E0-AA53-2658007DBA28"
Message-Id: <D954A5DE-B2A2-4527-99C3-B66473CA306E@gmail.com>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2102\))
Subject: Re: ORC Indexing
Date: Thu, 16 Jul 2015 09:16:13 -0700
References: 
 <CABmSMcd2mmwE=F7jipf=MK2FkHNXHY3Y9k9tZZi2bahEmnKvnA@mail.gmail.com>
To: user@orc.apache.org
In-Reply-To: 
 <CABmSMcd2mmwE=F7jipf=MK2FkHNXHY3Y9k9tZZi2bahEmnKvnA@mail.gmail.com>


--Apple-Mail=_0037954B-56B8-44E0-AA53-2658007DBA28
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

A bloom filter index was recently added to ORC. It is much more accurate
at row group elimination than the min/max based index.
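
To make the difference concrete, here is a minimal Python sketch of the
idea. It is not ORC's actual implementation: the names TinyBloom,
build_index and groups_to_read are made up for illustration, the filter
sizes are arbitrary, and the row group size is shrunk to 4 rows (the
default mentioned in this thread is 10,000).

```python
import hashlib

ROWS_PER_GROUP = 4  # illustrative only; far smaller than a real row group


class TinyBloom:
    """Toy bloom filter (hypothetical helper, not ORC's implementation)."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python int

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(values, rows_per_group=ROWS_PER_GROUP):
    """Build per-row-group min/max statistics plus a bloom filter."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        bloom = TinyBloom()
        for v in group:
            bloom.add(v)
        index.append({"min": min(group), "max": max(group), "bloom": bloom})
    return index


def groups_to_read(index, target, use_bloom):
    """Return the row groups that cannot be skipped for `col == target`."""
    selected = []
    for i, entry in enumerate(index):
        if not (entry["min"] <= target <= entry["max"]):
            continue  # eliminated by min/max statistics
        if use_bloom and not entry["bloom"].might_contain(target):
            continue  # eliminated by the bloom filter
        selected.append(i)
    return selected


# Unsorted column: every row group spans nearly the whole value range.
column = [1, 100, 50, 7, 2, 99, 51, 8, 3, 98, 52, 9]
index = build_index(column)
minmax_only = groups_to_read(index, 51, use_bloom=False)
with_bloom = groups_to_read(index, 51, use_bloom=True)
```

On unsorted data like this, min/max alone cannot eliminate any row
group, because 51 falls inside every group's range. The bloom filter
never drops the group that really contains 51 and will usually drop the
others, subject to its false-positive rate. Note that this only helps
equality-style lookups, not range predicates.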

Thanks
Prasanth

> On Jul 16, 2015, at 9:07 AM, Thomas Abeler <thomas@sensenetworks.com> =
wrote:
>=20
> Hey,
>=20
> =20
>=20
> I have a question about how indexing in ORC works.
>=20
> =20
>=20
> As I understand it, ORC keeps statistics (min, max, sum) for every
> 10,000 rows (by default), and when I query the data it looks at those
> statistics to decide whether it needs to read the row chunk or not.
>=20
> =20
>=20
> If that's true, is it possible to build an index on an ORC file that is
> more similar to a database index? By that I mean another sorted data
> structure that holds the field value and a pointer to the record it
> relates to.
>=20
> =20
>=20
> The problem is that I have a huge dataset: more than 300 TB and 69
> columns. There is no 'key' column that gets queried frequently, and I
> would like to run ad-hoc queries on nearly every one of these columns.
> I think building an index on every column would be a good approach to
> get this ability.
>=20
> =20
>=20
> Regards,
>=20
> Thomas
>=20


--Apple-Mail=_0037954B-56B8-44E0-AA53-2658007DBA28--