From: Owen O'Malley <owen.omalley@gmail.com>
Subject: Re: C++ API seekToRow() performance.
Date: Sun, 2 Jun 2019 11:04:14 -0700
References: <CAOWu5N=deBgjWnfx8jgXYxfHsHPtxRmWFNM-p4PXia+rKCKKzg@mail.gmail.com>
 <23CE4973-40F8-490C-84A3-7E5EFBE22CEE@live.com>
 <CAEokuX-8xLycnYxg01cm2oNWmSo1fr7wuBRWJD2Bw0Ygv2oxQA@mail.gmail.com>
To: dev@orc.apache.org
In-Reply-To: <CAEokuX-8xLycnYxg01cm2oNWmSo1fr7wuBRWJD2Bw0Ygv2oxQA@mail.gmail.com>
Message-Id: <47EE7D5E-AAE2-4FD9-895D-7FEA87A8FA32@gmail.com>
X-Mailer: Apple Mail (2.3445.104.11)


> On Jun 2, 2019, at 5:43 AM, Gang Wu <gangwu@apache.org> wrote:
>
> I can open a JIRA for the issue and port our fix back.

That would be great.

>
> For the last suggestion, we can add the optimization as a writer option if
> anyone is interested.

It does significantly hurt compression to flush the streams every 10k rows.

.. Owen

>
> Gang
>
> On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xndai.git@live.com> wrote:
>
>> Hi Shankar,
>>
>> This is a known issue. As far as I know, there are two issues here -
>>
>> 1. The reader doesn’t use the row group index to skip unnecessary rows.
>> Instead, it reads through every row until the cursor reaches the desired
>> position. [1]
>> 2. We could skip an entire compression block when current offset +
>> decompressed size <= desired offset, but we currently don’t do that.
>> [2]
>>
>> These issues can be fixed. Feel free to open a JIRA.
>>
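
The block-skipping rule in point 2 can be illustrated with a minimal sketch.
It assumes the decompressed size of each block is known without decompressing
it; whether that holds for a given codec is outside this sketch. BlockInfo,
SkipPlan, and planSkip are hypothetical names for illustration and are not
taken from the ORC code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-block metadata; not a type from the ORC code base.
struct BlockInfo {
  uint64_t compressedSize;    // bytes the block occupies on disk
  uint64_t decompressedSize;  // bytes the block expands to (assumed known)
};

// Result of planning a forward skip.
struct SkipPlan {
  size_t blocksSkipped;      // whole blocks passed over without decompressing
  uint64_t compressedBytes;  // on-disk bytes to jump past
  uint64_t remainingToSkip;  // decompressed bytes still to discard afterwards
};

// blocks[0] is assumed to start at decompressed position currentOffset.
SkipPlan planSkip(const std::vector<BlockInfo>& blocks,
                  uint64_t currentOffset, uint64_t desiredOffset) {
  assert(desiredOffset >= currentOffset);  // forward seeks only
  SkipPlan plan{0, 0, 0};
  uint64_t offset = currentOffset;
  for (const BlockInfo& block : blocks) {
    // Rule from point 2: a block may be skipped only if it ends at or
    // before the desired position.
    if (offset + block.decompressedSize > desiredOffset) {
      break;
    }
    offset += block.decompressedSize;
    plan.compressedBytes += block.compressedSize;
    ++plan.blocksSkipped;
  }
  plan.remainingToSkip = desiredOffset - offset;
  return plan;
}
```

With three blocks of 1000 decompressed bytes each and a desired offset of
2500, this plan skips the first two blocks outright and leaves 500 bytes to
discard inside the third, which is the only block that has to be decompressed.
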
>> There’s one more thing we could discuss here. Currently a compression
>> block or an RLE run can span two row groups, which means that even
>> seeking to the beginning of a row group may require decompression and
>> decoding. This might not be desirable where latency matters. In our
>> setup, we modify the writer to close the RLE runs and compression blocks
>> at the end of each row group, so seeking to a row group doesn’t require
>> any decompression. The difference in storage efficiency is barely
>> noticeable (< 1%). I would suggest we make this change in ORC v2. The
>> other benefit is that we could greatly simplify the current row position
>> index design.
>>
>>
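
To make the row-group-aligned seek concrete, here is a minimal sketch that
maps a row number within a stripe to a row group and a count of rows still to
discard inside it. It assumes a fixed row index stride (for example the 10k
rows mentioned at the top of this message). RowLocation, locateRow, and
isRowGroupStart are hypothetical names and not part of the ORC C++ API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type; not part of the ORC C++ API.
struct RowLocation {
  uint64_t rowGroup;    // index of the row group containing the row
  uint64_t rowsToSkip;  // rows to decode and discard inside that group
};

// Map a row number within a stripe to its row group and the remainder.
RowLocation locateRow(uint64_t rowInStripe, uint64_t rowIndexStride) {
  assert(rowIndexStride > 0);
  return RowLocation{rowInStripe / rowIndexStride,
                     rowInStripe % rowIndexStride};
}

// True when the target is the first row of a row group, i.e. the case where
// a boundary-aligned layout would need no decoding of earlier rows.
bool isRowGroupStart(uint64_t rowInStripe, uint64_t rowIndexStride) {
  return locateRow(rowInStripe, rowIndexStride).rowsToSkip == 0;
}
```

With a stride of 10000, row 25000 falls in row group 2 with 5000 rows left to
discard, while row 20000 is a row group start. Under the layout described
above, only the second case avoids decoding any earlier data.
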
>> [1]
>> https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
>> [2]
>> https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
>>
>> On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
>>
>> Hello,
>>
>> We are developing a data store based on ORC files and using the C++ API. We
>> are using min/max statistics from the row index, bloom filters, and our
>> custom partitioning logic to read only the required rows from the ORC
>> files. This implementation relies on the seekToRow() method in the
>> RowReader class to seek to the appropriate row groups and then read the
>> batch. I am noticing that seekToRow() is inefficient and degrades
>> performance, even if just a few row groups have to be read. Some numbers
>> from my testing:
>>
>> Number of rows in ORC file : 30 million
>> File Size : 845 MB (7 stripes)
>> Number of Columns : 16 (tpc-h lineitem table)
>>
>> Sequential read of all rows/all columns : 10 seconds
>> Read only 1% of the row groups using seek (forward direction only) : 1.5 seconds
>> Read only 3% of the row groups using seek (forward direction only) : 12 seconds
>> Read only 4% of the row groups using seek (forward direction only) : 20 seconds
>> Read only 5% of the row groups using seek (forward direction only) : 33 seconds
>>
>>
>> I tried the Java API and implemented the same filtering logic via predicate
>> push down and got good numbers with the same ORC file:
>>
>> Sequential read of all rows/all columns : 18 seconds
>> Match & read 20% of row groups : 7 seconds
>> Match & read 33% of row groups : 11 seconds
>> Match & read 50% of row groups : 13.5 seconds
>>
>> I think the seekToRow() implementation needs to use the row index positions
>> and read only the appropriate stream portions (like the Java API). The
>> current seekToRow() implementation starts over from the beginning of the
>> stripe for each seek. I would like to work on changing the seekToRow()
>> implementation, if this is not actively being worked on right now by
>> anyone. The seek is critical for us as we have multiple feature paths that
>> need to read only portions of the ORC file.
>>
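
A rough way to see why restarting from the beginning of the stripe on every
seek scales badly is to count decoded rows under a simplified cost model. The
sketch below is only that: it ignores I/O, compression block granularity, and
column count, and it is not meant to reproduce the timings quoted above. Both
function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rows decoded if every seek restarts at the first row of the stripe:
// reaching row group g costs g * stride discarded rows plus stride rows read.
uint64_t rowsDecodedWithRestart(const std::vector<uint64_t>& rowGroups,
                                uint64_t stride) {
  assert(stride > 0);
  uint64_t total = 0;
  for (uint64_t g : rowGroups) {
    total += (g + 1) * stride;
  }
  return total;
}

// Rows decoded if each seek jumps straight to the selected row group
// (the behaviour an index-based seek aims for).
uint64_t rowsDecodedWithIndex(const std::vector<uint64_t>& rowGroups,
                              uint64_t stride) {
  assert(stride > 0);
  return static_cast<uint64_t>(rowGroups.size()) * stride;
}
```

For row groups 1, 5, and 9 of a stripe with a stride of 10000 rows, the
restart model decodes 180000 rows while the index-based model decodes 30000.
In the restart model the cost of each seek grows with the position of its
target in the stripe, so the total grows much faster than the number of
selected row groups.
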
>> I am looking for opinions from the community and contributors.
>>
>> Thanks,
>> Shankar