From: Owen O'Malley
Subject: Re: C++ API seekToRow() performance.
Date: Sun, 2 Jun 2019 11:04:14 -0700
To: dev@orc.apache.org

> On Jun 2, 2019, at 5:43 AM, Gang Wu wrote:
>
> I can open a JIRA for the issue and port our fix back.

That would be great.

> For the last suggestion, we can add the optimization as a writer option if
> anyone is interested.

It does significantly hurt compression to flush the streams every 10k rows.

.. Owen

> Gang
>
> On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai wrote:
>
>> Hi Shankar,
>>
>> This is a known issue. As far as I know, there are two issues here -
>>
>> 1. The reader doesn’t use the row group index to skip unnecessary rows.
>> Instead it reads through every row until the cursor moves to the desired
>> position. [1]
>> 2. We could skip the entire compression block when current offset +
>> decompressed size <= desired offset. But we are currently not doing that.
>> [2]
>>
>> These issues can be fixed. Feel free to open a JIRA.
>>
>> There’s one more thing we could discuss here. Currently the compression
>> block and RLE run can span across two row groups, which means even
>> seeking to the beginning of a row group will possibly require
>> decompression and decoding. This might not be desirable in cases where
>> latency is sensitive.
>> In our setup, we modify the writer to close the RLE
>> runs and compression blocks at the end of each row group. So seeking to a
>> row group doesn’t require any decompression. The difference in terms of
>> storage efficiency is barely noticeable (< 1%). I would suggest we make
>> this change in ORC v2. The other benefit is that we could greatly simplify
>> the current row position index design.
>>
>> [1]
>> https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
>> [2]
>> https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
>>
>> On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
>>
>> Hello,
>>
>> We are developing a data store based on ORC files and using the C++ API. We
>> are using min/max statistics from the row index, bloom filters and our
>> custom partitioning stuff to read only the required rows from the ORC
>> files. This implementation relies on the seekToRow() method in the
>> RowReader class to seek to the appropriate row groups and then read the batch.
>> I am noticing that seekToRow() is not efficient and degrades
>> performance, even if just a few row groups have to be read.
>> Some numbers
>> from my testing:
>>
>> Number of rows in ORC file : 30 million
>> File Size : 845 MB (7 stripes)
>> Number of Columns : 16 (TPC-H lineitem table)
>>
>> Sequential read of all rows/all columns : 10 seconds
>> Read only 1% of the row groups using seek (forward direction only) : 1.5 seconds
>> Read only 3% of the row groups using seek (forward direction only) : 12 seconds
>> Read only 4% of the row groups using seek (forward direction only) : 20 seconds
>> Read only 5% of the row groups using seek (forward direction only) : 33 seconds
>>
>> I tried the Java API and implemented the same filtering logic via predicate
>> push down and got good numbers with the same ORC file:
>>
>> Sequential read of all rows/all columns : 18 seconds
>> Match & read 20% of row groups : 7 seconds
>> Match & read 33% of row groups : 11 seconds
>> Match & read 50% of row groups : 13.5 seconds
>>
>> I think the seekToRow() implementation needs to use the row index positions
>> and read only the appropriate stream portions (like the Java API). The
>> current seekToRow() implementation starts over from the beginning of the
>> stripe for each seek. I would like to work on changing the seekToRow()
>> implementation, if this is not actively being worked on right now by
>> anyone. The seek is critical for us as we have multiple feature paths that
>> need to read only portions of the ORC file.
>>
>> I am looking for opinions from the community and contributors.
>>
>> Thanks,
>> Shankar
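
For illustration only, here is a minimal C++ sketch of the arithmetic behind the two fixes discussed in this thread: mapping a target row to its row group, and skipping whole compression blocks while current offset + decompressed size <= desired offset. It is not the ORC reader code. The names RowGroupTarget, BlockSkip, locateRowGroup and skipWholeBlocks are hypothetical, and the sketch assumes the decompressed size of each block is known without decompressing it, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result of mapping a target row onto the row index.
struct RowGroupTarget {
  uint64_t rowGroup;    // index of the row group that contains the row
  uint64_t rowsToSkip;  // rows still to be decoded and discarded in that group
};

// Map a stripe-relative row number to its row group.
// rowIndexStride is the number of rows per row group (10k in the thread).
inline RowGroupTarget locateRowGroup(uint64_t rowInStripe,
                                     uint64_t rowIndexStride) {
  assert(rowIndexStride > 0);
  return RowGroupTarget{rowInStripe / rowIndexStride,
                        rowInStripe % rowIndexStride};
}

// Hypothetical result of skipping compression blocks.
struct BlockSkip {
  std::size_t blocksSkipped;  // whole blocks passed over without decompressing
  uint64_t offsetReached;     // decompressed offset where the next block starts
};

// Skip every block that ends at or before desiredOffset, i.e. while
// currentOffset + decompressedSize <= desiredOffset.
// Assumption: decompressedSizes is available up front.
inline BlockSkip skipWholeBlocks(const std::vector<uint64_t>& decompressedSizes,
                                 uint64_t currentOffset,
                                 uint64_t desiredOffset) {
  std::size_t skipped = 0;
  for (uint64_t size : decompressedSizes) {
    if (currentOffset + size > desiredOffset) {
      break;  // the target lies inside this block, so it must be decompressed
    }
    currentOffset += size;
    ++skipped;
  }
  return BlockSkip{skipped, currentOffset};
}
```

With a stride of 10,000 rows, locateRowGroup(25000, 10000) lands in row group 2 with 5,000 rows left to skip. Those leftover rows are the only ones that would still need decoding if the reader started from the row group's recorded position instead of from the beginning of the stripe.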