Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 7DAC5200D50 for ; Mon, 4 Dec 2017 12:41:05 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 7C03E160C05; Mon, 4 Dec 2017 11:41:05 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id C390A160BF9 for ; Mon, 4 Dec 2017 12:41:04 +0100 (CET) Received: (qmail 17216 invoked by uid 500); 4 Dec 2017 11:41:03 -0000 Mailing-List: contact dev-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list dev@spark.apache.org Received: (qmail 17057 invoked by uid 99); 4 Dec 2017 11:41:03 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Dec 2017 11:41:03 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 4FD9C1A0ED5 for ; Mon, 4 Dec 2017 11:41:02 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -1.101 X-Spam-Level: X-Spam-Status: No, score=-1.101 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_REPLY=1, KAM_ASCII_DIVIDERS=0.8, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-2.8, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 8X6YvMYRib8w for ; Mon, 4 Dec 2017 11:41:01 +0000 (UTC) Received: from mail-wr0-f180.google.com (mail-wr0-f180.google.com [209.85.128.180]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id EC3F35F2A9 for ; Mon, 4 Dec 2017 11:41:00 +0000 (UTC) Received: by mail-wr0-f180.google.com with SMTP id v22so16906921wrb.0 for ; Mon, 04 Dec 2017 03:41:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=content-transfer-encoding:from:mime-version:subject:date:message-id :references:cc:in-reply-to:to; bh=NI0SwUXtkteMi7+u56zvB70ADTx19piuY8UqWt+EqOo=; b=miLDhJev2tLOrDWLiroCxOpigoYe6/t0uSFij42RSESXigmKm8gNVduD9o0j8aeHTi IfiKShjs+lL9W/XEXC9P5lb5ZYtz68U/zvPKP8upUGCygqilvb53hSSfXaRUtAxis7SA xbSV0B36ZtB6jIYGcCBvdtpp55lWKhMKMCILzUa0ZiQp2WUKUUCkK67WOlHGKq415RLi okAROdIZvYl6Spt6wQw9KuBxsknACd+M66Ty5Dt1CnjOmjwXEVSSOMVuMJeFCpXKDqxJ 2eXxP2xBEi+ioQFw6J8wEWjjebZD9WZEmTRxnhOuPS3RUtztdGwjcMuAKAnPhT2Vcm1+ EA+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:content-transfer-encoding:from:mime-version :subject:date:message-id:references:cc:in-reply-to:to; bh=NI0SwUXtkteMi7+u56zvB70ADTx19piuY8UqWt+EqOo=; b=DAVUetcL1/CtUSBHHRTcKlo/tHVyU8cfhJES0hU6hCG77KzeD+QuPuuMJ8coDjAngN TmWiUflipXxsLKcRaOoRcCVxc3A5c4nIVjPIer2pCbJNT4kpF6UKl6ACuQEaCKJsGtTr Y1soz2YSkJcxQNHu9Z0Kz7laE0Y2dEii20aK3v32VmvubM2AELB/Ovhx4uz3/RcAerTc kXcuMiG4WYROSMZd/89y6g53yDiZJdpEX77FJqxNxUnUJ9CRCOHHwmhgV3jcepBOrlNb eOfYPmMnJ5IzoAAmbJs+5j+75JNxwIHOmJzRs6mLBI9xEJAooXjktBnkkpLmS6HzUE9A uAEA== X-Gm-Message-State: AJaThX6zlI1iOw/0CGRi5z79Norp+Cjz9Jm8HPLH7DuwLQoBvid5oe2y psrFDmK6oi2RlwGyppuQstd5D9BCR8M= X-Google-Smtp-Source: AGs4zMYCSe1wER/pwfeJ/RKzqVWSVYQNBABfmPcG/FN574wtf5ubVrtiwyB8IRYch3rrN6bR2bt9Gw== X-Received: by 10.223.154.244 with SMTP id a107mr13959760wrc.8.1512387659628; Mon, 04 Dec 2017 03:40:59 -0800 (PST) Received: from [10.155.199.172] (x527162ac.dyn.telefonica.de. [82.113.98.172]) by smtp.gmail.com with ESMTPSA id d18sm16797684wrd.54.2017.12.04.03.40.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 04 Dec 2017 03:40:59 -0800 (PST) Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: =?utf-8?Q?J=C3=B6rn_Franke?= Mime-Version: 1.0 (1.0) Subject: Re: Spark Data Frame. PreSorded partitions Date: Mon, 4 Dec 2017 10:55:18 +0100 Message-Id: <5249F2A5-5557-4EFB-9D6E-293FB27D114C@gmail.com> References: <62975827-2987-caec-8fd9-5c97446e648f@gmail.com> <1726cbe4-8280-290d-388a-2160fd5fc223@gmail.com> Cc: dev@spark.apache.org In-Reply-To: To: =?utf-8?B?0J3QuNC60L7Qu9Cw0Lkg0JjQttC40LrQvtCy?= X-Mailer: iPhone Mail (15C114) archived-at: Mon, 04 Dec 2017 11:41:05 -0000 I do not think that the data source api exposes such a thing. You can howeve= r proposes to the data source api 2 to be included. However there are some caveats , because sorted can mean two different thing= s (weak vs strict order). Then, is really a lot of time lost because of sorting? The best thing is to n= ot read data that is not needed at all (see min/max indexes in orc/parquet o= r bloom filters in Orc). What is not read does not need to be sorted. See al= so predicate pushdown. > On 4. Dec 2017, at 07:50, =D0=9D=D0=B8=D0=BA=D0=BE=D0=BB=D0=B0=D0=B9 =D0=98= =D0=B6=D0=B8=D0=BA=D0=BE=D0=B2 wrote: >=20 > Cross-posting from @user. >=20 > Hello, guys! >=20 > I work on implementation of custom DataSource for Spark Data Frame API and= have a question: >=20 > If I have a `SELECT * FROM table1 ORDER BY some_column` query I can sort d= ata inside a partition in my data source. >=20 > Do I have a built-in option to tell spark that data from each partition al= ready sorted? >=20 > It seems that Spark can benefit from usage of already sorted partitions. > By using of distributed merge sort algorithm, for example. >=20 > Does it make sense for you? >=20 >=20 > 28.11.2017 18:42, Michael Artz =D0=BF=D0=B8=D1=88=D0=B5=D1=82: >> I'm not sure other than retrieving from a hive table that is already sort= ed. This sounds cool though, would be interested to know this as well >> On Nov 28, 2017 10:40 AM, "=D0=9D=D0=B8=D0=BA=D0=BE=D0=BB=D0=B0=D0=B9 =D0= =98=D0=B6=D0=B8=D0=BA=D0=BE=D0=B2" > wrote: >> Hello, guys! >> I work on implementation of custom DataSource for Spark Data Frame API= and have a question: >> If I have a `SELECT * FROM table1 ORDER BY some_column` query I can so= rt data inside a partition in my data source. >> Do I have a built-in option to tell spark that data from each partitio= n already sorted? >> It seems that Spark can benefit from usage of already sorted partition= s. >> By using of distributed merge sort algorithm, for example. >> Does it make sense for you? >> --------------------------------------------------------------------- >> To unsubscribe e-mail: user-unsubscribe@spark.apache.org >=20 > --------------------------------------------------------------------- > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org >=20 --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscribe@spark.apache.org