Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
From: Jörn Franke <jornfranke@gmail.com>
Mime-Version: 1.0 (1.0)
Subject: Re: Spark Data Frame. PreSorded partitions
Date: Mon, 4 Dec 2017 10:55:18 +0100
Message-Id: <5249F2A5-5557-4EFB-9D6E-293FB27D114C@gmail.com>
References: <62975827-2987-caec-8fd9-5c97446e648f@gmail.com> <1726cbe4-8280-290d-388a-2160fd5fc223@gmail.com> <CA+pG8eOFqQEvWJ5qqO=hkrzq1W2ZUVuPUnQWQ2spe=WetFjz7A@mail.gmail.com> <ab3d0247-8e3f-84b0-94d3-ffd14d0cf02b@gmail.com>
Cc: dev@spark.apache.org
In-Reply-To: <ab3d0247-8e3f-84b0-94d3-ffd14d0cf02b@gmail.com>
To: Николай Ижиков <nizhikov.dev@gmail.com>
archived-at: Mon, 04 Dec 2017 11:41:05 -0000

I do not think that the data source API exposes such a thing. You can,
however, propose that it be included in the Data Source API v2.

However, there are some caveats, because "sorted" can mean two different
things (weak vs. strict order).
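
To illustrate one reading of that distinction, here is a minimal sketch in
plain Python (not Spark code). The helper names `is_weakly_sorted` and
`is_strictly_sorted` and the sample rows are made up for illustration. A
partition with duplicate sort keys is ordered only in the weak sense, so the
relative order of the tied rows is not determined by the key alone.

```python
from operator import itemgetter


def is_weakly_sorted(rows, key):
    """True if the keys never decrease; equal neighbouring keys are allowed."""
    keys = [key(r) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def is_strictly_sorted(rows, key):
    """True if every key is strictly greater than the one before it."""
    keys = [key(r) for r in rows]
    return all(a < b for a, b in zip(keys, keys[1:]))


# Hypothetical partition content: (sort_column, payload).
partition = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
by_sort_column = itemgetter(0)

weak = is_weakly_sorted(partition, by_sort_column)      # ties on key 2 are fine
strict = is_strictly_sorted(partition, by_sort_column)  # ties on key 2 break it

# Using the whole row as a tie-breaker makes the order strict again.
strict_with_tiebreak = is_strictly_sorted(partition, lambda r: r)
```

Any contract of the form "this partition is already sorted" would probably
have to say which of the two guarantees the source actually gives.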

Also, is a lot of time really lost because of sorting? The best thing is to
not read data that is not needed at all (see min/max indexes in ORC/Parquet
or bloom filters in ORC). What is not read does not need to be sorted. See
also predicate pushdown.
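
As a rough sketch of the idea, in plain Python rather than Spark, ORC, or
Parquet code: `Block`, `make_block`, and `scan` are hypothetical names, and
each block stands in for a stripe or row group that stores the min and max of
the filter column. A pushed-down range predicate lets the reader skip every
block whose range cannot match, so only the surviving rows are left to sort.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a stripe / row group with min/max statistics."""
    min_key: int
    max_key: int
    rows: List[int]


def make_block(rows: List[int]) -> Block:
    # The statistics are computed once, when the block is written.
    return Block(min(rows), max(rows), list(rows))


def scan(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the rows with lo <= key <= hi and the number of blocks read."""
    result: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip the block if its [min, max] range cannot overlap [lo, hi].
        if block.max_key < lo or block.min_key > hi:
            continue
        blocks_read += 1
        result.extend(r for r in block.rows if lo <= r <= hi)
    return result, blocks_read


blocks = [
    make_block([5, 1, 9]),
    make_block([42, 17, 30]),
    make_block([77, 60, 95]),
]

# Equivalent in spirit to: WHERE key BETWEEN 15 AND 35 ORDER BY key
rows, blocks_read = scan(blocks, 15, 35)
ordered = sorted(rows)  # only the rows that survived the filter get sorted
```

Here only one of the three blocks is read, and the sort touches two values
instead of nine.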

> On 4. Dec 2017, at 07:50, Николай Ижиков <nizhikov.dev@gmail.com> wrote:
>
> Cross-posting from @user.
>
> Hello, guys!
>
> I am working on an implementation of a custom DataSource for the Spark
> DataFrame API and have a question:
>
> If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can sort
> the data inside each partition in my data source.
>
> Is there a built-in option to tell Spark that the data in each partition
> is already sorted?
>
> It seems that Spark could benefit from already sorted partitions,
> for example by using a distributed merge sort algorithm.
>
> Does that make sense to you?
>
> 28.11.2017 18:42, Michael Artz wrote:
>> I'm not sure, other than retrieving from a Hive table that is already
>> sorted. This sounds cool though; I would be interested to know this as well.
>> On Nov 28, 2017 10:40 AM, "Николай Ижиков" <nizhikov.dev@gmail.com> wrote:
>>    Hello, guys!
>>    I am working on an implementation of a custom DataSource for the Spark
>>    DataFrame API and have a question:
>>    If I have a `SELECT * FROM table1 ORDER BY some_column` query, I can
>>    sort the data inside each partition in my data source.
>>    Is there a built-in option to tell Spark that the data in each
>>    partition is already sorted?
>>    It seems that Spark could benefit from already sorted partitions,
>>    for example by using a distributed merge sort algorithm.
>>    Does that make sense to you?
>>    ---------------------------------------------------------------------
>>    To unsubscribe e-mail: user-unsubscribe@spark.apache.org

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org