Mailing-List: contact dev-help@ignite.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@ignite.apache.org
MIME-Version: 1.0
In-Reply-To: <CABuYRcpbq7p8ouFcC3Hg7htF0g4zJOztKk8ciDQXhkuiwmdUmg@mail.gmail.com>
References: <CA+0=VoW7ctibZ+8qYde+0r3EW_9BDXxj8zwupSnKBcV2WSy=BQ@mail.gmail.com>
 <F891E31F-3537-4266-ACA7-7C04E2D01C18@gmail.com> <CA+0=VoUeDAu939fRusKQfZw95uyMXzLMBgxfDMoLgkD4DoAw9w@mail.gmail.com>
 <62B3CBCA-7217-4FD1-9C24-E46493D542CB@gmail.com> <CA+0=VoVPYmxzt5md94cW42-Cu8GexF+yQR23V0pYffisXzdz3g@mail.gmail.com>
 <293EC953-9A98-44F6-A224-855BC58DC4BA@gmail.com> <CA+0=VoUJs9LngjtfofTysv6Z5Go4aeG_a=Zy+_R8HNCPQ=DHLA@mail.gmail.com>
 <CABuYRcpbq7p8ouFcC3Hg7htF0g4zJOztKk8ciDQXhkuiwmdUmg@mail.gmail.com>
From: Dmitriy Setrakyan <d@gridgain.com>
Date: Sat, 5 Aug 2017 00:41:23 +0200
Message-ID: <CA+0=VoVKH5f5W8j9JNAOhNVGG1PwWSyWzHGyhn63bzxaDNLWWA@mail.gmail.com>
Subject: Re: Spark Data Frame support in Ignite
To: dev@ignite.apache.org
Content-Type: multipart/alternative; boundary="f403045e262cc0ae810555f5349c"
archived-at: Fri, 04 Aug 2017 22:42:11 -0000

--f403045e262cc0ae810555f5349c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Thu, Aug 3, 2017 at 9:04 PM, Valentin Kulichenko <
valentin.kulichenko@gmail.com> wrote:

> This JDBC integration is just a Spark data source, which means that Spark
> will fetch data in its local memory first, and only then apply filters,
> aggregations, etc. This is obviously slow and doesn't use all advantages
> Ignite provides.
>
> To create a useful and valuable integration, we should create a custom
> Strategy that will convert Spark's logical plan into a SQL query and
> execute it directly on Ignite.
>

I get it, but we have been talking about Data Frame support for longer than
a year. I think we should advise our users to switch to JDBC until the
community gets someone to implement it.
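
For anyone who wants to try the JDBC route in the meantime, here is a rough
PySpark sketch. It assumes the Ignite thin JDBC driver
(org.apache.ignite.IgniteJdbcThinDriver) is on the Spark classpath and that
the node listens on the default thin-client port 10800. The host, the table
name, and the helper names are placeholders I made up for illustration.

```python
def ignite_jdbc_options(host, table, port=10800, fetch_size=1024):
    """Build options for Spark's generic JDBC data source, pointed at Ignite.

    Assumptions: thin JDBC driver, default port 10800; host and table are
    placeholders supplied by the caller.
    """
    return {
        "url": f"jdbc:ignite:thin://{host}:{port}",
        "driver": "org.apache.ignite.IgniteJdbcThinDriver",
        "dbtable": table,
        # Rows fetched per round trip; Spark passes this to the JDBC driver.
        "fetchsize": str(fetch_size),
    }


def load_ignite_table(spark, host, table):
    """Return a DataFrame over an Ignite SQL table (needs a live SparkSession)."""
    options = ignite_jdbc_options(host, table)
    return spark.read.format("jdbc").options(**options).load()


if __name__ == "__main__":
    # Only prints the options; no Spark or Ignite cluster is required for this.
    print(ignite_jdbc_options("127.0.0.1", "PERSON"))
```

Val's caveat above still applies: this goes through the generic JDBC data
source, so it is a stopgap, not the native integration.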


>
> -Val
>
> On Thu, Aug 3, 2017 at 12:12 AM, Dmitriy Setrakyan <dsetrakyan@apache.org>
> wrote:
>
> > On Thu, Aug 3, 2017 at 9:04 AM, Jörn Franke <jornfranke@gmail.com>
> > wrote:
> >
> > > I think the development effort would still be higher. Everything would
> > > have to be put into Ignite via JDBC, then checkpointing would have to be
> > > done via JDBC (again additional development effort), with a lot of
> > > conversion from Spark's internal format to JDBC and back to Ignite's
> > > internal format. I do not see pagination as a useful feature for managing
> > > large data volumes from databases. On the contrary, it is very inefficient
> > > (and one would have to implement logic to fetch all pages). Pagination was
> > > also never intended for fetching large data volumes, but for web pages
> > > showing a small result set over several pages, where the user can manually
> > > click through to the next page (which they do not do most of the time
> > > anyway).
> > >
> > > While it might be a quick solution, I think a deeper integration than
> > > JDBC would be more beneficial.
> > >
> >
> > Jorn, I completely agree. However, we have not been able to find a
> > contributor for this feature. You sound like you have sufficient domain
> > expertise in Spark and Ignite. Would you be willing to help out?
> >
> >
> > > > On 3. Aug 2017, at 08:57, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > > > wrote:
> > > >
> > > >> On Thu, Aug 3, 2017 at 8:45 AM, Jörn Franke <jornfranke@gmail.com>
> > > >> wrote:
> > > >>
> > > >> I think the JDBC one is less efficient, slower, and requires too much
> > > >> development effort. You can also check the integration of Alluxio
> > > >> with Spark.
> > > >>
> > > >
> > > > As far as I know, Alluxio is a file system, so it cannot use JDBC.
> > > > Ignite, on the other hand, is an SQL system and works well with JDBC.
> > > > As far as the development effort goes, we are dealing with SQL, so I
> > > > am not sure why JDBC would be harder.
> > > >
> > > > Generally speaking, until Ignite provides native data frame
> > > > integration, having JDBC-based integration out of the box is minimally
> > > > acceptable.
> > > >
> > > >
> > > >> Then, in general, I think JDBC was never designed for large data
> > > >> volumes. It is for executing queries and getting a small or aggregated
> > > >> result set back, or alternatively for inserting / updating single rows.
> > > >>
> > > >
> > > > Agree in general. However, Ignite JDBC is designed to work with larger
> > > > data volumes and supports data pagination automatically.
> > > >
> > > >
> > > >>> On 3. Aug 2017, at 08:17, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > > >>> wrote:
> > > >>>
> > > >>> Jorn, thanks for your feedback!
> > > >>>
> > > >>> Can you explain how the direct support would be different from the
> > > >>> JDBC support?
> > > >>>
> > > >>> Thanks,
> > > >>> D.
> > > >>>
> > > >>>> On Thu, Aug 3, 2017 at 7:40 AM, Jörn Franke <jornfranke@gmail.com>
> > > >>>> wrote:
> > > >>>>
> > > >>>> These are two different things. Spark applications themselves do not
> > > >>>> use JDBC - it is more for non-Spark applications to access Spark
> > > >>>> DataFrames.
> > > >>>>
> > > >>>> Direct support by Ignite would make more sense. Although in theory you
> > > >>>> have IGFS, that applies only if the user is using HDFS, which might
> > > >>>> not be the case. It is now also very common to use object stores, such
> > > >>>> as S3. Direct support could be leveraged for interactive analysis or
> > > >>>> for different Spark applications sharing data.
> > > >>>>
> > > >>>>> On 3. Aug 2017, at 05:12, Dmitriy Setrakyan <dsetrakyan@apache.org>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>> Igniters,
> > > >>>>>
> > > >>>>> We have had the integration with Spark Data Frames on our roadmap
> > > >>>>> for a while:
> > > >>>>> https://issues.apache.org/jira/browse/IGNITE-3084
> > > >>>>>
> > > >>>>> However, while browsing Spark documentation, I came across the
> > > >>>>> generic JDBC data frame support in Spark:
> > > >>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
> > > >>>>>
> > > >>>>> Given that Ignite has a JDBC driver, does it mean that it
> > > >>>>> transitively also supports Spark data frames? If yes, we should
> > > >>>>> document it.
> > > >>>>>
> > > >>>>> D.
> > > >>>>
> > > >>
> > >
> >
>

--f403045e262cc0ae810555f5349c--