ignite-dev mailing list archives

From Nikolay Izhikov <nizhi...@apache.org>
Subject Re: [DISCUSSION] Spark Data Frame through Thin Client
Date Sun, 21 Oct 2018 08:08:25 GMT
Valentin.

It seems you made several assumptions that are not always true, from my point of view:

1. "We have access to Spark cluster installation to perform deployment steps" - this is not
true in a cloud or enterprise environment.

2. "Spark cluster is used only for Ignite integration".
From what I know, the computational resources of a big Spark cluster are shared among many
business divisions, so it is not convenient to perform deployment steps on such a cluster.

3. "When Ignite + Spark are used in real production it's OK to have reasonable deployment
overhead"
What about a developer who wants to play with this integration,
and wants to do it quickly to see how it works on real-life examples?
Can we make their life much easier?

> First of all, they will exist with thin client either.

Spark is able to deploy jars to the workers and add them to the application tasks' classpath.
For 2.6, we must deploy 11 additional jars to start using Ignite.
Please see my example at the bottom of the documentation page [1].

Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies of the Ignite integration
to you?
And to our users? :)

Actually, the list of dependencies will change in 2.7: a new version of jcache and a new version
of h2.
So the user will have to change it in code or perform additional deployment steps.

That is overkill, in my opinion.

On the other hand, the thin client requires only 1 jar.
Moreover, the thin client protocol is backward compatible,
so the thin client will keep working correctly when the Ignite cluster is updated from 2.6 to 2.7.
With Spark integration via the thin client, we will therefore be able to update the Ignite cluster
and the Spark integration separately.
Currently, we have to do both in one big step.
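
For the streaming case from the original proposal quoted below (not supported by the thin client for now, but doable with simple SQL INSERT statements), here is a minimal sketch of how such a statement could be generated per table. The class and method names (InsertSqlSketch, buildInsert) and the Person table are hypothetical and not part of Ignite; the sketch only builds the SQL text and assumes the integration would then pass it, together with the row values, to the thin client's SQL query call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Ignite: builds a parameterized INSERT
// statement that could be executed once per row (or per batch of rows).
final class InsertSqlSketch {
    static String buildInsert(String table, List<String> columns) {
        if (columns.isEmpty())
            throw new IllegalArgumentException("At least one column is required");

        // Column list, e.g. "id, name".
        String cols = String.join(", ", columns);

        // One "?" placeholder per column, e.g. "?, ?".
        String params = String.join(", ", Collections.nCopies(columns.size(), "?"));

        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
    }

    public static void main(String[] args) {
        // "Person", "id" and "name" are illustrative names only.
        System.out.println(buildInsert("Person", Arrays.asList("id", "name")));
    }
}
```

Using placeholders instead of inlining values keeps the statement text identical for every row of a table, so only the arguments change between executions.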

What do you think?

[1] https://apacheignite-fs.readme.io/docs/installation-deployment

On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
> Guys,
> 
> From my experience, Ignite and Spark clusters typically run in the same
> environment, which makes client node a more preferable option. Mainly,
> because of performance. BTW, I doubt partition-awareness on thin client
> will help either, because in dataframes we only run SQL queries and I
> believe thin client will execute them through a proxy anyway. But correct
> me if I’m wrong.
> 
> Either way, it sounds like we just have usability issues with Ignite/Spark
> integration. Why don’t we concentrate on fixing them then? For example, #3
> can be fixed by loading XML content on master and then distributing it to
> workers, instead of loading on every worker independently. Then there are
> certain procedures like deploying JARs, etc. First of all, they will exist
> with thin client either. Second of all, I’m sure there are ways to simplify
> these procedures and make integration easier. My opinion is that working on
> such improvements is going to add more value than another implementation
> based on thin client.
> 
> -Val
> 
> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dmagda@apache.org> wrote:
> 
> > Hello Nikolay,
> > 
> > Your proposal sounds reasonable. However, I would suggest that we wait until
> > partition-awareness is supported for the Java thin client first. With that
> > feature, the client can connect to any node directly, while presently all
> > the communication goes through a proxy (the node the client is connected to),
> > which is bad for performance.
> > 
> > 
> > Vladimir, how hard would it be to support the partition-awareness for Java
> > client? Probably, Nikolay can take over.
> > 
> > --
> > Denis
> > 
> > 
> > On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhikov@apache.org>
> > wrote:
> > 
> > > Hello, Igniters.
> > > 
> > > Currently, the Spark Data Frame integration is implemented via a client node
> > > connection.
> > > Whenever we need to retrieve some data from Ignite into a Spark worker (or
> > > master), we start a client node.
> > > 
> > > It has several major disadvantages:
> > > 
> > >         1. We have to copy the whole Ignite distribution onto each Spark
> > > worker [1]
> > >         2. We have to copy the whole Ignite distribution onto the Spark master
> > > to get the catalogue working.
> > >         3. The Ignite configuration file must be at the same absolute path
> > > on every worker, and that path must be provided during data frame construction [2]
> > >         4. We have to additionally configure the Spark workers' classpath to
> > > include the Ignite libraries.
> > > 
> > > For now, almost all operations we need in the Spark Data Frame
> > > integration are supported by the Java Thin Client:
> > >         * obtain the list of caches.
> > >         * get cache configuration.
> > >         * execute SQL query.
> > >         * stream data to the table - not supported by the thin client for
> > > now, but can be implemented using simple SQL INSERT statements.
> > > 
> > > Advantages of using the Java Thin Client in the Spark integration (they all
> > > follow from the Java Thin Client's own advantages):
> > >         1. Easy to configure: only IP addresses of server nodes are
> > > required.
> > >         2. Easy to deploy: only 1 additional jar required. No server
> > > side (Ignite worker) configuration required.
> > > 
> > > I propose to implement Spark Data Frame integration through Java Thin
> > > Client.
> > > 
> > > Thoughts?
> > > 
> > > [1] https://apacheignite-fs.readme.io/docs/installation-deployment
> > > [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
