ignite-dev mailing list archives

From Stephen Darlington <stephen.darling...@gridgain.com>
Subject Re: [DISCUSSION] Spark Data Frame through Thin Client
Date Mon, 22 Oct 2018 10:16:40 GMT
Ignite doesn’t currently support Spark Structured Streaming:

https://issues.apache.org/jira/browse/IGNITE-9357

There’s a working patch associated with it.

Regards,
Stephen

> On 22 Oct 2018, at 10:43, Nikolay Izhikov <nizhikov@apache.org> wrote:
> 
> Hello, Stephen.
> 
> I suggest thin client deployment as a second option, alongside the existing
> integration that uses a client node.
> 
>> I’m thinking specifically about better support for Spark Streaming, where
>> the lack of continuous query support in thin clients removes a significant
>> optimisation option.
> 
> It's very interesting.
> Can you share your thoughts?
> What can be improved in Spark integration?
> 
> On Mon, 22/10/2018 at 10:22 +0100, Stephen Darlington wrote:
>> Are you suggesting making the Thin Client deployment an option or as a
>> replacement for the thick-client? If the latter, do we risk making future
>> desirable changes more difficult (or impossible)? I’m thinking specifically
>> about better support for Spark Streaming, where the lack of continuous query
>> support in thin clients removes a significant optimisation option. I’m sure
>> there are other use cases.
>> 
>> Regards,
>> Stephen
>> 
>>> On 21 Oct 2018, at 09:08, Nikolay Izhikov <nizhikov@apache.org> wrote:
>>> 
>>> Valentin.
>>> 
>>> It seems you made several assumptions that are not always true, from my
>>> point of view:
>>> 
>>> 1. "We have access to Spark cluster installation to perform deployment
>>> steps" - this is not true in cloud or enterprise environments.
>>> 
>>> 2. "Spark cluster is used only for Ignite integration".
>>> From what I know, the computational resources of a big Spark cluster are
>>> shared by many business divisions, and it is not convenient to perform
>>> deployment steps on such a cluster.
>>> 
>>> 3. "When Ignite + Spark are used in real production it's OK to have
>>> reasonable deployment overhead"
>>> What about a developer who wants to play with this integration, and wants
>>> to do it quickly to see how it works in real-life examples?
>>> Can we make their life much easier?
>>> 
>>>> First of all, they will exist with thin client as well.
>>> 
>>> Spark has the ability to deploy jars to workers and add them to the
>>> application tasks' classpath.
>>> For 2.6 we must deploy 11 additional jars to start using Ignite.
>>> Please see my example at the bottom of the documentation page [1].
>>> 
>>> Do cache-api-1.0.0.jar and h2-1.4.195.jar seem like obvious dependencies
>>> for Ignite integration to you?
>>> And to our users? :)
>>> 
>>> Actually, the list of dependencies will change in 2.7: a new version of
>>> jcache and a new version of h2.
>>> So users will have to change it in code or perform additional deployment
>>> steps.
>>>
>>> That is overkill for me.
>>> 
>>> On the other hand, the thin client requires only 1 jar.
>>> Moreover, the thin client protocol is backward compatible, so the thin
>>> client will keep working correctly when the Ignite cluster is updated from
>>> 2.6 to 2.7.
>>> So, with Spark integration via the thin client we will be able to update
>>> the Ignite cluster and the Spark integration separately.
>>> For now, we have to do it in one big step.
>>> 
>>> What do you think?
>>> 
>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>> 
>>> On Sat, 20/10/2018 at 18:33 -0700, Valentin Kulichenko wrote:
>>>> Guys,
>>>> 
>>>> From my experience, Ignite and Spark clusters typically run in the same
>>>> environment, which makes a client node the preferable option. Mainly,
>>>> because of performance. BTW, I doubt partition-awareness on thin client
>>>> will help either, because in dataframes we only run SQL queries and I
>>>> believe thin client will execute them through a proxy anyway. But correct
>>>> me if I’m wrong.
>>>> 
>>>> Either way, it sounds like we just have usability issues with Ignite/Spark
>>>> integration. Why don’t we concentrate on fixing them then? For example, #3
>>>> can be fixed by loading XML content on master and then distributing it to
>>>> workers, instead of loading on every worker independently. Then there are
>>>> certain procedures like deploying JARs, etc. First of all, they will exist
>>>> with thin client as well. Second of all, I’m sure there are ways to simplify
>>>> these procedures and make integration easier. My opinion is that working on
>>>> such improvements is going to add more value than another implementation
>>>> based on thin client.
>>>> 
>>>> -Val
>>>> 
>>>> On Sat, Oct 20, 2018 at 4:03 PM Denis Magda <dmagda@apache.org> wrote:
>>>> 
>>>>> Hello Nikolay,
>>>>> 
>>>>> Your proposal sounds reasonable. However, I would suggest we wait until
>>>>> partition-awareness is supported for the Java thin client first. With that
>>>>> feature, the client can connect to any node directly, while presently all
>>>>> the communication goes through a proxy (a node the client is connected to).
>>>>> Going through the proxy is bad for performance.
>>>>> 
>>>>> 
>>>>> Vladimir, how hard would it be to support partition-awareness for the Java
>>>>> client? Probably, Nikolay can take over.
>>>>> 
>>>>> --
>>>>> Denis
>>>>> 
>>>>> 
>>>>> On Sat, Oct 20, 2018 at 2:09 PM Nikolay Izhikov <nizhikov@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Hello, Igniters.
>>>>>> 
>>>>>> Currently, Spark Data Frame integration is implemented via a client node
>>>>>> connection.
>>>>>> Whenever we need to retrieve some data from Ignite into a Spark worker
>>>>>> (or master), we start a client node.
>>>>>> 
>>>>>> It has several major disadvantages:
>>>>>>
>>>>>>       1. We have to copy the whole Ignite distribution onto each Spark
>>>>>> worker [1].
>>>>>>       2. We have to copy the whole Ignite distribution onto the Spark
>>>>>> master to get the catalogue working.
>>>>>>       3. We have to have the same absolute path to the Ignite
>>>>>> configuration file on every worker and provide it during data frame
>>>>>> construction [2].
>>>>>>       4. We have to additionally configure the Spark workers' classpath
>>>>>> to include Ignite libraries.
>>>>>> 
>>>>>> For now, almost all operations we need in the Spark Data Frame
>>>>>> integration are supported by the Java Thin Client:
>>>>>>       * obtain the list of caches.
>>>>>>       * get cache configuration.
>>>>>>       * execute SQL query.
>>>>>>       * stream data to the table - not supported by the thin client for
>>>>>> now, but can be implemented using simple SQL INSERT statements.
>>>>>> 
>>>>>> Advantages of using the Java Thin Client in Spark integration (they all
>>>>>> follow from the known advantages of the Java Thin Client):
>>>>>>       1. Easy to configure: only IP addresses of server nodes are
>>>>>> required.
>>>>>>       2. Easy to deploy: only 1 additional jar required. No server-side
>>>>>> (Ignite worker) configuration required.
>>>>>> 
>>>>>> I propose to implement Spark Data Frame integration through the Java
>>>>>> Thin Client.
>>>>>> 
>>>>>> Thoughts?
>>>>>> 
>>>>>> [1] https://apacheignite-fs.readme.io/docs/installation-deployment
>>>>>> [2] https://apacheignite-fs.readme.io/docs/ignite-data-frame#section-ignite-dataframe-options
>>>>>> 
>> 
>> 
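Denis's message in the thread contrasts two routing modes: today every thin client request goes through the one node the client is connected to (a proxy), whereas with partition-awareness the client could contact the owning node directly. The difference can be illustrated with a toy Java sketch. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the node addresses are invented, and the modulo-based mapping is a stand-in, not Ignite's actual affinity function.

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration only: hypothetical names, not part of the Ignite API.
final class RoutingSketch {
    /** Toy key-to-partition mapping (an assumption, not Ignite's affinity function). */
    static int partitionOf(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Proxy mode: every request goes to the single node the client is connected to. */
    static String routeViaProxy(String connectedNode) {
        return connectedNode;
    }

    /** Partition-aware mode: the client picks the node assumed to own the key's partition. */
    static String routeDirect(Object key, List<String> nodes, int partitions) {
        int partition = partitionOf(key, partitions);
        return nodes.get(partition % nodes.size());
    }

    public static void main(String[] args) {
        // Invented addresses, for illustration only.
        List<String> nodes = Arrays.asList("node-a:10800", "node-b:10800", "node-c:10800");
        for (int key = 0; key < 4; key++) {
            System.out.println("key " + key
                + " proxy=" + routeViaProxy(nodes.get(0))
                + " direct=" + routeDirect(key, nodes, 1024));
        }
    }
}
```

Note that this only models key-based operations. As Valentin points out in the thread, data frames run SQL queries, which may still be executed through a proxy regardless.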


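Nikolay's original proposal notes that streaming data to a table is not supported by the thin client, but could be implemented using simple SQL INSERT statements. A minimal sketch of one way to prepare such a statement for a batch of rows follows. The helper class, its method names, and the `person` table are hypothetical; the sketch only builds the SQL text and the positional arguments, and assumes the target SQL engine accepts a multi-row VALUES list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Ignite: prepares a parameterized multi-row INSERT.
final class InsertBatchSketch {
    /** Builds e.g. "INSERT INTO t (a, b) VALUES (?, ?), (?, ?)" for rowCount rows. */
    static String buildInsert(String table, List<String> columns, int rowCount) {
        if (columns.isEmpty() || rowCount < 1)
            throw new IllegalArgumentException("need at least one column and one row");
        // One "(?, ?, ...)" group per row; identifiers are NOT escaped in this sketch.
        String group = "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES "
            + String.join(", ", Collections.nCopies(rowCount, group));
    }

    /** Flattens rows into the positional argument array matching the placeholders. */
    static Object[] flattenArgs(List<Object[]> rows) {
        List<Object> args = new ArrayList<>();
        for (Object[] row : rows)
            args.addAll(Arrays.asList(row));
        return args.toArray();
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1, "Ann"});
        rows.add(new Object[] {2, "Bob"});
        String sql = buildInsert("person", Arrays.asList("id", "name"), rows.size());
        System.out.println(sql);
        System.out.println(Arrays.toString(flattenArgs(rows)));
    }
}
```

The resulting statement and argument array would then be handed to whatever SQL query entry point the client exposes. Batching several rows per statement is a design choice to reduce round trips; one INSERT per row would also match the proposal as written.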
