spark-user mailing list archives

From Vikram Kone <vikramk...@gmail.com>
Subject Re: Spark job workflow engine recommendations
Date Wed, 07 Oct 2015 17:18:40 GMT
Hien,
I saw this pull request, and from what I understand it is geared towards
running Spark jobs over Hadoop. We are running Spark over Cassandra and I'm
not sure whether this new job type supports that. I haven't seen any
documentation on how to use this Spark job plugin that would let me test it
out on our cluster.
We currently submit our Spark jobs using the command job type, with a
command such as "dse spark-submit --class com.org.classname ./test.jar".
What would be the advantage of using the native Spark job type over the
command job type?

I couldn't tell from your reply whether Azkaban already supports long-running
jobs like Spark Streaming. Does it? Streaming jobs generally need to run
indefinitely and need to be restarted if they fail for some reason (lack of
resources, perhaps). I can probably use the auto retry feature for this, but
I'm not sure.
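
For what it's worth, here is a minimal sketch of the restart behaviour I have
in mind, written as a plain Python wrapper rather than anything
Azkaban-specific. The function name run_with_restarts, its parameters and the
backoff value are my own assumptions for illustration; it simply reruns the
command whenever it exits with a non-zero code and gives up after a fixed
number of restarts.

```python
import subprocess
import time


def run_with_restarts(cmd, max_restarts=5, backoff_seconds=30,
                      runner=subprocess.call, sleep=time.sleep):
    """Run cmd, restarting it whenever it exits with a non-zero code.

    Returns the number of restarts performed once cmd exits cleanly.
    Raises RuntimeError when max_restarts is exhausted.
    runner and sleep are injectable so the loop can be tested without
    launching a real process (an assumption of this sketch).
    """
    restarts = 0
    while True:
        exit_code = runner(cmd)
        if exit_code == 0:
            # A clean exit means the job was stopped on purpose.
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                "giving up after %d restarts (last exit code %d)"
                % (restarts, exit_code))
        restarts += 1
        # Wait before resubmitting, e.g. to let cluster resources free up.
        sleep(backoff_seconds)


# Example call (hypothetical, reusing the command from the job above):
# run_with_restarts(
#     ["dse", "spark-submit", "--class", "com.org.classname", "./test.jar"])
```

Whether Azkaban's auto retry gives the same behaviour for a job that is
meant to run indefinitely is exactly the part I'm unsure about.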

I'm looking forward to the multiple executor support, which should greatly
improve scalability.

On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu <hluu@linkedin.com> wrote:

> The spark job type was added recently - see this pull request
> https://github.com/azkaban/azkaban-plugins/pull/195.  You can leverage
> the SLA feature to kill a job if it ran longer than expected.
>
> BTW, we just solved the scalability issue by supporting multiple
> executors.  Within a week or two, the code for that should be merged in the
> main trunk.
>
> Hien
>
> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone <vikramkone@gmail.com> wrote:
>
>> Does Azkaban support scheduling long-running jobs like Spark Streaming
>> jobs? Will Azkaban kill a job if it's running for a long time?
>>
>>
>> On Friday, August 7, 2015, Vikram Kone <vikramkone@gmail.com> wrote:
>>
>>> Hien,
>>> Is Azkaban being phased out at LinkedIn as rumored? If so, what's
>>> LinkedIn going to use for workflow scheduling? Is there something else
>>> that's going to replace Azkaban?
>>>
>>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>>> In my opinion, choosing some particular project among its peers should
>>>> leave enough room for future growth (which may come faster than you
>>>> initially think).
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <hluu@linkedin.com> wrote:
>>>>
>>>>> Scalability is a known issue due to the current architecture.
>>>>> However, this only applies if you run more than 20K jobs per day.
>>>>>
>>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>
>>>>>> From what I heard (from an ex-coworker who is an Oozie committer),
>>>>>> Azkaban is being phased out at LinkedIn because of scalability issues
>>>>>> (though UI-wise, Azkaban seems better).
>>>>>>
>>>>>> Vikram:
>>>>>> I suggest you do more research in related projects (maybe using their
>>>>>> mailing lists).
>>>>>>
>>>>>> Disclaimer: I don't work for LinkedIn.
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath <
>>>>>> nick.pentreath@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vikram,
>>>>>>>
>>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We
>>>>>>> just use local mode deployment and it is fairly easy to set up. It
>>>>>>> is pretty easy to use and has a nice scheduling and logging
>>>>>>> interface, as well as SLAs (like kill job and notify if it doesn't
>>>>>>> complete in 3 hours or whatever).
>>>>>>>
>>>>>>> However Spark support is not present directly - we run everything
>>>>>>> with shell scripts and spark-submit. There is a plugin interface
>>>>>>> where one could create a Spark plugin, but I found it very
>>>>>>> cumbersome when I did investigate and didn't have the time to work
>>>>>>> through it to develop that.
>>>>>>>
>>>>>>> It has some quirks and while there is actually a REST API for
>>>>>>> adding jobs and dynamically scheduling jobs, it is not documented
>>>>>>> anywhere so you kinda have to figure it out for yourself. But in
>>>>>>> terms of ease of use I found it way better than Oozie. I haven't
>>>>>>> tried Chronos, and it seemed quite involved to set up. Haven't
>>>>>>> tried Luigi either.
>>>>>>>
>>>>>>> Spark job server is good but as you say lacks some stuff like
>>>>>>> scheduling and DAG type workflows (independent of spark-defined
>>>>>>> job flows).
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 7, 2015 at 7:00 PM, Jörn Franke <jornfranke@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Also check out Falcon in combination with Oozie
>>>>>>>>
>>>>>>>> On Fri, Aug 7, 2015 at 5:51 PM, Hien Luu <hluu@linkedin.com.invalid>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Looks like Oozie can satisfy most of your requirements.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 7, 2015 at 8:43 AM, Vikram Kone <vikramkone@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I'm looking for open source workflow tools/engines that allow us
>>>>>>>>>> to schedule Spark jobs on a DataStax Cassandra cluster. Since
>>>>>>>>>> there are tonnes of alternatives out there like Oozie, Azkaban,
>>>>>>>>>> Luigi, Chronos etc., I wanted to check with people here to see
>>>>>>>>>> what they are using today.
>>>>>>>>>>
>>>>>>>>>> Some of the requirements of the workflow engine that I'm looking
>>>>>>>>>> for are:
>>>>>>>>>>
>>>>>>>>>> 1. First-class support for submitting Spark jobs on Cassandra.
>>>>>>>>>> Not some wrapper Java code to submit tasks.
>>>>>>>>>> 2. Active open source community support and well tested at
>>>>>>>>>> production scale.
>>>>>>>>>> 3. Should be dead easy to write job dependencies using XML or a
>>>>>>>>>> web interface, e.g. Job A depends on Job B and Job C, so run Job
>>>>>>>>>> A after B and C are finished. Shouldn't need to write full-blown
>>>>>>>>>> Java applications to specify job parameters and dependencies.
>>>>>>>>>> Should be very simple to use.
>>>>>>>>>> 4. Time-based recurrent scheduling. Run the Spark jobs at a
>>>>>>>>>> given time every hour or day or week or month.
>>>>>>>>>> 5. Job monitoring, alerting on failures and email notifications
>>>>>>>>>> on a daily basis.
>>>>>>>>>>
>>>>>>>>>> I have looked at Ooyala's Spark job server, which seems to be
>>>>>>>>>> headed towards making Spark jobs run faster by sharing contexts
>>>>>>>>>> between the jobs but isn't a full-blown workflow engine per se.
>>>>>>>>>> A combination of Spark job server and a workflow engine would be
>>>>>>>>>> ideal.
>>>>>>>>>>
>>>>>>>>>> Thanks for the inputs
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
