reef-dev mailing list archives

From Mariia Mykhailova <mamyk...@microsoft.com.INVALID>
Subject Re: 0.16?
Date Fri, 17 Feb 2017 17:18:47 GMT
For the high availability feature, the fixes that allow running the DriverRestart example on our YARN
test clusters are in master. However, to the best of my knowledge, nobody has tried to
use HA in production code or real-life scenarios yet.


For transient test failures in CI, there are two issues on the Java side (REEF-1668 and REEF-1729)
and a whole bunch of issues on the .NET side (umbrella REEF-1462). The ones on the .NET side can't
be reproduced locally, so you have to set up an instance of AppVeyor for your fork of the REEF
repository, as described in https://github.com/apache/reef/blob/master/lang/cs/BUILD.md


-Mariia

________________________________
From: Saikat Kanjilal <sxk1969@gmail.com>
Sent: Thursday, February 16, 2017 8:07:40 PM
To: dev@reef.apache.org
Subject: Re: 0.16?

Sergiy,
I definitely have more experience with Java than .NET. Maybe this is a JIRA that I can also add
to my collection to help you; it might be a good case for pair coding as well. Let me know how
you want to move forward.
Thanks

Sent from my iPad

> On Feb 16, 2017, at 6:23 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
>
> Hi Saikat,
>
> The cleanup work is purely Java, so if you are working on the .NET side of
> things, I don't see much sense to switch the environment just for these
> issues. Still, it would be nice to get some help - maybe there are
> volunteers willing to debug some race conditions in Java and on YARN?
>
> Thank you,
> Sergiy.
>
>> On Thu, Feb 16, 2017 at 6:11 PM, Saikat Kanjilal <sxk1969@gmail.com> wrote:
>>
>> Me and my big mouth :))))), just kidding. I am already working on the .NET
>> Core 2.0 conversion JIRAs; what sort of dev/test help can I provide?
>>
>> Sent from my iPhone
>>
>>> On Feb 16, 2017, at 5:41 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
>>>
>>> Hi Saikat,
>>>
>>> The failures are sporadic and most likely are due to some race conditions
>>> during the cleanup process. You don't need CI to replicate them, but we
>>> need to debug the issues not only in local mode, but also on YARN (and,
>>> ideally, for all other runtimes that we provide). A good indicator of
>>> successful cleanup would be JIRA issue
>>> https://issues.apache.org/jira/browse/REEF-1715 - when all threads are
>>> closed properly, we would no longer need System.exit() call at the end of
>>> the Driver or Evaluator processes (regardless of the runtime). Would you be
>>> interested in helping me with that part?
>>>
>>> Thank you,
>>> Sergiy.
>>>
>>>
>>>> On Thu, Feb 16, 2017 at 5:29 PM, Saikat Kanjilal <sxk1969@gmail.com> wrote:
>>>>
>>>> Out of curiosity, have we been able to replicate these failures locally? I
>>>> am wondering whether there's a need to have a local version of Travis CI
>>>> set up.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On Feb 16, 2017, at 5:22 PM, Boris Shulman <shulmanb@gmail.com> wrote:
>>>>>
>>>>> Is AM HA part of 0.16?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Feb 16, 2017, at 12:22 PM, Sergiy Matusevych <sergiy.matusevych@gmail.com> wrote:
>>>>>>
>>>>>> Hi Markus,
>>>>>>
>>>>>> I think we can safely announce that Unmanaged AM and REEF-on-REEF will be
>>>>>> part of 0.16, but the bugs that Mariia mentions prevent us from calling
>>>>>> this release REEF-as-a-Library. Even for the unmanaged AM, I need some time
>>>>>> (likely till the end of this sprint) to make sure that Unmanaged AM works
>>>>>> properly on Hadoop 2.7.3 and above.
>>>>>>
>>>>>> Thanks,
>>>>>> Sergiy.
>>>>>>
>>>>>> On Thu, Feb 16, 2017 at 9:43 AM, Mariia Mykhailova <mamykhai@microsoft.com.invalid> wrote:
>>>>>>
>>>>>>> There are several transient test failures in both Java and .NET tests and
>>>>>>> Travis CI job timeout (which indicates hidden problems in terminating Java
>>>>>>> REEF jobs) which we've introduced since 0.15. I don't think we should do a
>>>>>>> release with these issues uninvestigated, especially Travis timeout. For
>>>>>>> now I've marked them as blocking REEF-1444.
>>>>>>>
>>>>>>> -Mariia
>>>>>>>
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: Markus Weimer <markus@weimo.de>
>>>>>>> Sent: Thursday, February 16, 2017 9:31:35 AM
>>>>>>> To: REEF Developers Mailinglist
>>>>>>> Subject: 0.16?
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> where are we in the process of releasing 0.16? In other words: If we called
>>>>>>> the release today, what amazing feature that is on the cusp of getting in
>>>>>>> would we lose?
>>>>>>>
>>>>>>> I'm not suggesting to literally do it today, but a release around the VS
>>>>>>> 2017 availability would be convenient for us to switch to the new build
>>>>>>> system and all early in the works towards 0.17.
>>>>>>>
>>>>>>> Markus
>>>>>>>
