mxnet-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Olivier <cjolivie...@gmail.com>
Subject Re: [apache/incubator-mxnet] Failed OpenMP assertion when loading MXNet compiled with DEBUG=1 (#10856)
Date Sat, 07 Dec 2019 17:04:45 GMT
if it is really a problem, then it would be prioritized. all the necessary
info is in that issue (and i already mentioned just yesterday or today on
that ticket) what it was again and it’s like i was talking to no one, as it
has been, simply an immediate revert to “remove the library”.  in the time
wasted on all this, it could have been resolved 100 times over.


I can remove just about every bug from mxnet by turning off ALL of the
features in CMakeLists.txt. no features, no bugs. This is roughly
equivalent to the approach that has been taken so far for 1.5 years, which
is not good engineering practice, ad a suggestion that I am surprised to
see championed by a committer.


Here’s another example:


Not too long ago (maybe 8 months?) there was a crash at shutdown in debug
mode in tcmalloc (gperf version of malloc, which is similar to jemalloc)
with an error message about bad pointer to free() or something like that.  At
the time, I didn’t know what caused it and so I did not block it’s removal.


fast-forward to about two months ago, where I saw the same error in a
different code base. Since it was happening to me, I was in a position to
debug it, so I did and found that a the same small static library was
linked into two different shared objects, and occasionally (depending upon
link order, I presume), a global string variable was created and destroyed
twice, because when linking, both shared object c-runtime init functions
had the same name, so mapped to the same startup routine and global data
address, so when both shared objects initialized, they called the same
address. This caused both a memory leak because the first startup string
memory allocation was discarded by the second call to the constructor and
at shutdown,  an assert in tcmalloc because the same second memory pointer
allocated was freed twice.  When tcmalloc was removed, the assert went away
but the bug, to the best of my knowledge, is still there.  If I knew then
what I know now, I would have asked the bug to be fixed rather than remove
tcmalloc.  Not because of a love for tcmalloc, but because there is
something telling you there is a bug and the bug should be fixed, because
if you just hide the bug (comment out the assert) then it’s likely to cause
other (harder to track down) problems later. So now that bug is probably
still there causing who-knows-what random crashes or undefined behavior.


This is the kind of root causing that should be done and not effectively
commenting out the assert. I believe we should insist on the highest
standards. I understand if a person does CI all day and if they find
something they can do via CI (ie turn off a feature) which makes the
problem go away, then they might feel compelled to champion that option.
Like the saying goes, “When you have a hammer in your hand, everything
looks like a nail”.


But this is not always the best solution for the project. There is a bug,
and it should be fixed because commenting out the assert just hides the bug
from plain view, but the bug remains.  Or sufficient evidence otherwise.


-Chris



On Sat, Dec 7, 2019 at 8:06 AM Leonard Lausen <notifications@github.com>
wrote:

> It appears to me that this issue only occurs when having multiple openmp
> libraries at runtime. I don't understand why we need to support this
> use-case. MKL-DNN works with whatever openmp runtime is provided by the
> compiler [1
> <https://github.com/intel/mkl-dnn/blob/433e086bf5d9e5ccfc9ec0b70322f931b6b1921d/doc/build/build_options.md#openmp>].
> If you think this use-case is important, please give some more reasoning.
> If you convince me I'm happy to help to root-cause it.
>
> Otherwise I suggest to follow the simplistic approach of using the compile
> openmp runtime. If any specific openmp runtime is needed, then we can
> compile with the associated compiler (GCC, LLVM, Intel Compiler).
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/incubator-mxnet/issues/10856?email_source=notifications&email_token=ACVWZ7MIEQL7BQDDEKJT7G3QXPCZXA5CNFSM4E66F4P2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEGGJ5MA#issuecomment-562863792>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ACVWZ7MC2MIHMWK72IDVX4TQXPCZXANCNFSM4E66F4PQ>
> .
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message