harmony-dev mailing list archives

From Egor Pasko <egor.pa...@gmail.com>
Subject Re: [drlvm][jitrino][test] Large contribution of reliability test cases for DRLVM+Dacapo
Date Tue, 02 Dec 2008 14:16:26 GMT
On the 0x50E day of Apache Harmony Aleksey Shipilev wrote:
> Hi, Egor!
> Your thoughts are truly pessimistic, like those of everyone who has
> developed at least one compiler. Of course, there's no silver bullet:
> there's no system where you can press the big red button and it will
> tell you where the bugs are :)
> The whole point of this fuzz testing is:
>  a. Yes, there can be false-positives.
>  b. Yes, there can be plenty of false-positives.
>  c. Somewhere beneath that stack, real issues are hiding.
> The problem is, no matter how we approach automated compiler
> testing, any testing results would bury the real issues under
> roughly the same amount of garbage.
> If you do a random search, you have the whole search space to
> cover: 200 boolean params effectively produce 2^200 possible tuples.
> The point of these results is that they focus on near-optimal
> configurations, so we needn't scratch our heads over configurations
> that lie far from optimal.
> Again, there can be lots of garbage in those tests, but 5,400+ is a
> number I can live with, unlike 2^200. Having this few tests makes it
> possible to actually tackle them, without needing another
> young-looking Universe to run them in. <g>
> But the discussion is really inspiring, thanks! The point of
> contributing those tests was the impression that JIT developers are
> crying out for tests and bugs to fix. Ian Rogers from JikesRVM had
> asked me to contribute the failure reports for JikesRVM, solely for
> testing the deep dark corners of the RVM, so I extrapolated the same
> intention to Harmony. I certainly underestimated the failure rate for
> both RVM and Harmony, and now I have to think about how to get value
> out of that pile of crashed configurations. For now, I have just
> disclosed them to the community without clear thoughts on what to do
> next. Nevertheless, in the background we are all thinking about what
> to do.
> Please don't take offense :) I know perfectly well that the tests
> have to go through human-assisted post-processing, I know there is a
> lot of garbage, and I know there are lots of implications and
> complications around this. I also suspect that this kind of work is
> like running ahead of the train. But anyway, the work is done; it was
> an auxiliary result, so we could just dump it -- but can we make any
> use of it?

offense? why? :) I really enjoy the conversation.

> There's an excellent idea of re-testing those issues in debug mode
> to get a clearer taxonomy of the crashes. Though it's not related to
> my job and thesis anymore, I also have an idea of how to sweep the
> tests and make them more fine-grained: introduce a similarity metric
> and search for the nearest non-failing configuration. Any other ideas?
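Just to make that "nearest non-failing configuration" idea concrete, here is a minimal sketch. Everything in it is made up for illustration: real emconfs have ~200 flags, not 4, and the class and method names are hypothetical. The idea is simply that the flags flipped between a failing configuration and its nearest passing neighbor (by Hamming distance) are the prime suspects for the crash.

```java
import java.util.Arrays;
import java.util.List;

public class NearestPass {
    // Hamming distance between two boolean optpass configurations:
    // the number of flags that differ.
    static int hamming(boolean[] a, boolean[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    // For a failing configuration, find the closest passing one;
    // the flipped flags between the two are the prime suspects.
    static boolean[] nearestPassing(boolean[] failing, List<boolean[]> passing) {
        boolean[] best = null;
        int bestDist = Integer.MAX_VALUE;
        for (boolean[] p : passing) {
            int d = hamming(failing, p);
            if (d < bestDist) { bestDist = d; best = p; }
        }
        return best;
    }

    public static void main(String[] args) {
        boolean[] failing = {true, false, true, false};
        List<boolean[]> passing = Arrays.asList(
            new boolean[]{true, false, false, false},  // distance 1
            new boolean[]{false, true, false, true});  // distance 4
        boolean[] nearest = nearestPassing(failing, passing);
        // Only flag #2 differs, so it is likely implicated in the failure.
        System.out.println(hamming(failing, nearest)); // prints 1
    }
}
```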

Aleksey, is there a combined solution where I push the red button,
which makes the silver bullet fire? :)

Your argument that these are 'near' the most effective configurations
is interesting indeed. And the result is interesting overall. True.
Big respect, etc.

My concern is: is it effective to look through configurations one by
one to find issues in this compiler? The large number of
false-positives really worries me. It seems that without traversing
tens or even hundreds of emconfs by hand it is hard to find anything
valuable. For Jikes the situation might be completely different, so
Jikes is not an argument here :)

However, there is one idea: why are you classifying the
configurations based only on the end-result status? The clusters are
obviously too big. I would also take the configurations themselves as
a parameter for clustering the failures. Yes, you'll need a fair
amount of machine-learning effort to cluster them, but that may pay
off really well. Looking into the trained model (a BDD, say) could
give insight into what the optpass compatibility rules are. Or it
could show that the rules are too complicated (which will also be the
case if you overtrain the model, ha :)
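A crude first cut at clustering by configuration, before any real machine learning, could be a greedy grouping by Hamming distance: failures whose flag vectors differ by only a few flips probably share a root cause. This is only a sketch under that assumption; the names, the 4-flag configs, and the threshold are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FailureClusters {
    // Number of flags that differ between two configurations.
    static int hamming(boolean[] a, boolean[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) d++;
        return d;
    }

    // Greedy clustering: each failing config joins the first cluster
    // whose representative (first member) is within `threshold` flag
    // flips; otherwise it starts a new cluster.
    static List<List<boolean[]>> cluster(List<boolean[]> failures, int threshold) {
        List<List<boolean[]>> clusters = new ArrayList<>();
        for (boolean[] f : failures) {
            List<boolean[]> home = null;
            for (List<boolean[]> c : clusters) {
                if (hamming(c.get(0), f) <= threshold) { home = c; break; }
            }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(f);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<boolean[]> failures = Arrays.asList(
            new boolean[]{true, true, false, false},
            new boolean[]{true, true, true, false},   // 1 flip from the first
            new boolean[]{false, false, true, true}); // far from both
        // The first two collapse into one cluster; the third stands alone.
        System.out.println(cluster(failures, 1).size()); // prints 2
    }
}
```

Triaging two clusters instead of three (or, at scale, dozens instead of 5,400+) is the whole point: one representative per cluster goes to a human.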

The latter idea might look like a purple button, though... :)

Egor Pasko
