On 2015-02-16 11:19, Jérôme Gardou wrote:
>> However I don't see a practical way to implement your solution. For
>> one, the timeout seems like it would have to cause a machine reboot,
>> which significantly increases test time. I guess rosautotest could
>> keep track of the time and kill the child process, but I don't know
>> how reliable that would be.
> I think the whole point here is "test hung" vs "ReactOS hung". If
> time is the matter here, one could imagine that the test bot
> communicates with rosautotest over a socket, or polls COM2 for
> activity. If rosautotest is dead, this means that something bad is
> happening --> reboot. This would be way faster than waiting x seconds
> for debug spam to happen.
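
Purely for illustration, here is a rough sketch of what such a
heartbeat could look like on the guest side. None of this exists in
rosautotest today; the port name, the 5-second interval and the
one-byte payload are all made up:

#include <windows.h>

// Hypothetical heartbeat thread: write a byte to the virtual serial
// port every few seconds. If the host-side poller (e.g. sysreg
// watching the other end of COM2) stops seeing traffic, it can assume
// the run is dead and reboot the VM right away.
static DWORD WINAPI HeartbeatThread(LPVOID)
{
    HANDLE hPort = CreateFileW(L"\\\\.\\COM2", GENERIC_WRITE, 0, NULL,
                               OPEN_EXISTING, 0, NULL);
    if (hPort == INVALID_HANDLE_VALUE)
        return 1;

    for (;;)
    {
        const char beat = '.';
        DWORD cbWritten;
        if (!WriteFile(hPort, &beat, 1, &cbWritten, NULL))
            break;
        Sleep(5000);  // made-up interval
    }

    CloseHandle(hPort);
    return 0;
}

int main()
{
    HANDLE hThread = CreateThread(NULL, 0, HeartbeatThread, NULL, 0,
                                  NULL);
    // ... run the tests as usual; the heartbeat keeps ticking ...
    WaitForSingleObject(hThread, INFINITE);
    return 0;
}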
> I don't see why rosautotest keeping track of its child process would
> be unreliable. As far as I know, even the Wine tests do this all the
> time.
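
For reference, that tracking is just a few Win32 calls -- a minimal
sketch of the wait-with-timeout pattern, with an invented command line
and a made-up 60-second limit:

#include <windows.h>

int main()
{
    // Hypothetical test binary; rosautotest would build the command
    // line from the requested module/test names.
    WCHAR cmdLine[] = L"advapi32_apitest.exe service";
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;

    if (!CreateProcessW(NULL, cmdLine, NULL, NULL, FALSE, 0, NULL,
                        NULL, &si, &pi))
        return 1;

    // Wait up to 60 seconds (made-up limit), then pull the plug.
    if (WaitForSingleObject(pi.hProcess, 60 * 1000) == WAIT_TIMEOUT)
        TerminateProcess(pi.hProcess, 1);

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return 0;
}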
Yeah, it's probably fine to have rosautotest check for a timeout. We
could even make it output a marker that indicates this to Testman.
I don't know whether it will be able to reliably kill all child
processes, but I guess sysreg's timeout would still be around anyway
for the ones it can't.
A second communication channel might be overkill... I mean, things
work pretty okay as-is. :p
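
FWIW, if we ever do want the "kill all child processes" part to be
reliable, the standard Win32 trick is a job object with the
kill-on-close limit: terminating the job takes the whole process tree
with it, grandchildren included. A sketch -- the test name and the
[TIMEOUT] marker are made up, not anything Testman understands today:

#include <windows.h>
#include <stdio.h>

int main()
{
    // Kill every process in the job when the job is terminated or the
    // last handle to it is closed.
    HANDLE hJob = CreateJobObjectW(NULL, NULL);
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
    limits.BasicLimitInformation.LimitFlags =
        JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(hJob, JobObjectExtendedLimitInformation,
                            &limits, sizeof(limits));

    WCHAR cmdLine[] = L"some_apitest.exe";  // hypothetical test binary
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;

    // Start suspended so the process is inside the job before it gets
    // a chance to spawn children outside of it.
    if (!CreateProcessW(NULL, cmdLine, NULL, NULL, FALSE,
                        CREATE_SUSPENDED, NULL, NULL, &si, &pi))
        return 1;
    AssignProcessToJobObject(hJob, pi.hProcess);
    ResumeThread(pi.hThread);

    if (WaitForSingleObject(pi.hProcess, 60 * 1000) == WAIT_TIMEOUT)
    {
        // Hypothetical marker Testman could learn to recognize.
        printf("some_apitest:subtest [TIMEOUT]\n");
        TerminateJobObject(hJob, 1);  // kills the whole tree
    }

    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    CloseHandle(hJob);
    return 0;
}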
>> Secondly and more importantly, killing the test on timeout seems
>> worse than skipping the offending part, because then any tests
>> running _after_ the code causing the timeout would never be
>> executed. That means instead of skipping a test that's known to be
>> problematic, we skip completely unrelated and innocent tests.
> Well, now I'm confused regarding the reason why you skipped the
> tests. I thought this was because the debug spam (or the wait call in
> the advapi32:service case) took too long, and the test bot then
> considered it timed out and hence rebooted the VM.
Right now, most tests get skipped either because they hang (i.e. they
cause a sysreg timeout and thus a reboot) or because they crash (and
thus hide the results of the tests that did or would have run
successfully from Testman).
advapi32:service is kind of its own category: it's designed to take two
minutes -- and I simply don't think a "does the service manager
correctly time out after 30 seconds" test is worth losing that much
time on every single test run. I'm happy to personally retest this
every 2-3 months and do a proper regression test if we do find a
problem with it -- I just think it's extremely unlikely in this case
and the time savings are worth orders of magnitude more.
>> I'm not sure if there's a way to win this. Let me know if you have
>> any ideas. As it stands I think skipping the offenders is an okay
>> solution -- we just need to stay very aware of them and make sure to
>> re-test more regularly than we do now.
> Well, as long as we skip them because "they take too long" and not
> because of valid, identified bugs, we will lose. We (as in ReactOS
> and Wine) add more and more tests each release. If we don't find a
> long-term solution, we're doomed to chase ghosts.
I agree -- a test taking very long usually means a bug in ROS and
should thus stay visible.
advapi32:service is one of the (I don't think very many) exceptions to
this IMO, because it will take two minutes on Windows as well.
Giannis and I were talking about this stuff and he calculated some
statistics... IIRC, the remaining tests that take more than ~20
seconds are slow due to actual issues in ROS and should be fast on
Windows. So I don't think we'll want or need to give more tests the
"takes too long, skip" treatment.
I guess the best way to solve this issue for the most part is fixing
the bugs that cause these tests to have problems. That may sound
naive, but given how little we've usually focused on this, the fact
that within just the last few days we quartered (or so) the number of
test hangs/crashes, significantly reduced the number of skipped tests,
and halved the average test run time makes me pretty optimistic about
that.