On 2015-02-16 11:19, Jérôme Gardou wrote:
>> However I don't see a practical way to implement your solution. For one, the timeout seems like it would have to cause a machine reboot, which significantly increases test time. I guess rosautotest could keep track of the time and kill the child process, but I don't know how reliable that would be.
> I think the whole point here is "test hung" vs "ReactOS hung". If time is the issue here, one could imagine the test bot communicating with rosautotest over a socket, or polling COM2 for activity. If rosautotest is dead, that means something bad is happening --> reboot. This would be way faster than waiting x seconds for debug spam to happen. I don't see why rosautotest keeping track of its child process would be unreliable. As far as I know, even the Wine tests do this all the time.
Yeah, it's probably fine to have rosautotest check for the timeout. We could even make it output something that indicates this to Testman. I don't know whether it will be able to reliably kill all child processes, but I guess sysreg's timeout would still be around for the ones it can't. A second communication channel might be overkill... I mean, things work pretty okay as-is. :p
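To make the timeout-and-kill part concrete, here's a rough sketch of what I have in mind (illustrative only, not actual rosautotest code -- TEST_TIMEOUT_MS and the "[TIMEOUT]" marker are made up for the example). Running the test inside a job object would let us terminate the whole process tree, not just the direct child:

/* Sketch: run a test with a timeout, killing the whole process tree on expiry. */
#include <windows.h>
#include <stdio.h>

#define TEST_TIMEOUT_MS (60 * 1000)   /* hypothetical limit, not a real rosautotest value */

int RunTestWithTimeout(WCHAR *CommandLine)
{
    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    HANDLE hJob;
    int Status = -1;

    hJob = CreateJobObjectW(NULL, NULL);
    if (!hJob)
        return -1;

    /* Create the process suspended, so it's inside the job before it gets
       a chance to spawn any children of its own */
    if (CreateProcessW(NULL, CommandLine, NULL, NULL, FALSE,
                       CREATE_SUSPENDED, NULL, NULL, &si, &pi))
    {
        AssignProcessToJobObject(hJob, pi.hProcess);
        ResumeThread(pi.hThread);

        if (WaitForSingleObject(pi.hProcess, TEST_TIMEOUT_MS) == WAIT_TIMEOUT)
        {
            /* Kill the test and everything it started, and leave a marker
               that Testman could pick up */
            TerminateJobObject(hJob, 1);
            printf("[TIMEOUT] Test exceeded %lu ms\n", (unsigned long)TEST_TIMEOUT_MS);
            Status = 1;
        }
        else
        {
            Status = 0;
        }
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }

    CloseHandle(hJob);
    return Status;
}

And sysreg's own timeout would stay in place as the backstop for anything the job object can't catch.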
>> Secondly and more importantly, killing the test on timeout seems worse than skipping the offending part, because then any tests running _after_ the code causing the timeout would never be executed. That means instead of skipping a test that's known to be problematic, we skip completely unrelated and innocent tests.
> Well, now I'm confused about the reason why you skipped the tests. I thought this was because the debug spam (or the wait call in the advapi32:service case) took too long, and the test bot then considered the run timed out and hence rebooted the VM.
Right now most tests either get skipped because they hang (i.e. they will cause a sysreg timeout and thus a reboot), or because they crash (and thus make the results of the tests that did/would run successfully invisible in Testman). advapi32:service is kind of its own category: it's designed to take two minutes -- and I simply don't think a "does the service manager correctly time out after 30 seconds" test is worth losing that much time on every single test run. I'm happy to personally retest this every 2-3 months and do a proper regression test if we do find a problem with it -- I just think a regression there is extremely unlikely, and the time savings are worth orders of magnitude more.
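For reference, the check I'm talking about would look roughly like this in Wine-test style (a sketch, not the actual advapi32:service code -- it assumes a service "dummy" is already registered whose binary never calls StartServiceCtrlDispatcher, so the SCM has to give up on it):

#include <windows.h>
#include "wine/test.h"

static void test_scm_start_timeout(void)
{
    SC_HANDLE scm, service;
    DWORD start, elapsed;
    BOOL ret;

    scm = OpenSCManagerA(NULL, NULL, SC_MANAGER_CONNECT);
    ok(scm != NULL, "OpenSCManager failed: %lu\n", GetLastError());

    service = OpenServiceA(scm, "dummy", SERVICE_START);
    ok(service != NULL, "OpenService failed: %lu\n", GetLastError());

    /* StartService blocks while the SCM waits for the service to report in,
       so this one call is what makes the test take so long */
    start = GetTickCount();
    ret = StartServiceA(service, 0, NULL);
    elapsed = GetTickCount() - start;

    ok(!ret, "StartService unexpectedly succeeded\n");
    ok(GetLastError() == ERROR_SERVICE_REQUEST_TIMEOUT,
       "got error %lu\n", GetLastError());
    /* The exact acceptable window here is an assumption for the sketch */
    ok(elapsed >= 25000 && elapsed <= 60000,
       "SCM gave up after %lu ms\n", elapsed);

    CloseServiceHandle(service);
    CloseServiceHandle(scm);
}

Useful to verify once, just not something we need to pay for on every single run.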
>> I'm not sure if there's a way to win this. Let me know if you have any ideas. As it stands I think skipping the offenders is an okay solution -- we just need to stay very aware of them and make sure to re-test more regularly than we do now.
> Well, as long as we skip them because "they take too long" and not because of valid, identified bugs, we will lose. We (as in ReactOS and Wine) add more and more tests each release. If we don't find a long-term solution, we're doomed to chase ghosts.
I agree, a test taking very long is usually a bug in ROS and should thus stay visible. advapi32:service is one of the (I don't think very many) exceptions to this IMO, because it will take two minutes on Windows as well. Giannis and I were talking about this stuff and he calculated some statistics... IIRC it looks like the remaining tests that take more than ~20 seconds are actually slow due to issues in ROS and should be fast on Windows. So I don't think we'll want/need to give more tests the "takes too long, skip" treatment.
I guess the best way to solve this issue for the most part is fixing the bugs that cause these tests to have problems. That may sound naive, but given how little we've usually focused on this, seeing that we've quartered (or so) the number of test hangs/crashes, significantly reduced the number of skipped tests, and halved the average test run time just within the last few days makes me pretty optimistic about that.