please report *nature* of failure in tape testing

filbo
Posts: 428
Joined: Fri Jun 20, 2014 10:06 am

please report *nature* of failure in tape testing

Post by filbo »

I'm running tape regression tests on 4.2.0.2 and seeing good evidence of improvements...

Request for the tape testing code: currently it reports 'solved' vs 'NOT SOLVED' on a per-tape basis and 'SOLVED' vs 'FAILED' in the levelset summary.

I'd like it to distinguish between not solved because player died vs not solved because tape ran out. In fact there are actually 3 failure states of interest:

1. tape ran out while player was still alive -- this simply indicates an incomplete run by the person making the tape

2. tape ran out exactly when player died (possibly +/- a second or so) -- basically also an incomplete run; the person died and didn't see a way forward, saved the tape as 'best progress despite overall failure'

3. player died significantly before tape ended -- this indicates some sort of engine / playback regression. Unless the tape is a fake construction, the fact that it was recorded past this point indicates that some past version of RnD (or other BD-style game with compatible recording facility) behaved differently in a manner which allowed the player to continue past this point

Now, you could just report cases 1 & 2 as 'solved'; that would make the testing facility entirely into an engine regression facility. But I understand it's also a tape-library checker, answering the question 'does this set of tapes successfully provide a solution to each level?'. For that, one wants to count all of 1-2-3 as failures.

I'm not sure there's any benefit at all to separating cases 1 & 2; but since they can clearly be distinguished both in the testing code and by the human interpreting the results, I also don't see any harm in showing them as different. And any future benefit that someone might think of would then be available.
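To make the three cases concrete, here is a minimal sketch of the decision logic. It is not the engine's tape testing code: the names `TapeResult`, `classify_tape`, `death_frame`, `tape_end_frame` and `tolerance_frames` are all hypothetical, and the default tolerance of 50 frames is only meant to stand in for "+/- a second or so" under the assumption of roughly 50 frames per second.

```python
from enum import Enum
from typing import Optional


class TapeResult(Enum):
    SOLVED = "SOLVED"
    INCOMPLETE = "INCOMPLETE"      # case 1: tape ran out, player still alive
    DEFEATED = "DEFEATED"          # case 2: player died at (or near) tape end
    INCOMPATIBLE = "INCOMPATIBLE"  # case 3: player died well before tape end


def classify_tape(solved: bool,
                  death_frame: Optional[int],
                  tape_end_frame: int,
                  tolerance_frames: int = 50) -> TapeResult:
    """Classify one tape replay (hypothetical helper, not engine code).

    death_frame is None if the player was still alive when the tape ended.
    tolerance_frames is an assumed stand-in for "+/- a second or so".
    """
    if solved:
        return TapeResult.SOLVED
    if death_frame is None:
        return TapeResult.INCOMPLETE
    if tape_end_frame - death_frame <= tolerance_frames:
        return TapeResult.DEFEATED
    return TapeResult.INCOMPATIBLE
```

For a levelset-as-solution-library check, everything except `SOLVED` counts as a failure; for a pure engine regression check, only `INCOMPATIBLE` would.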

In practice this might look like:

Level 011 [02:40]: (00:00.226 / 0.14 %) - NOT SOLVED INCOMPLETE (player alive at tape end).
Level 012 [02:40]: (00:00.226 / 0.14 %) - NOT SOLVED DEFEATED (player dies at tape end).
Level 013 [02:40]: (00:00.226 / 0.14 %) - NOT SOLVED INCOMPATIBLE (player dies before tape end).

LEVELDIR [WARN] 'jue_puzznic', SOLVED 10/13 (76%), FAILED: 011 012 013, INCOMPLETE: 011, DEFEATED: 012, INCOMPATIBLE: 013
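A rough sketch of how such a summary line could be assembled from per-level results. The function name `format_levelset_summary` and the shape of the `results` dictionary are made up for illustration, and the `OK` tag for a fully solved set is a placeholder guess; only the `WARN` line format is taken from the example above.

```python
from typing import Dict


def format_levelset_summary(levelset: str, results: Dict[int, str]) -> str:
    """Build a one-line levelset summary (hypothetical helper).

    results maps level number to "SOLVED", "INCOMPLETE", "DEFEATED"
    or "INCOMPATIBLE".
    """
    total = len(results)
    solved = sum(1 for state in results.values() if state == "SOLVED")
    percent = solved * 100 // total if total else 0  # truncated, not rounded
    failed = sorted(n for n, state in results.items() if state != "SOLVED")

    status = "WARN" if failed else "OK"  # "OK" is an assumed placeholder tag
    line = f"LEVELDIR [{status}] '{levelset}', SOLVED {solved}/{total} ({percent}%)"
    if failed:
        line += ", FAILED: " + " ".join(f"{n:03d}" for n in failed)
        # List each failure kind separately, in a fixed order.
        for kind in ("INCOMPLETE", "DEFEATED", "INCOMPATIBLE"):
            levels = [n for n in failed if results[n] == kind]
            if levels:
                line += f", {kind}: " + " ".join(f"{n:03d}" for n in levels)
    return line
```

Keeping the combined `FAILED:` list alongside the per-kind lists means existing scripts that only look for `FAILED:` would keep working.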

... it occurs to me that there might be a rare 4th class, 'EXITING' (player dies at tape end while touching open exit). Or else DEFEATED should be defined as 'player dies at tape end while not touching exit'.

Because: I always take the opportunity to explode the exit by having a bomb fall on it as I win, or setting off a dynamite, or a bug or spaceship is touching the exit at the same time as I leave. I expect this can easily set off timing incompatibilities in engines, so it's basically an INCOMPATIBLE not a DEFEATED. But it actually takes human interpretation to tell them apart, so it's good to flag this as a separate 4th kind of failure.

(Example where EXITING is just INCOMPLETE: player got to open exit but, because some penguins had died, cannot exit; sits there and is soon killed by wandering enemy. Actually I forget whether dead penguins actually prevent you from exiting or only from winning on exiting; but I'm pretty sure there's at least one case where you've gotten the exits open but still are not allowed to depart...)

... and it seems (without looking at this part of the code) that it would be pretty easy to report these differences; there's nothing subtle or complicated to mine out of the data.

An alternative way to present this would be to just add the player's state of health & the timecode of the end of play (not end of tape) to each line; the first 3 failure states can be post-interpreted from those. But not EXITING, and the interpretation would only be second- not frame-accurate; so I think interpreting it inside the engine, where all is known, is better.
filbo
Posts: 428
Joined: Fri Jun 20, 2014 10:06 am

Re: please report *nature* of failure in tape testing

Post by filbo »

Regression test result: 33-of-38 tapes which were last solved by 4.1.4.2 are now solved by 4.2.0.2; 5 not fixed; no other regressions except one new debug msg:
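For anyone wanting to do the same kind of comparison, this is roughly the set arithmetic involved; it is a sketch, not my actual script. The function name `compare_runs` and the tape identifiers in the usage example are invented, and it assumes you have already collected the set of solved tapes for each version.

```python
from typing import Dict, List, Set


def compare_runs(reference: Set[str], candidate: Set[str]) -> Dict[str, List[str]]:
    """Compare solved-tape sets from two versions (hypothetical helper).

    reference: tapes solved by the older, known-good version.
    candidate: tapes solved by the version under test.
    """
    return {
        "still_solved": sorted(reference & candidate),
        "regressed": sorted(reference - candidate),     # solved before, not now
        "newly_solved": sorted(candidate - reference),  # not solved before, solved now
    }


# Usage with made-up tape identifiers:
old = {"set_a/001", "set_a/002", "set_b/007"}
new = {"set_a/001", "set_b/007", "set_b/008"}
report = compare_runs(old, new)
```

With the values above, `report["regressed"]` is `["set_a/002"]` and `report["newly_solved"]` is `["set_b/008"]`.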

rocksndiamonds: DEBUG: THIS SHOULD ONLY HAPPEN WITH PRE-1.2 LEVEL TAPES. [8]

... on rnd_jorge_jordan level 7. Which did solve despite the warning...
Holger
Site Admin
Posts: 3360
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany

Re: please report *nature* of failure in tape testing

Post by Holger »

filbo wrote:
"Regression test result: 33-of-38 tapes which were last solved by 4.1.4.2 are now solved by 4.2.0.2; 5 not fixed;"

Already wrote about it to you by PM, but I could also just post it here for public interest, as it is no secret. ;-)

Here we go:

With the new version 4.2.0.2 (released last night), all but five of the 60 EMC tapes broken by 4.2.0.0 should be fine again.

The tapes still broken should be the following:

- emc_diamond_mine 028          - fake acid that is solid for robots
- emc_down_under_mine_17 068    - fake acid that is solid for gems
- emc_emerald_eater_2 034       - fake acid that is solid for robots
- emc_emerald_mine_03 042       - fake acid that is solid for eaters
- emc_exception_1 025           - fake acid that is solid for gems

These are all caused by the fixed "fake acid" element, which should behave like empty space, but did not before 4.2.0.0. I don't want to add compatibility code for this one, for two reasons:

First, it will affect MANY parts of the EM engine, and cannot easily be added (especially LOTS of "case" statements). Search for "Xfake_acid" in src/game_em/logic.c, if you like.

Second, this one was really wrong before 4.2.0.0, and there should be no solution tape that shows element behaviour that can never be reproduced by playing the corresponding level (while all other changes are just minor differences that won't really be noticed in the tape, like android elements entering or not entering acid splashes etc.).

So these five tapes would just have to be re-recorded to have a solution tape for that level.
Holger
Site Admin
Posts: 3360
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany

Re: please report *nature* of failure in tape testing

Post by Holger »

... and regarding your first post about "reporting the nature of failing tape": I can only agree, and it should be implemented just as you've described it.

And I really should add the time (in "MM:SS" format) where a tape broke before it ended, because this was always the point in tape replay I was looking for when replaying those broken tapes to find the reason why it failed. (In the end, I had to use "git bisect" in most cases to find out what broke it; I have no idea how I could live without that before using git!)
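Just to illustrate the kind of conversion meant here, a tiny sketch that turns the frame at which replay broke into "MM:SS". The name `format_break_time` is hypothetical, and the default of 50 frames per second is an assumption for the example, not a value taken from the engine source.

```python
def format_break_time(frame: int, frames_per_second: int = 50) -> str:
    """Format a frame counter as MM:SS (hypothetical helper).

    frames_per_second defaults to an assumed 50; adjust to the real engine rate.
    """
    total_seconds = frame // frames_per_second
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"
```

Second-level accuracy is enough for jumping to the right spot in a replay, even though it is too coarse for telling the failure classes apart.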
Holger
Site Admin
Posts: 3360
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany

Re: please report *nature* of failure in tape testing

Post by Holger »

filbo wrote:
"no other regressions except one new debug msg:

rocksndiamonds: DEBUG: THIS SHOULD ONLY HAPPEN WITH PRE-1.2 LEVEL TAPES. [8]

... on rnd_jorge_jordan level 7. Which did solve despite the warning..."

No idea about this, but it's not new: I get this debug message for all versions back to 4.0.0.0 (and cannot test for older versions, as newly built 3.3.1.2 segfaults on my current system, unfortunately, and I don't want to debug this further now).

It's an "#if DEBUG" message, and release packages are compiled without "DEBUG", so it will only be active for non-release builds... :-/

Nevertheless, it's strange, as the tape is definitely not a "pre-1.2" one:

aeglos@isengard:~/.rocksndiamonds/tapes$ file rnd_jorge_jordan/007.tape
rnd_jorge_jordan/007.tape: Rocks'n'Diamonds tape file, version 3.3.0.1, engine 2.1.1.0, date 20110206, level set "rnd_jorge_jordan", level 7
aeglos@isengard:~/.rocksndiamonds/tapes$ 
No idea what went wrong with that tape...
filbo
Posts: 428
Joined: Fri Jun 20, 2014 10:06 am

Re: please report *nature* of failure in tape testing

Post by filbo »

Ugh, looking at that code I can see it would be a pain to fix the fake acid thing. So much code repetition!

The '1.2 tape' thing isn't a regression, just my mistake. My script didn't pick up that string until I applied my printf->Error() patch. Now it does; all 14 releases I'm testing emit that message...