Editor scrolling crash - debugged in-depth

Discussion around programming R'n'D, its source code and its tools.

Moderators: Zomis, Flumminator

Post Reply
Tomi
Posts: 339
Joined: Wed Aug 03, 2005 3:37 pm
Location: Slovakia

Editor scrolling crash - debugged in-depth

Post by Tomi » Fri Jan 02, 2009 3:47 pm

A few days ago I started playing 3.2.6.0. Unfortunately, it crashed with "Floating point exception". A lot. Not just in the editor (as already reported), even in game. If I'm not mistaken, it crashes when you collect an inventory object or key while scrolling up/left. But not always.
(Note: Others have reported problems on Windows, and/or with Nvidia cards. It crashes on Linux with a non-Nvidia card too. And I don't remember any crashes before 3.2.6.0, although I haven't played RnD in a long time.)

Well, the game was too unstable to be playable, so I decided to debug this. I'm posting this here in hopes it may be helpful to Holger, and also that someone might be able to explain this to me, because I certainly don't understand the result.
I concluded the editor crashes and game crashes probably have the same origin, and decided to investigate the editor, as that's much easier to reproduce.

Using gdb, I found out that a floating point exception occurs in tools.c:942, in function getSizedGraphicSource, because width_div is zero. (Sometimes it's height_div.) But how can it be zero? Maybe offset_calc_pos is out of offset_calc's bounds? Nope. The actual content of offset_calc changed, and it contains completely bogus values.

Using a conditional breakpoint at tools.c:934 (with an empty temporary function: if(offset_calc[0].width_div!=16) tmp_func();):

Code: Select all

$ gdbtui ./rocksndiamonds
[...]

Breakpoint 1, tmp_func () at tools.c:917
(gdb) up
#1  0x08056be2 in getSizedGraphicSource (graphic=765, frame=0, tilesize_raw=16, bitmap=0xbfcfeee0, x=0xbfcfeedc, y=0xbfcfeed8)
    at tools.c:936
(gdb) print offset_calc
$1 = {{width_mult = 15, width_div = 512, height_mult = 640, height_div = 0}, {width_mult = 178999928, width_div = 0, 
    height_mult = 178999928, height_div = -1209309548}, {width_mult = 179026168, width_div = 179026168, height_mult = -1076892104, 
    height_div = -1209533302}, {width_mult = 179026168, width_div = -1076892052, height_mult = 179026168, height_div = -1076891968}, {
    width_mult = -1076891896, width_div = 548, height_mult = 6, height_div = -1221870324}, {width_mult = -1076891240, 
    width_div = -1076891080, height_mult = -1076892040, height_div = -1209309548}}
What happened? offset_calc has completely incorrect values, including several *_div=0, right after its initialization! This smells like memory corruption. (Or am I jumping to conclusions?)

Code: Select all

$ valgrind -v ./rocksndiamonds
[...]
--4033-- Command line
--4033--    ./rocksndiamonds
--4033-- Startup, with flags:
--4033--    -v
--4033-- Contents of /proc/version:
--4033--   Linux version 2.6.27-tbARCH (root@kalafunos) (gcc version 4.3.2 (GCC) ) #1 SMP PREEMPT Wed Dec 24 13:26:11 CET 2008
--4033-- Arch and hwcaps: X86, x86-sse1-sse2
[...valgrind initialization...]
[...rocksndiamonds starts, but when I scroll left in the editor:...]
==4415== Thread 1:
==4415== Invalid write of size 4
==4415==    at 0x8056BC1: getSizedGraphicSource (tools.c:932)
==4415==    by 0x8056EEF: getMiniGraphicSource (tools.c:980)
==4415==    by 0x8057519: DrawMiniGraphicExt (tools.c:1111)
==4415==    by 0x80574C6: DrawMiniGraphic (tools.c:1102)
==4415==    by 0x8059979: DrawMiniElementOrWall (tools.c:1813)
==4415==    by 0x80AC7EE: ScrollMiniLevel (editor.c:5186)
==4415==    by 0x80BC19C: HandleControlButtons (editor.c:10970)
==4415==    by 0x80DA71F: HandleGadgets (gadgets.c:1859)
==4415==    by 0x80D91A2: ClickOnGadget (gadgets.c:1426)
==4415==    by 0x80BD1D3: HandleLevelEditorKeyInput (editor.c:11462)
==4415==    by 0x8054905: HandleKey (events.c:875)
==4415==    by 0x805386B: HandleKeyEvent (events.c:370)
==4415==  Address 0xbef7704c is just below the stack ptr.  To suppress, use: --workaround-gcc296-bugs=yes
==4415== 
==4415== Conditional jump or move depends on uninitialised value(s)
==4415==    at 0x4153D8B: SDL_UpperBlit (in /usr/lib/libSDL-1.2.so.0.11.2)
==4415==    by 0x80EDB8C: SDLCopyArea (sdl.c:380)
==4415==    by 0x80D34AF: sysCopyArea (system.c:326)
==4415==    by 0x80D3464: BlitBitmap (system.c:520)
==4415==    by 0x8057557: DrawMiniGraphicExt (tools.c:1112)
==4415==    by 0x80574C6: DrawMiniGraphic (tools.c:1102)
[...clipped, much more errors...]
Invalid write of size 4
at 0x8056BC1: getSizedGraphicSource (tools.c:932)

Note that this is the first reported error. No memory corruption issues were reported prior to it.

Code: Select all

$ gdbtui ./rocksndiamonds
[...]
(gdb) x 0x8056BC1
0x8056bc1 <getSizedGraphicSource+61>:   0x158ba5f3
(gdb) disassemble 0x8056BC1
Dump of assembler code for function getSizedGraphicSource:
0x08056b84 <getSizedGraphicSource+0>:   push   %ebp
0x08056b85 <getSizedGraphicSource+1>:   mov    %esp,%ebp
0x08056b87 <getSizedGraphicSource+3>:   push   %edi
0x08056b88 <getSizedGraphicSource+4>:   push   %esi
0x08056b89 <getSizedGraphicSource+5>:   sub    $0xf0,%esp
0x08056b8f <getSizedGraphicSource+11>:  lea    -0xb8(%ebp),%eax
0x08056b95 <getSizedGraphicSource+17>:  mov    %eax,-0xc0(%ebp)
0x08056b9b <getSizedGraphicSource+23>:  movl   $0x8152900,-0xc4(%ebp)
0x08056ba5 <getSizedGraphicSource+33>:  movl   $0x18,-0xc8(%ebp)
0x08056baf <getSizedGraphicSource+43>:  mov    -0xc0(%ebp),%edi
0x08056bb5 <getSizedGraphicSource+49>:  mov    -0xc4(%ebp),%esi
0x08056bbb <getSizedGraphicSource+55>:  mov    -0xc8(%ebp),%ecx
0x08056bc1 <getSizedGraphicSource+61>:  rep movsl %ds:(%esi),%es:(%edi)
[...clipped...]
(gdb) info line tools.c:933
Line 933 of "tools.c" starts at address 0x8056bc3 <getSizedGraphicSource+63> and ends at 0x8056bd8 <getSizedGraphicSource+84>.
Note: I don't understand assembler. I have no idea what 'rep movsl' does. (Probably 'repeat move'?)

The last command says that line 933 starts at getSizedGraphicSource+63, that is, the invalid write from getSizedGraphcSource+61 occurs before line 933.
So it seems there's only one statement where it can be:

Code: Select all

/* tools.c:920 */
  struct
  {
    int width_mult, width_div;
    int height_mult, height_div;
  }
  offset_calc[6] =
  {
    { 15, 16,   2, 3    },      /* 1 x 1 */
    { 7, 8,     2, 3    },      /* 2 x 2 */
    { 3, 4,     2, 3    },      /* 4 x 4 */
    { 1, 2,     2, 3    },      /* 8 x 8 */
    { 0, 1,     2, 3    },      /* 16 x 16 */
    { 0, 1,     0, 1    },      /* 32 x 32 */
  };
Huh? That looks like a perfectly normal structure initialization. The way I see it, either gcc created bad code (yeah right), or there's some obscure heap/callstack memory corruption going on.
Next, I tried looking at gcc's assembler output:

Code: Select all

$ gcc -o tools.S -S tools.c -DTARGET_SDL `sdl-config --cflags`
/* tools.S:2275..2320 */
        .section        .rodata
        .align 32
        .type   C.241.10079, @object
        .size   C.241.10079, 96
C.241.10079:
        .long   15
        .long   16
        .long   2
[...initial offset_calc data snipped...]
        .long   1
        .long   0
        .long   1
        .text
.globl getSizedGraphicSource
        .type   getSizedGraphicSource, @function
getSizedGraphicSource:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        subl    $240, %esp
        leal    -184(%ebp), %eax
        movl    %eax, -192(%ebp)
        movl    $C.241.10079, -196(%ebp)
        movl    $24, -200(%ebp)
        movl    -192(%ebp), %edi
        movl    -196(%ebp), %esi
        movl    -200(%ebp), %ecx
        rep movsl
Again, I don't understand assembler, but it seems that this 'rep movsl' gets its source address (that is, $C.241.10079) from %esi, its destination address (-184(%ebp)) from %edi, and the size ($24) in %ecx. What invalid write could happen here? Does %ebp itself have a wrong value? (I think that's some sort of stack pointer, or something.)

I don't have any idea what's going on. I don't know assembler, I don't know gcc's internals, I don't know rocksndiamonds code in much depth. If anyone finds out what's happening, I'm very interested in finding out.

To conclude, I found a workaround: declare offset_calc as const (that is, replace "struct" with "const struct") and the "invalid write" disappears, along with all the FPEs and display bugs. However, I strongly discourage simply using this "fix" and forgetting about the bug. I think it signals a deeper issue, and should be fixed once and for all.

Thank you for reading this, and sorry if it was too long. Information in the first paragraph was probably better suited for the Bug Reports section, but the bulk of the post belongs to Programmer's Corner.

Daniel H.
Posts: 535
Joined: Sun Apr 02, 2006 7:13 pm
Location: USA

Post by Daniel H. » Sat Jan 03, 2009 6:15 pm

I think that it's good to post that in Bug Reports like you did rather than splitting it up into multiple topics, since most of what you posted is (probably) relevant to the bug. :)

dave
Posts: 77
Joined: Mon Aug 13, 2007 2:06 am
Contact:

Re: Editor scrolling crash - debugged in-depth

Post by dave » Tue Jan 06, 2009 3:37 am

Tomi wrote:

Code: Select all

$ gdbtui ./rocksndiamonds
[...]
(gdb) x 0x8056BC1
0x8056bc1 <getSizedGraphicSource+61>:   0x158ba5f3
(gdb) disassemble 0x8056BC1
Dump of assembler code for function getSizedGraphicSource:
0x08056b84 <getSizedGraphicSource+0>:   push   %ebp
0x08056b85 <getSizedGraphicSource+1>:   mov    %esp,%ebp
0x08056b87 <getSizedGraphicSource+3>:   push   %edi
0x08056b88 <getSizedGraphicSource+4>:   push   %esi
0x08056b89 <getSizedGraphicSource+5>:   sub    $0xf0,%esp
0x08056b8f <getSizedGraphicSource+11>:  lea    -0xb8(%ebp),%eax
0x08056b95 <getSizedGraphicSource+17>:  mov    %eax,-0xc0(%ebp)
0x08056b9b <getSizedGraphicSource+23>:  movl   $0x8152900,-0xc4(%ebp)
0x08056ba5 <getSizedGraphicSource+33>:  movl   $0x18,-0xc8(%ebp)
0x08056baf <getSizedGraphicSource+43>:  mov    -0xc0(%ebp),%edi
0x08056bb5 <getSizedGraphicSource+49>:  mov    -0xc4(%ebp),%esi
0x08056bbb <getSizedGraphicSource+55>:  mov    -0xc8(%ebp),%ecx
0x08056bc1 <getSizedGraphicSource+61>:  rep movsl %ds:(%esi),%es:(%edi)
[...clipped...]
i know a little assembly. that code fragment looks like a memcpy() from some address 0x8152900 onto the stack -0xb8(%ebp) length 96 (0x18*4) bytes. could also be a struct assignment? make sure you compile with -g

katebushlover
Posts: 2
Joined: Thu Dec 18, 2008 12:27 am

Post by katebushlover » Tue Jan 06, 2009 7:12 pm

See my message in bug reports where I have attempted to explain as best I could in English what happens - for me, essentially it crashes whenever the screen area needs to scroll either down or right (tho strangely not up or left) AND the direction button is held down while the screen scrolls. There may be other conditions but i can always make it crash on levels larger than 15x15 (which dont involve screen scrolling and never crash, at least not for me)!

Tomi
Posts: 339
Joined: Wed Aug 03, 2005 3:37 pm
Location: Slovakia

Re: Editor scrolling crash - debugged in-depth

Post by Tomi » Sat Jan 10, 2009 3:47 pm

dave wrote:i know a little assembly. that code fragment looks like a memcpy() from some address 0x8152900 onto the stack -0xb8(%ebp) length 96 (0x18*4) bytes. could also be a struct assignment? make sure you compile with -g
Then my guess was correct, thanks. The question is, why does it generate an "invalid write"? 0x8152900 (AKA $C.241.10079 in gcc -S output, and yes, it is offset_calc's initial data) is a constant address. The only variable in the code is %ebp. Is the stack somehow corrupted? That is what I don't understand. Sigh...
BTW, what does 'lea' do? Can it change the value from -0xb8(%ebp) somehow? Could that be the problem?

Oh, and if I didn't compile with -g, there wouldn't be any symbols and function names like <getSizedGraphicSource+61> ;-)

> whenever the screen area needs to scroll either down or right (tho strangely not up or left)
That's strange, it's the exact opposite. (And it crashed at least once both in game and in editor, as mentioned above.)

[OT] Oh, and is it just me, or the forum's even quieter than usual? I don't consider myself a long-time member, but I don't remember it ever being this quiet. Even Holger's away. [/OT]

User avatar
Holger
Site Admin
Posts: 3225
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany
Contact:

Post by Holger » Wed Jan 28, 2009 9:13 am

Tomi, thanks a lot for your debugging work and effort! I hope (and think) that it will help me finally finding and removing that strange crash bug that apparently found its way into the new release version 3.2.6.0. :-o

> [OT] Oh, and is it just me, or the forum's even quieter than usual? I don't
> consider myself a long-time member, but I don't remember it ever being
> this quiet. Even Holger's away. [/OT]

Please see my corresponding Off-Topic post for some explanation...

dave
Posts: 77
Joined: Mon Aug 13, 2007 2:06 am
Contact:

Re: Editor scrolling crash - debugged in-depth

Post by dave » Tue Mar 24, 2009 1:50 am

Tomi wrote:That's strange, it's the exact opposite. (And it crashed at least once both in game and in editor, as mentioned above.)
i got the floating point exception, looked at the function it happened in, and it looks to me like a buffer overrun somewhere before this function. the struct was corrupted, and a 0 was used (from the struct) for divide, hence floating point exception.
buffer overruns can be a pain because the real error can be a long time before the actual crash. it is difficult to track down and i dont care to investigate the source further right now.

User avatar
Holger
Site Admin
Posts: 3225
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany
Contact:

Post by Holger » Tue Mar 31, 2009 8:04 pm

I've debugged this in depth now, too, and found that behaviour caused by a bug in the current version of SDL used by R'n'D. The problem occurs when blitting overlapping parts of a SDL surface (bitmap) from/to the same surface. In this case, either SDL_memcpy() or SDL_revcpy() is used, according to the blitting direction (which is scroll direction in R'n'D game or editor).

After playing with the inline assembler version of SDL_revcpy(), I finally found the cause for the bug that leads to memory corruption in later executed code (after executing SDL_revcpy()). When thinking about sending a patch to the SDL mailing list to fix it, I found that it is already fixed in SVN for about a year, but apparently nobody at the SDL team bothered to release a bugfix version 1.2.14 (which is a shame IMHO, as quite some other people ran into the same problem and wasted some time debugging it, as Google revealed later on).

I will release a new version 3.2.6.1 with a manually fixed SDL.dll soon.

dave
Posts: 77
Joined: Mon Aug 13, 2007 2:06 am
Contact:

Post by dave » Sun Apr 05, 2009 4:57 am

Holger wrote:The problem occurs when blitting overlapping parts of a SDL surface (bitmap) from/to the same surface.
are you talking about a bitmap (ie. depth of 1) surface ? i found them to be quite buggy. my workaround for bitmaps was to use an 8 bit depth with only the constants SDL_ALPHA_TRANSPARENT and SDL_ALPHA_OPAQUE for the alpha channel.

but i have never run into problems scrolling a surface/copying a surface to itself in general.

User avatar
Holger
Site Admin
Posts: 3225
Joined: Fri Jun 18, 2004 4:13 pm
Location: Germany
Contact:

Post by Holger » Sun Apr 05, 2009 11:45 am

> are you talking about a bitmap (ie. depth of 1) surface ?

Sorry, no, I was talking about an SDL surface with the same depth as the display (which is usually a 16 or 24 bit depth). I just added the more general term "bitmap" for those not familiar with SDL specific terms like "surface".

> but i have never run into problems scrolling a surface/copying a surface to
> itself in general.

Then you may have had luck that your code which scrolls your surfaces (implicitly using the buggy inline asm version of SDL_revcpy()) was followed by code that reset the x86 direction flag (using "cld" either explicitly by other inline asm code or by C code compiled by gcc) before another x86 opcode from the "movs" family was executed, therefore luckily *not* trashing your memory. (Or you used Linux, which apparently is not affected by this problem.)

Post Reply