Editor scrolling crash - debugged in-depth
Posted: Fri Jan 02, 2009 3:47 pm
A few days ago I started playing 3.2.6.0. Unfortunately, it crashed with "Floating point exception". A lot. Not just in the editor (as already reported), even in game. If I'm not mistaken, it crashes when you collect an inventory object or key while scrolling up/left. But not always.
(Note: Others have reported problems on Windows, and/or with Nvidia cards. It crashes on Linux with a non-Nvidia card too. And I don't remember any crashes before 3.2.6.0, although I haven't played RnD in a long time.)
Well, the game was too unstable to be playable, so I decided to debug this. I'm posting this here in hopes it may be helpful to Holger, and also that someone might be able to explain this to me, because I certainly don't understand the result.
I concluded the editor crashes and game crashes probably have the same origin, and decided to investigate the editor, as that's much easier to reproduce.
Using gdb, I found out that a floating point exception occurs in tools.c:942, in function getSizedGraphicSource, because width_div is zero. (Sometimes it's height_div.) But how can it be zero? Maybe offset_calc_pos is out of offset_calc's bounds? Nope. The actual content of offset_calc changed, and it contains completely bogus values.
Using a conditional breakpoint at tools.c:934 (with an empty temporary function: if(offset_calc[0].width_div!=16) tmp_func();):
What happened? offset_calc has completely incorrect values, including several *_div=0, right after its initialization! This smells like memory corruption. (Or am I jumping to conclusions?)
Invalid write of size 4
at 0x8056BC1: getSizedGraphicSource (tools.c:932)
Note that this is the first reported error. No memory corruption issues were reported prior to it.
Note: I don't understand assembler. I have no idea what 'rep movsl' does. (Probably 'repeat move'?)
The last command says that line 933 starts at getSizedGraphicSource+63, that is, the invalid write from getSizedGraphcSource+61 occurs before line 933.
So it seems there's only one statement where it can be:
Huh? That looks like a perfectly normal structure initialization. The way I see it, either gcc created bad code (yeah right), or there's some obscure heap/callstack memory corruption going on.
Next, I tried looking at gcc's assembler output:
Again, I don't understand assembler, but it seems that this 'rep movsl' gets its source address (that is, $C.241.10079) from %esi, its destination address (-184(%ebp)) from %edi, and the size ($24) in %ecx. What invalid write could happen here? Does %ebp itself have a wrong value? (I think that's some sort of stack pointer, or something.)
I don't have any idea what's going on. I don't know assembler, I don't know gcc's internals, I don't know rocksndiamonds code in much depth. If anyone finds out what's happening, I'm very interested in finding out.
To conclude, I found a workaround: declare offset_calc as const (that is, replace "struct" with "const struct") and the "invalid write" disappears, along with all the FPEs and display bugs. However, I strongly discourage simply using this "fix" and forgetting about the bug. I think it signals a deeper issue, and should be fixed once and for all.
Thank you for reading this, and sorry if it was too long. Information in the first paragraph was probably better suited for the Bug Reports section, but the bulk of the post belongs to Programmer's Corner.
(Note: Others have reported problems on Windows, and/or with Nvidia cards. It crashes on Linux with a non-Nvidia card too. And I don't remember any crashes before 3.2.6.0, although I haven't played RnD in a long time.)
Well, the game was too unstable to be playable, so I decided to debug this. I'm posting this here in hopes it may be helpful to Holger, and also that someone might be able to explain this to me, because I certainly don't understand the result.
I concluded the editor crashes and game crashes probably have the same origin, and decided to investigate the editor, as that's much easier to reproduce.
Using gdb, I found out that a floating point exception occurs in tools.c:942, in function getSizedGraphicSource, because width_div is zero. (Sometimes it's height_div.) But how can it be zero? Maybe offset_calc_pos is out of offset_calc's bounds? Nope. The actual content of offset_calc changed, and it contains completely bogus values.
Using a conditional breakpoint at tools.c:934 (with an empty temporary function: if(offset_calc[0].width_div!=16) tmp_func();):
Code: Select all
$ gdbtui ./rocksndiamonds
[...]
Breakpoint 1, tmp_func () at tools.c:917
(gdb) up
#1 0x08056be2 in getSizedGraphicSource (graphic=765, frame=0, tilesize_raw=16, bitmap=0xbfcfeee0, x=0xbfcfeedc, y=0xbfcfeed8)
at tools.c:936
(gdb) print offset_calc
$1 = {{width_mult = 15, width_div = 512, height_mult = 640, height_div = 0}, {width_mult = 178999928, width_div = 0,
height_mult = 178999928, height_div = -1209309548}, {width_mult = 179026168, width_div = 179026168, height_mult = -1076892104,
height_div = -1209533302}, {width_mult = 179026168, width_div = -1076892052, height_mult = 179026168, height_div = -1076891968}, {
width_mult = -1076891896, width_div = 548, height_mult = 6, height_div = -1221870324}, {width_mult = -1076891240,
width_div = -1076891080, height_mult = -1076892040, height_div = -1209309548}}
Code: Select all
$ valgrind -v ./rocksndiamonds
[...]
--4033-- Command line
--4033-- ./rocksndiamonds
--4033-- Startup, with flags:
--4033-- -v
--4033-- Contents of /proc/version:
--4033-- Linux version 2.6.27-tbARCH (root@kalafunos) (gcc version 4.3.2 (GCC) ) #1 SMP PREEMPT Wed Dec 24 13:26:11 CET 2008
--4033-- Arch and hwcaps: X86, x86-sse1-sse2
[...valgrind initialization...]
[...rocksndiamonds starts, but when I scroll left in the editor:...]
==4415== Thread 1:
==4415== Invalid write of size 4
==4415== at 0x8056BC1: getSizedGraphicSource (tools.c:932)
==4415== by 0x8056EEF: getMiniGraphicSource (tools.c:980)
==4415== by 0x8057519: DrawMiniGraphicExt (tools.c:1111)
==4415== by 0x80574C6: DrawMiniGraphic (tools.c:1102)
==4415== by 0x8059979: DrawMiniElementOrWall (tools.c:1813)
==4415== by 0x80AC7EE: ScrollMiniLevel (editor.c:5186)
==4415== by 0x80BC19C: HandleControlButtons (editor.c:10970)
==4415== by 0x80DA71F: HandleGadgets (gadgets.c:1859)
==4415== by 0x80D91A2: ClickOnGadget (gadgets.c:1426)
==4415== by 0x80BD1D3: HandleLevelEditorKeyInput (editor.c:11462)
==4415== by 0x8054905: HandleKey (events.c:875)
==4415== by 0x805386B: HandleKeyEvent (events.c:370)
==4415== Address 0xbef7704c is just below the stack ptr. To suppress, use: --workaround-gcc296-bugs=yes
==4415==
==4415== Conditional jump or move depends on uninitialised value(s)
==4415== at 0x4153D8B: SDL_UpperBlit (in /usr/lib/libSDL-1.2.so.0.11.2)
==4415== by 0x80EDB8C: SDLCopyArea (sdl.c:380)
==4415== by 0x80D34AF: sysCopyArea (system.c:326)
==4415== by 0x80D3464: BlitBitmap (system.c:520)
==4415== by 0x8057557: DrawMiniGraphicExt (tools.c:1112)
==4415== by 0x80574C6: DrawMiniGraphic (tools.c:1102)
[...clipped, much more errors...]
at 0x8056BC1: getSizedGraphicSource (tools.c:932)
Note that this is the first reported error. No memory corruption issues were reported prior to it.
Code: Select all
$ gdbtui ./rocksndiamonds
[...]
(gdb) x 0x8056BC1
0x8056bc1 <getSizedGraphicSource+61>: 0x158ba5f3
(gdb) disassemble 0x8056BC1
Dump of assembler code for function getSizedGraphicSource:
0x08056b84 <getSizedGraphicSource+0>: push %ebp
0x08056b85 <getSizedGraphicSource+1>: mov %esp,%ebp
0x08056b87 <getSizedGraphicSource+3>: push %edi
0x08056b88 <getSizedGraphicSource+4>: push %esi
0x08056b89 <getSizedGraphicSource+5>: sub $0xf0,%esp
0x08056b8f <getSizedGraphicSource+11>: lea -0xb8(%ebp),%eax
0x08056b95 <getSizedGraphicSource+17>: mov %eax,-0xc0(%ebp)
0x08056b9b <getSizedGraphicSource+23>: movl $0x8152900,-0xc4(%ebp)
0x08056ba5 <getSizedGraphicSource+33>: movl $0x18,-0xc8(%ebp)
0x08056baf <getSizedGraphicSource+43>: mov -0xc0(%ebp),%edi
0x08056bb5 <getSizedGraphicSource+49>: mov -0xc4(%ebp),%esi
0x08056bbb <getSizedGraphicSource+55>: mov -0xc8(%ebp),%ecx
0x08056bc1 <getSizedGraphicSource+61>: rep movsl %ds:(%esi),%es:(%edi)
[...clipped...]
(gdb) info line tools.c:933
Line 933 of "tools.c" starts at address 0x8056bc3 <getSizedGraphicSource+63> and ends at 0x8056bd8 <getSizedGraphicSource+84>.
The last command says that line 933 starts at getSizedGraphicSource+63, that is, the invalid write from getSizedGraphcSource+61 occurs before line 933.
So it seems there's only one statement where it can be:
Code: Select all
/* tools.c:920 */
struct
{
int width_mult, width_div;
int height_mult, height_div;
}
offset_calc[6] =
{
{ 15, 16, 2, 3 }, /* 1 x 1 */
{ 7, 8, 2, 3 }, /* 2 x 2 */
{ 3, 4, 2, 3 }, /* 4 x 4 */
{ 1, 2, 2, 3 }, /* 8 x 8 */
{ 0, 1, 2, 3 }, /* 16 x 16 */
{ 0, 1, 0, 1 }, /* 32 x 32 */
};
Next, I tried looking at gcc's assembler output:
Code: Select all
$ gcc -o tools.S -S tools.c -DTARGET_SDL `sdl-config --cflags`
/* tools.S:2275..2320 */
.section .rodata
.align 32
.type C.241.10079, @object
.size C.241.10079, 96
C.241.10079:
.long 15
.long 16
.long 2
[...initial offset_calc data snipped...]
.long 1
.long 0
.long 1
.text
.globl getSizedGraphicSource
.type getSizedGraphicSource, @function
getSizedGraphicSource:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
subl $240, %esp
leal -184(%ebp), %eax
movl %eax, -192(%ebp)
movl $C.241.10079, -196(%ebp)
movl $24, -200(%ebp)
movl -192(%ebp), %edi
movl -196(%ebp), %esi
movl -200(%ebp), %ecx
rep movsl
I don't have any idea what's going on. I don't know assembler, I don't know gcc's internals, I don't know rocksndiamonds code in much depth. If anyone finds out what's happening, I'm very interested in finding out.
To conclude, I found a workaround: declare offset_calc as const (that is, replace "struct" with "const struct") and the "invalid write" disappears, along with all the FPEs and display bugs. However, I strongly discourage simply using this "fix" and forgetting about the bug. I think it signals a deeper issue, and should be fixed once and for all.
Thank you for reading this, and sorry if it was too long. Information in the first paragraph was probably better suited for the Bug Reports section, but the bulk of the post belongs to Programmer's Corner.