Internationalization

Post by **Francesco** » Mon Aug 27, 2007 12:05 pm

Hi all,
I had a look at the source code of RnD to find a quick and simple way to add some internationalization.

Well, the string literals are hardcoded in the source... At first I thought about a simple parser that scans all the source files, extracts the strings (except those in comments and include directives, of course) and then replaces them by looking them up in a table that maps the English strings to those in another language, but that would mean that each language needs a separate source tree and a separate build.

Then I realized that no string literal is used as it is: each one is "translated" into a sequence of pictures by some function. There should be some "printf" somewhere for the error messages dumped to a file, but that would be no big problem. My idea is simply to add a function that "translates" the string literals at runtime, and to call it from the (I hope) few functions that handle those strings.

Let me know what you think about this idea.

In the meantime, I am putting together a simple parser to extract the strings from the source code, then I'll build a table with the corresponding Italian strings.

Still, there is the problem of special characters, like the Italian stressed (accented) vowels, but, at least for Italian, they can be avoided (or replaced with the corresponding vowel followed by a single quote), so the texts could still be completely understood.
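
The "vowel followed by a single quote" replacement could look roughly like this. It is only a sketch: the helper name `replace_accents` is made up for illustration, and it assumes the text is ISO-Latin-1 encoded, one byte per character.

```cpp
#include <string>

// Hypothetical helper: replaces ISO-Latin-1 accented vowels with the plain
// vowel followed by a single quote, so "citt\xE0" becomes "citta'".
// Assumption: the input is ISO-Latin-1 encoded (one byte per character).
std::string replace_accents(const std::string& text)
{
    std::string result;
    for (unsigned char ch : text) {
        char vowel = 0;
        switch (ch) {
            case 0xC0: case 0xC1: vowel = 'A'; break;  // A grave, A acute
            case 0xC8: case 0xC9: vowel = 'E'; break;  // E grave, E acute
            case 0xCC: case 0xCD: vowel = 'I'; break;  // I grave, I acute
            case 0xD2: case 0xD3: vowel = 'O'; break;  // O grave, O acute
            case 0xD9: case 0xDA: vowel = 'U'; break;  // U grave, U acute
            case 0xE0: case 0xE1: vowel = 'a'; break;  // a grave, a acute
            case 0xE8: case 0xE9: vowel = 'e'; break;  // e grave, e acute
            case 0xEC: case 0xED: vowel = 'i'; break;  // i grave, i acute
            case 0xF2: case 0xF3: vowel = 'o'; break;  // o grave, o acute
            case 0xF9: case 0xFA: vowel = 'u'; break;  // u grave, u acute
            default: break;
        }
        if (vowel != 0) {
            result += vowel;
            result += '\'';
        }
        else {
            result += static_cast<char>(ch);
        }
    }
    return result;
}
```

Every other byte is copied unchanged, so plain ASCII strings pass through as they are.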

Post by **Francesco** » Mon Aug 27, 2007 3:35 pm

I've just finished parsing "editor.c"... well, over 770 strings, though there are some duplicates...

Maybe a runtime function that translates string to string on the fly would be too slow. Besides, if each string in the source code were replaced by a macro representing an integer, the table lookup would be much faster... but then all the code that handles those strings would have to be changed... oh well...

Maybe a hash_map<char*, char*> could do the trick, but I don't think I am able to do such a thing in C.

By the way, if anybody cares, here is the parsing program I have just written:
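
The integer-based lookup could be sketched like this. All the names (`MessageId`, `Language`, `messages`, `message`) are hypothetical, and the Italian texts are only illustrative translations.

```cpp
// Hypothetical message identifiers: each one would replace a string literal
// in the source code.
enum MessageId { MSG_AS_TEMPLATE, MSG_AND_NOT_TO_FORGET, MSG_COUNT };

enum Language { LANG_ENGLISH, LANG_ITALIAN, LANG_COUNT };

// One row per language, one column per message.
// The Italian texts are illustrative, not taken from any real catalog.
static const char* const messages[LANG_COUNT][MSG_COUNT] = {
    { "As Template",  "And not to forget:" },
    { "Come modello", "E da non dimenticare:" },
};

static Language current_language = LANG_ENGLISH;

// The lookup is just two array indexes, with no string comparison at all.
const char* message(MessageId id)
{
    return messages[current_language][id];
}
```

The price is the one already mentioned: every call site has to pass `MSG_AS_TEMPLATE` instead of the literal, and the source becomes harder to read.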

Code: Select all

#include <iostream>
#include <fstream>

using namespace std;

ostream* out = 0;
istream* in = 0;

enum Action {
    get_string_literal,
    ignore_rest_of_line,
    ignore_comment,
    get_next
};

void Parse()
{
    char ch = ' ';
    Action action = get_next;
    while ( in->good() ) {
        if ( action == get_next ) {
            // stop at end of file instead of re-processing a stale character
            if ( !in->get ( ch ) ) {
                break;
            }
            switch ( ch ) {
                case '#':
                    action = ignore_rest_of_line;
                    break;
                case '"':
                    action = get_string_literal;
                    break;
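                case '\'':
                    // skip the whole character literal, including '"' and '\''
                    while ( in->get ( ch ) && ch != '\'' ) {
                        if ( ch == '\\' ) {
                            in->get ( ch );
                        }
                    }
                    break;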
                case '/':
                    in->get ( ch );
                    if ( ch == '*' ) {
                        action = ignore_comment;
                    }
                    else if ( ch == '/' ) {
                        action = ignore_rest_of_line;
                    }
                    else {
                        in->putback( ch );
                    }
                    break;
                default:
                    break;
            }
        }
        else if ( action == ignore_rest_of_line ) {
            cout << 'X';
            while ( in->good() ) {
                in->get ( ch );
                if ( ( ch == '\n' ) || ( ch == '\r' ) ) {
                    action = get_next;
                    break;
                }
            }
        }
        else if ( action == ignore_comment ) {
            cout << 'C';
            while ( in->good() ) {
                in->get ( ch );
                if ( ch == '*' ) {
                    in->get ( ch );
                    if ( ch == '/' ) {
                        action = get_next;
                        break;
                    }
                    else {
                        in->putback( ch );
                    }
                }
            }
        }
        else if ( action == get_string_literal ) {
            cout << 'S';
            *out << '"';
            while ( in->get ( ch ) ) {
                *out << ch;
                if ( ch == '\\' ) {
                    // copy the escaped character too, so that \" does not end the literal
                    if ( in->get ( ch ) ) {
                        *out << ch;
                    }
                }
                else if ( ch == '"' ) {
                    action = get_next;
                    break;
                }
            }
            *out << "\n";
        }
    }
}


int main( int argc, char* argv[] )
{
    ifstream infile( "editor.c", ios::in );
    ofstream outfile( "output.txt", ios::trunc );
    if ( !infile || !outfile ) {
        cerr << "Cannot open editor.c or output.txt\n";
        return 1;
    }

    in = &infile;
    out = &outfile;
    cout << "Begun!\n\n";
    Parse();
    cout << "\n\nFinished!\n";
    return 0;
}

Post by **Francesco** » Mon Aug 27, 2007 9:37 pm

Damn... here is the list of all string literals found in the main source folder of RnD 3.2.3:
http://www.zomis.net/rnd/info.php?f=677
...it's about 10,000 different literals spread over 131 files!

OK, most of them won't need to be translated, but hey... it would be tough work just to separate those that need a translation from those that have to stay unchanged... well, let's hear from the project's lead engineer about all of this stuff.

Post by **Francesco** » Mon Aug 27, 2007 10:16 pm

Line 514:

Code: Select all

[   1#,    1F]	"And not to forget:"
[   2#,    1F]	"Artsoft"
[   1#,    1F]	"As Template"

I wrote: separate those that need a translation from those that have to stay unchanged

I think I could do that, though.

Post by **Holger** » Mon Aug 27, 2007 10:44 pm

Indeed, i18n has been one of the top TODOs of R'n'D for a long time now... ;-) :-/

In any case, it should be done in a generic way, like this:

Replace all strings

Code: Select all

"string"

by

Code: Select all

_("string")

and add a function

Code: Select all

_()

to use the string as a hash key to select the corresponding localized version of the string from a set of language catalog files. Although this does not work for all cases (for example, two strings may be identical in the language used in the source code, but have two different meanings (or grammatical forms) in another language), it probably would work for the current R'n'D.
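
A minimal sketch of such a `_()` function, assuming the catalogs are already in memory. The `Catalog` type, `add_translation()` and `current_language` are hypothetical names, and a real version would fill the catalogs from the language catalog files.

```cpp
#include <string>
#include <unordered_map>

// A catalog maps each source string to its localized version.
typedef std::unordered_map<std::string, std::string> Catalog;

// Hypothetical in-memory catalogs, keyed by language code.
static std::unordered_map<std::string, Catalog> catalogs;

static std::string current_language = "en";

// Hypothetical loader hook: registers one translation for one language.
void add_translation(const std::string& language,
                     const std::string& source,
                     const std::string& localized)
{
    catalogs[language][source] = localized;
}

// Uses the source string as the hash key. Falls back to the source string
// itself when the language or the entry is missing. The returned pointer
// stays valid as long as the catalog entry is not changed or removed.
const char* _(const char* source)
{
    auto catalog = catalogs.find(current_language);
    if (catalog == catalogs.end()) {
        return source;
    }
    auto entry = catalog->second.find(source);
    if (entry == catalog->second.end()) {
        return source;
    }
    return entry->second.c_str();
}
```

Because the key is the English text alone, two identical source strings always get the same translation, which is exactly the limitation described above.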

> still, there is the problem about special characters, like the italian stressed
> vowels

This problem can be solved by using extended font (image) files and a corresponding definition in "graphicsinfo.conf" (at least for custom graphics sets which then should be used to override the original one), or by extending the classic font image files, if the language only uses characters from ISO-Latin-1 (which also contains those special Italian characters not included in the original R'n'D font image file). For example, the new EMC set (yet to be released) will contain an extended font to display text messages also using ISO-Latin-1 characters -- these can already be defined in the current R'n'D like this:

font.xyz.frames: 224
font.xyz.frames_per_line: 16

This example defines a font with 224 characters (skipping the first 32 non-printable characters), featuring all ISO-Latin-1 characters.

The main problem is that this cannot easily be changed (extended) in the main R'n'D graphics definitions, as they would then be inherited by existing artwork sets which do not have enough character graphics defined in their image files. But in principle this is already supported, although currently not used.
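
The arithmetic behind those two values can be sketched as follows. This assumes the frames are stored row by row in character-code order, starting at code 32; the struct and function names are hypothetical and not taken from the R'n'D source.

```cpp
// Position of a character inside the font image, counted in character cells.
struct FramePosition {
    int column;
    int row;
};

const int FIRST_CHAR = 32;        // the first 32 non-printable characters are skipped
const int FRAMES = 224;           // font.xyz.frames
const int FRAMES_PER_LINE = 16;   // font.xyz.frames_per_line

// Returns false when the character has no frame in the font image.
// Assumption: frames are laid out row by row, in character-code order.
bool frame_position(unsigned char ch, FramePosition& pos)
{
    int frame = ch - FIRST_CHAR;
    if (frame < 0 || frame >= FRAMES) {
        return false;
    }
    pos.column = frame % FRAMES_PER_LINE;
    pos.row = frame / FRAMES_PER_LINE;
    return true;
}
```

Under that assumption 'A' (code 65) is frame 33, so column 1 of row 2, and the ISO-Latin-1 "a grave" (code 224) is frame 192, so column 0 of row 12.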

The real problems then arise with languages not covered by ISO-Latin-1... :-/

Other problems are the limited screen space, for example in the level editor -- using words with more characters than their English counterparts will cause problems in quite a few places in the program.

These are probably some of the reasons why I haven't implemented i18n yet...

Post by **Francesco** » Mon Aug 27, 2007 11:55 pm

Holger wrote: it probably would work for the current R'n'D

You mean that I could do some work there and expect to see a next RnD version with a language selector?

I could separate the main things (such as the setup-tree strings and the editor strings), and translate them, but only if you say that it can be easily plugged into the existing source code.

Making a program to translate the dictionary, one that warns about the original length of each string, would be the most fun part!

I can already imagine its messages: "You have translated 1% of the dictionary" and after some hours: "You have translated 1.2% of the dictionary"

:jokes:

No, really, should I work on it? Of course, I would share the program, so we can have it translated at least into the major European languages (no offence to anyone, of course)...

I have been asked for the documentation in Italian and in French (which I won't be writing any time soon, for sure) but compared to that, translating the interface into Italian, French and Spanish would be done in almost no time.
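
The length warning mentioned above could boil down to a check like this one. It is a sketch only: `excess_length` is a made-up name, and it assumes a single-byte encoding such as ISO-Latin-1, so that one byte is one character on screen.

```cpp
#include <cstddef>
#include <string>

// Returns by how many characters the translation exceeds the original,
// or 0 when it fits in the same space.
// Assumption: single-byte encoding, so size() equals the number of characters.
std::size_t excess_length(const std::string& original,
                          const std::string& translation)
{
    if (translation.size() <= original.size()) {
        return 0;
    }
    return translation.size() - original.size();
}
```

The translating program would show a warning whenever the result is not zero, for example for "As Template" (11 characters) against an illustrative "Come modello" (12 characters).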

Post by **Tomi** » Tue Aug 28, 2007 8:13 am

The problems Holger mentioned are all real, but if you're going to work on it, I suggest using gettext:
- gettext is a full-featured i18n library
- everybody else uses it
- there already are *many* translating programs, but all use gettext .po files
- it supports language-specific plurals and all sorts of other features
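
For reference, a minimal sketch of what the gettext calls look like from C++. The domain name "rocksndiamonds" and the "./locale" directory are assumptions made up for this example, not something R'n'D defines. On glibc systems these functions are part of the C library; elsewhere linking against libintl may be needed.

```cpp
#include <clocale>
#include <libintl.h>
#include <string>

#define _(text) gettext(text)

// Hypothetical setup: the domain name and the catalog directory are
// assumptions for this sketch.
void init_i18n()
{
    std::setlocale(LC_ALL, "");                    // use the user's locale
    bindtextdomain("rocksndiamonds", "./locale");  // where the .mo files live
    textdomain("rocksndiamonds");
}

// ngettext() picks the singular or plural form according to the plural rules
// of the active catalog. Without a catalog it falls back to the English
// rule: the first form for n == 1, the second form otherwise.
std::string level_count_label(unsigned long n)
{
    return std::to_string(n) + " " + ngettext("level", "levels", n);
}
```

When no catalog is found, `_()` simply returns the original string, so the program keeps working in English.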

Post by **Francesco** » Tue Aug 28, 2007 8:21 am

Well, using another program to do the work would take away all of my fun - I would create such a program not because we need it, but because it would be good practice for my C++ studies... well, I could use the same ".po" file format, though.

In any case, I'll wait for Holger's "OK".

Post by **Tomi** » Tue Aug 28, 2007 9:00 am

By all means, use whatever translating program you wish! But the internal i18n system used in the game should be gettext and .po files, just to be future-proof. That's all I wanted to say.

Post by **Francesco** » Tue Aug 28, 2007 11:18 am

I've had a look at the PO format and at gettext.

Of course, that would be the most complete and most appropriate way to do such an i18n, but I don't know if that would add work on Holger's part... parsing a file that only has pairs of strings is far faster and easier than parsing a PO file.

On my part, my program could dump the result in both ways ("plain string pairs" and "PO format").

Again, the final word is up to the chief.
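
Dumping one pair in PO format is not much work. Here is a sketch with hypothetical helper names (`po_escape`, `po_entry`), covering only the basic msgid/msgstr entry and the most common escapes:

```cpp
#include <string>

// Escapes the characters that need a backslash inside a PO string.
std::string po_escape(const std::string& text)
{
    std::string result;
    for (char ch : text) {
        switch (ch) {
            case '"':  result += "\\\""; break;
            case '\\': result += "\\\\"; break;
            case '\n': result += "\\n";  break;
            case '\t': result += "\\t";  break;
            default:   result += ch;     break;
        }
    }
    return result;
}

// Builds one PO entry: a msgid/msgstr pair followed by a blank line.
std::string po_entry(const std::string& original,
                     const std::string& translation)
{
    return "msgid \"" + po_escape(original) + "\"\n"
         + "msgstr \"" + po_escape(translation) + "\"\n\n";
}
```

Writing the returned text for every pair to a file gives the body of a .po file; a complete file also starts with a header entry, which is left out here.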

Post by **Holger** » Fri Aug 31, 2007 5:07 pm

I think that Tomi is right in his points in favor of using the well established "gettext" approach, as there are many problems already discussed and solved (so we don't have to re-invent the wheel for R'n'D).

Replacing the strings to be translated, "string", with _("string") and those which should not be translated with N_("string") in the R'n'D code would be fairly easy as a first step, as this wouldn't hurt at all. Which steps have to be done then -- and where it is useful to write some nice little tools to help in the process and where it is better to use existing tools -- is currently beyond my knowledge, as I only had a very quick look at gettext years ago.

In fact, I also thought about "quickly hacking my own language system" just for the fun of doing it, but first I wanted to read the large "gettext" manual (and then probably use that existing system).

So the conclusion is that I have no real advice on which programs would be handy to write (like automatic translators), and which are better not written again, but instead replaced by existing software... :-/

And as Francesco wrote, there may indeed be some aspects that are R'n'D specific, like strings having a length limitation due to a fixed screen layout -- this may be a part of the i18n of R'n'D where the "gettext" system (especially automatic translation) may fail. Currently I simply don't know enough about it... :-/

Post by **Francesco** » Fri Aug 31, 2007 6:07 pm

Very well, apart from the issues already raised and those that will arise when everything has to be put together, I see that there is no argument against starting such an i18n.

Although gettext can be large or complicated, the format it uses is quite simple. In practice it isn't much different from the ".conf" format: a pretty human-readable format.

So my conclusion is that I'll start working on it. I'll share my progress in this thread.