Internationalization concerns


Ayin
Inactive Developer
Posts: 294
Joined: March 30th, 2004, 4:45 pm
Location: Nîmes, France

Internationalization concerns

Post by Ayin »

Today, most languages supported by Wesnoth use the Latin-1 (ISO-8859-1) charset. Text in the .cfg files (and in the source code) is assumed to be, and is, encoded in Latin-1.
Now, there are two languages that do not use the Latin-1 charset: Slovak and Polish (and there is also French, which really should use Latin-0, i.e. ISO-8859-15, instead of Latin-1). The 8-bit charset used for those two is (if I am not mistaken) ISO-8859-2. Wesnoth (and libSDL_ttf, which handles the rendering of fonts) does not handle 8-bit encodings other than Latin-1; for non-Latin-1 charsets, UTF-8 encoding is used.
To achieve that, an option, "encoding", may be present in translation files. If it is set to "utf8", the font-rendering functions (and the text boxes) assume that the internal text strings are encoded in UTF-8.

The problem is that this option applies to every text string, even those originally encoded using the Latin-1 character set. This causes problems (and funny characters, or squares, being displayed) when those include accented characters, as done in the credits, or in some WML files (elven names), or in the "language selection" menu.
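To make the failure concrete, here is a small Python sketch (not Wesnoth code; the sample string is simply the place name from the poster's profile) showing what happens when Latin-1 bytes are handed to something that expects UTF-8:

```python
# Illustration only: this is not how Wesnoth itself decodes text.
name = "Nîmes"
latin1_bytes = name.encode("latin-1")  # the accented letter is the single byte 0xEE
utf8_bytes = name.encode("utf-8")      # the accented letter is the two bytes 0xC3 0xAE

# A strict UTF-8 decoder rejects the Latin-1 bytes: 0xEE announces a
# multi-byte sequence, but the byte that follows is plain ASCII.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# A lenient decoder substitutes U+FFFD, which fonts usually draw as a
# square or a question mark.
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(decoded_ok, repr(lenient))
```

The same bytes decode fine as Latin-1, which is why the problem only shows up once "encoding" is set to "utf8".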

The simplest way to fix this problem is to make all strings, whether in WML files or in the code, UTF-8-encoded. The alternatives (XML-style encoding declarations, C-style wide-char string prefixes, honouring the encoding only inside [translation] tags, getting rid of all those funny languages) are not, IMO, worth the effort.
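A one-off conversion of the existing Latin-1 data could look roughly like the Python sketch below. The function name is made up for illustration, and the sketch assumes the input really is Latin-1; running it on data that is already UTF-8 would double-encode the accented characters.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (hypothetical helper, not part of Wesnoth).

    Every byte value is a valid Latin-1 character, so the decode step
    cannot fail; it also cannot tell whether the input was Latin-1 at all.
    """
    return data.decode("latin-1").encode("utf-8")


# Pure-ASCII content comes out unchanged; accented letters grow to two bytes.
print(latin1_to_utf8(b"id=intro_1"))
print(latin1_to_utf8(b"N\xeemes"))
```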

If this were implemented, it would solve the encoding problem. It would not solve the font problem (until we find a good, Free and complete Unicode TrueType font, that is).

Comments? : )
Dave
Founding Developer
Posts: 7071
Joined: August 17th, 2003, 5:07 am
Location: Seattle

Post by Dave »

I concur with the suggestion.

I would add that I think we should ONLY have UTF-8-encoded strings inside quoted WML values. That is, the WML file as a whole should NOT be considered a UTF-8-encoded file, only its values, and values that use UTF-8 should be inside quotes.

Everything else in a WML file should stick to the 128 characters of 7-bit ASCII (code points 0 to 127), and thus trivially be a subset of UTF-8.
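As a rough illustration of that rule, the Python sketch below flags any non-ASCII byte that sits outside double quotes. It is a simplification with a made-up function name, not the real WML parser: it works on a single line and ignores multi-line values, comments and macros. Scanning byte by byte is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so the quote byte 0x22 can never appear inside one.

```python
def non_ascii_outside_quotes(line: bytes) -> bool:
    """Return True if a byte >= 0x80 appears outside double quotes.

    Hypothetical checker for illustration only.
    """
    in_quotes = False
    for b in line:
        if b == 0x22:  # the '"' character toggles quoted state
            in_quotes = not in_quotes
        elif b >= 0x80 and not in_quotes:
            return True
    return False


print(non_ascii_outside_quotes('message="Nîmes"'.encode("utf-8")))
print(non_ascii_outside_quotes("description=Nîmes".encode("utf-8")))
```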

David
“At Gambling, the deadly sin is to mistake bad play for bad luck.” -- Ian Fleming
Ayin
Inactive Developer
Posts: 294
Joined: March 30th, 2004, 4:45 pm
Location: Nîmes, France

Post by Ayin »

I agree with restricting WML keys to plain ASCII. However, why restrict UTF-8 text to quoted values? I feel it would be rather pointless to add a special case in the code just to forbid non-ASCII characters in non-quoted values.

I think there's something I missed here :)
Viliam
Translator
Posts: 1341
Joined: January 30th, 2004, 11:07 am
Location: Bratislava, Slovakia

Post by Viliam »

Ayin wrote:I agree with restricting WML keys to plain ASCII. However, why restrict UTF-8 text to quoted values ? I feel it would be rather useless to add a special case in the code, just to forbid non-ASCII characters in non-quoted values.
As I understand WML, it basically contains: tags (in square brackets), keys (to the left of the "=" sign), identifiers (e.g. id=intro_1), numbers (e.g. delay=4000), and strings (e.g. message="Then fight we shall."). So strings can contain any character; everything else is ASCII only.

File names IMHO should not contain non-ASCII characters; just to prevent possible errors on some computers. But technically, they are strings, too.

Some strings appear unquoted, like names of units in scenarios (e.g. description=Mokho Kimer). My desire to rename Mr. Kimer is not very strong yet, but perhaps it would be good to make it possible. Two reasons:

1) Translating to non-Latin languages: though "Mokho Kimer" is just a random string without meaning, maybe Russian players would prefer to have it written in Cyrillic, or Japanese players in katakana. But maybe such things would be better done by an algorithm.

2) In Latin-script languages that use accented letters, randomly inserting accents into these names would create a feeling of familiarity. It is something very subtle that goes like this: the string "Mokho Kimer" is (taken separately) no more English than, e.g., Slovak. But when you see hundreds of such strings, you know that the author was not Slovak, because a Slovak author would sometimes use an accented letter. However, this is probably too subtle to be worth the work.

Ayin, what was your example of non-quoted value that should contain non-ASCII characters?
Dave
Founding Developer
Posts: 7071
Joined: August 17th, 2003, 5:07 am
Location: Seattle

Post by Dave »

Ayin wrote:I agree with restricting WML keys to plain ASCII. However, why restrict UTF-8 text to quoted values ? I feel it would be rather useless to add a special case in the code, just to forbid non-ASCII characters in non-quoted values.

I think there's something I missed here :)
Well for one thing, WML strips off whitespace from either end of a non-quoted value. If it was UTF-8, it would have to work out how to strip whitespace from a UTF-8 string.

David
Viliam
Translator
Posts: 1341
Joined: January 30th, 2004, 11:07 am
Location: Bratislava, Slovakia

UTF-8 space

Post by Viliam »

Dave wrote:If it was UTF-8, it would have to work out how to strip whitespace from a UTF-8 string.
Use the same algorithm.

Space in ASCII is byte 0x20. Space in UTF-8 is byte 0x20. Every time there is a byte 0x20 in UTF-8, it is a space, because all bytes of a multi-byte sequence are 0x80 or above.

There are also other space-like characters. But if you do not handle them specially, probably no one will notice. The one exception is the sequence 0xEF 0xBB 0xBF at the very beginning of a file: this is the UTF-8 encoding of U+FEFF (the byte order mark, formally a zero-width no-break space), and it should be ignored.
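Sketched in Python (the helper names are invented for illustration; the engine itself is not written in Python), the byte-level approach could look like this:

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_bom(data: bytes) -> bytes:
    """Drop a leading byte order mark, if present (hypothetical helper)."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


def strip_spaces(value: bytes) -> bytes:
    """Strip 0x20 bytes from both ends without decoding the value.

    Only the plain space is handled here; other space-like characters
    are deliberately left alone, as suggested above.
    """
    return value.strip(b" ")


raw = "  Mokho Kimér  ".encode("utf-8")
print(strip_spaces(raw).decode("utf-8"))
print(strip_bom(UTF8_BOM + b"id=intro_1"))
```

The stripping never looks inside a multi-byte sequence, so it needs no knowledge of UTF-8 beyond the fact that 0x20 is always a space.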
Ayin
Inactive Developer
Posts: 294
Joined: March 30th, 2004, 4:45 pm
Location: Nîmes, France

Post by Ayin »

Viliam wrote:Ayin, what was your example of non-quoted value that should contain non-ASCII characters?
Mhh, I have no example; my point was rather: why bother checking for 7-bit ASCII values when loading config files? Just treat them as streams of bytes.