tool for re-encoding addon texts into utf-8

Nobun
Code Contributor
Posts: 129
Joined: May 4th, 2009, 9:28 pm
Location: Italy

tool for re-encoding addon texts into utf-8

Post by Nobun »

Hi. I developed for myself a tiny, experimental Python 3 utility which should be able to create a clone of a Wesnoth addon in which the text files are re-encoded in UTF-8.

This utility could be helpful if one or more files contained in an addon (for example .cfg files) are NOT encoded in 'utf-8' format as expected.

usage (under linux/mac):

Code: Select all

./wes2utf8 --addondir=path/to/original/addon/dir --dest=path/to/converted/version --filters cfg lua txt pot po
Under Windows, instead, you have to call your Python 3 interpreter before 'wes2utf8', something like this:

Code: Select all

/path/to/python3/python3 path/to/wes2utf8 --addondir=path/to/original/addon/dir --dest=path/to/converted/version --filters cfg lua txt pot po
- The utility will create a copy of the original addon in path/to/converted/version. The destination directory must not exist before launching the utility (it will be created by the utility itself).
- The '--filters' parameter can be omitted. It accepts a list of file extensions and, if not specified, defaults to "cfg lua txt pot po" (without quotes), so the examples shown above could have been written without the --filters parameter at all. The --filters parameter tells the script which types of files must be re-encoded; all other files not matching the filter (for example .ogg, .png) are simply copied into the destination directory (path/to/converted/version). Values passed to '--filters' should NOT start with a dot. A rough sketch of this copy-or-convert pass is shown below.
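Just to give an idea of the overall pass (a minimal sketch, not my actual wes2utf8 code; the convert_file helper is only assumed here and is sketched further down in the technical details):

Code: Select all

# Sketch only: walk the addon directory, re-encode files whose extension
# matches the filter list, and copy everything else untouched.
import os
import shutil

def clone_addon(addondir, dest, filters=("cfg", "lua", "txt", "pot", "po")):
    os.makedirs(dest)  # fails if the destination already exists, as described above
    for root, dirs, files in os.walk(addondir):
        outdir = os.path.join(dest, os.path.relpath(root, addondir))
        os.makedirs(outdir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(outdir, name)
            ext = os.path.splitext(name)[1].lstrip(".")
            if ext in filters:
                convert_file(src, dst)   # assumed helper: detect encoding, rewrite as UTF-8
            else:
                shutil.copy2(src, dst)   # binary files (.png, .ogg, ...) are copied as-is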

-----------------------------------------------

Technical details.
For every file matching the filter, the script tries to read it using a specific text encoding. If it can't be read successfully with the first text encoding, it tries the second one, and so on, until it finds a suitable text encoding.
The first 'working' text encoding is assumed to be the right one.
The list of codecs and the order of the codecs that the script will try to use are listed in the source code of wes2utf8.py at lines 72-78.

If the source file is already 'utf-8', it is simply copied to the destination.
If the source file is NOT 'utf-8' and it is a .po or .pot file, the script also rewrites the encoding information in the header, ensuring that the po(t) header now says that 'utf-8' is used.
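Roughly, the core of this step looks like the following (a simplified sketch, not the real wes2utf8 code: the encoding list here is only a short illustrative subset, and the po(t) header rewrite is reduced to a simple regex):

Code: Select all

# Simplified sketch of the per-file conversion (assumed names, shortened encoding list).
import re

CANDIDATE_ENCODINGS = ["utf-8", "latin_1", "iso8859_2", "iso8859_3"]

def detect_encoding(path):
    for enc in CANDIDATE_ENCODINGS:
        try:
            with open(path, encoding=enc) as fp:
                fp.read()
            return enc                            # first encoding that decodes cleanly wins
        except (UnicodeDecodeError, UnicodeError):
            continue
    raise SystemExit("no suitable encoding found for " + path)

def convert_file(src, dst):
    enc = detect_encoding(src)
    with open(src, encoding=enc) as fp:
        text = fp.read()
    if enc != "utf-8" and src.endswith((".po", ".pot")):
        # make the po(t) header declare the new encoding
        text = re.sub(r"charset=[-\w]+", "charset=UTF-8", text, count=1)
    with open(dst, "w", encoding="utf-8") as fp:
        fp.write(text)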

-----------------------------------------------

Obtaining the script:
The script can be downloaded from my github account at:

https://github.com/AncientLich/wesnoth-utf8

Let me know if it works for you, or if you encounter any problems.
Note: if the script turns out to be useful and works decently in a reasonable number of tests, I will also consider making a version with a GUI.
Tad_Carlucci
Inactive Developer
Posts: 503
Joined: April 24th, 2016, 4:18 pm

Re: tool for re-encoding addon texts into utf-8

Post by Tad_Carlucci »

While this script can remove the errors, it cannot ensure the correct conversion. It is very much "Use at your own risk."

Expect errors.

Be sure to carefully check each character of each affected file.

For more on this subject, and why what it set out to do is impossible even though it **appears** to work, see the accepted answer to this question:
http://stackoverflow.com/questions/9083 ... -text-file

and

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
http://www.joelonsoftware.com/articles/Unicode.html

You have been warned.
I forked real life and now I'm getting merge conflicts.
iceiceice
Posts: 1056
Joined: August 23rd, 2013, 2:10 am

Re: tool for re-encoding addon texts into utf-8

Post by iceiceice »

Nobun:

Is this actually a common problem?

I guess I would hope that the wesnoth addon server (`campaignd`) would reject addons that aren't UTF-8, and tell people to try using a different text editor.

Further I guess you could expect wesnoth itself to reject WML files that aren't UTF-8 to avoid problems, but I doubt that we do that right now.

Also: there is already a Unix utility called `isutf8`, available in a Debian package; we use that in our Travis CI build to make sure the mainline files are UTF-8. I feel like it's simpler to just flag bad files rather than try to automatically correct them -- usually the bad files are created by a text editor which has been misconfigured (the user needs to select a different encoding) or is simply broken (Windows Notepad). A stand-alone converter tool is kind of reinventing the wheel IMO.
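A minimal check in that spirit is only a few lines of Python (just a sketch, not what isutf8 or our CI actually runs):

Code: Select all

# Flag files that are not valid UTF-8 instead of trying to convert them.
import sys

def is_utf8(path):
    try:
        with open(path, "rb") as fp:
            fp.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

if __name__ == "__main__":
    bad = [p for p in sys.argv[1:] if not is_utf8(p)]
    for p in bad:
        print("not valid UTF-8:", p)
    sys.exit(1 if bad else 0)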
Nobun
Code Contributor
Posts: 129
Joined: May 4th, 2009, 9:28 pm
Location: Italy

Re: tool for re-encoding addon texts into utf-8

Post by Nobun »

@Tad_Carlucci, @iceiceice: first of all, thank you for your replies. There are two reasons why I published this script I wrote for myself (in my personal case it worked fine, so it did what I needed):
1) The first reason is to receive feedback from more experienced programmers like you two, to understand whether the method I used can work (even if I am aware it cannot be a 'universal' solution). So I am very happy to read what you wrote and perhaps learn something new.
2) The second reason, less important, is to share the work I did for myself, which (if it works) could be helpful to other people.

-------------------------------------

@iceiceice: About `isutf8`: I didn't know about this utility; however, if I understood correctly, it only checks whether a text file uses an encoding other than UTF-8. In my personal situation I was certain that some files in my addon(*) were not UTF-8, and I wanted to avoid rewriting all the text files from scratch (some of them are very long). In my case (some files were 'latin_1') the script found the 'latin_1' files, and they should now be correctly encoded in 'utf-8'. During development I inserted a debug line which printed a comment with the original text encoding, and the encoding of those files was detected correctly.

@Tad_Carlucci: We also discussed this point on #wesnoth-dev, and I am very grateful for the time you spent there (on IRC) and here (in the forum) trying to explain to me all these things related to encoding.
But I have to admit that I'm not skilled enough to understand everything I read, so I come back to the first reason why I published the script: understanding whether the method I use to perform the task makes sense (and can work without too many bugs) or not.
This is why, in the first post of this topic, I tried (not sure how well I managed) to explain what my script actually does.

I started from the information I could understand from http://stackoverflow.com/questions/9083 ... -text-file (you also pointed me to that page on IRC).
What I understood is that browsers and some text editors have ways to heuristically guess what kind of encoding was used in a text file.
So I tried to find a simple way I could implement by myself to perform the same task (and in my case it worked).
The naive solution I used (for every filtered file) is composed of these parts:
- try to read the whole text file (line by line) using the 'utf-8' encoding. If a UnicodeDecodeError or UnicodeError exception is raised, then the text is surely not encoded in 'utf-8';
- try to read the whole text file (line by line) using the next encoding in the list (which can be seen at lines 72-78 of my source), moving on to the next one whenever an exception is raised;
- the first successful attempt (no exception raised) is assumed to give the right text encoding (if the script reaches the end of the encoding list without finding a suitable one, it quits with an error for the end user).

While the process itself is not safe (I am aware of that), because a wrong text encoding can sometimes read the text file without raising an error, I tried to follow criteria that reduce that risk:
1) I limited the list of encodings 'checked' by the script (the list of encodings supported by Python 3.2 - and I am using Python 3.4 - is way longer) to the most 'standard' ones, which happen to be used most frequently;
2) The encodings are not randomly ordered either, but follow a criterion. I started with the most 'strict' ones ('utf-8' and 'latin_1', for example), which will almost surely fail (raising UnicodeDecodeError or UnicodeError) when a different encoding is used, and only later tried some other encodings which should still be considered somewhat 'standard' ('iso8859_2', 'iso8859_3', and so on).
In that case I had no direct knowledge of those encodings, but I followed a naive idea (which is usually right in such situations): 'if there is a progressive ID value, maybe there is a reason, so follow that order'. This is why I used 'iso8859_2', followed by 'iso8859_3', and so on.

-------------------------------------------------------------------------------------
Tad_Carlucci wrote: (the script) cannot ensure the correct conversion. It is very much "Use at your own risk."
Expect errors.
This is also the reason why the script does not overwrite the original Wesnoth addon, but 'converts' it into another directory. This way, if something goes wrong, the original work is not lost.

Since I am not skilled enough to make a better script/program than the one I published, I was curious to know, over various runs in real situations where the script could be (theoretically) useful, how many times it worked and how many times it re-encoded a text file badly (I don't consider it an error when the script fails because it could not find a suitable encoding in the list of available ones; in that case the failure is the expected result).

I did a first real test with an addon which contained 'latin_1' text files; it ended successfully and I was able to perform the conversion I actually needed, so I reached my personal goal in writing the script.
But a single test is not enough to understand whether the script could also be useful to other people in other situations.

#####################################

Notes:
(*) the addon I was speaking of is a campaign I never published on the official site, which I started several years ago and have never completed (even though it is not long). Some files were encoded wrongly because, at the time, I didn't even know that text files could be written using different encodings.
Tad_Carlucci
Inactive Developer
Posts: 503
Joined: April 24th, 2016, 4:18 pm

Re: tool for re-encoding addon texts into utf-8

Post by Tad_Carlucci »

It's not a question of skill. What you're looking to achieve is provably impossible.

If the person needing to convert from a codepage to utf-8 knows the codepage, and can tell your program that codepage, then, yes, your program will do it correctly. But, in that case, it would probably be faster to simply open the file in their text editor and save it as utf-8.
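For example, if you already know the file is ISO-8859-1, the whole conversion is a couple of lines of Python (file names here are just placeholders):

Code: Select all

# Explicit conversion when the source codepage is known up front (latin-1 assumed).
with open("scenario.cfg", encoding="latin-1") as src:
    text = src.read()
with open("scenario-utf8.cfg", "w", encoding="utf-8") as dst:
    dst.write(text)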

Otherwise, there are any number of cases where your program will see a character, recognize it is a valid character in that codepage, recognize that it has a Unicode codepoint, and convert it. Even if you repeat the process for every character in the file, and see no errors, you could still be using the wrong codepage and, thus, converting to the wrong Unicode codepoint for output encoded in utf-8.

Just because every character you see in the file happens to be valid in a given codepage does not mean that it is the correct codepage.
I forked real life and now I'm getting merge conflicts.
pauxlo
Posts: 1047
Joined: September 19th, 2006, 8:54 pm

Re: tool for re-encoding addon texts into utf-8

Post by pauxlo »

Nobun wrote: While the process itself is not safe (I am aware of that), because a wrong text encoding can sometimes read the text file without raising an error, I tried to follow criteria that reduce that risk:
1) I limited the list of encodings 'checked' by the script (the list of encodings supported by Python 3.2 - and I am using Python 3.4 - is way longer) to the most 'standard' ones, which happen to be used most frequently;
2) The encodings are not randomly ordered either, but follow a criterion. I started with the most 'strict' ones ('utf-8' and 'latin_1', for example), which will almost surely fail (raising UnicodeDecodeError or UnicodeError) when a different encoding is used, and only later tried some other encodings which should still be considered somewhat 'standard' ('iso8859_2', 'iso8859_3', and so on).
In that case I had no direct knowledge of those encodings, but I followed a naive idea (which is usually right in such situations): 'if there is a progressive ID value, maybe there is a reason, so follow that order'. This is why I used 'iso8859_2', followed by 'iso8859_3', and so on.
Did you test this with some files encoded in ISO-8859-2 or ISO-8859-3 (with characters missing in Latin-1)?
The problem you have here is that there is no way to automatically distinguish those from files encoded in ISO-8859-1 (= Latin-1), other than doing a spell check or other heuristics, since they all use the same bytes to (partially) mean different characters.
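For example (a quick check in a Python shell): the same byte decodes without an error in all three encodings, but it means a different character in each, so a try-decode loop cannot tell them apart.

Code: Select all

# The byte 0xB1 is valid in all three single-byte encodings, but means different things:
raw = b"\xb1"
print(raw.decode("latin_1"))    # '±' in ISO-8859-1
print(raw.decode("iso8859_2"))  # 'ą' in ISO-8859-2
print(raw.decode("iso8859_3"))  # 'ħ' in ISO-8859-3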