UTF8Decode() buggy?
Hello,
While working on the filmstarts.de script, I sometimes have weird problems with strings from web pages. The problem is UTF8Decode().
If a string contains a special UTF-8 character such as Flame (0xF0 0x9F 0x94 0xA5 / https://www.compart.com/en/unicode/U+1F525), UTF8Decode() discards the complete string and returns ''.
I think this is a bug in AMC, not a feature. If a char can't be decoded, UTF8Decode() should replace it with ? or something similar.
Greets,
yeti
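For reference, a minimal Python sketch (Python only because it can be run outside AMC) that decodes the four bytes quoted above by hand. It shows that F0 9F 94 A5 is a well-formed 4-byte sequence for U+1F525 and that the code point lies above U+FFFF, so it needs two UTF-16 code units. That an older decoder stumbles over exactly this class of characters is an assumption here, not something verified against the Delphi source; `decode_four_byte` is a hypothetical helper name.

```python
# The byte sequence from the post above.
flame = bytes([0xF0, 0x9F, 0x94, 0xA5])

def decode_four_byte(seq: bytes) -> int:
    """Return the code point of a single 4-byte UTF-8 sequence."""
    b0, b1, b2, b3 = seq
    # The lead byte 11110xxx carries 3 payload bits, each following byte 6.
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

code_point = decode_four_byte(flame)

# Number of 16-bit units needed to store the character in UTF-16.
utf16_units = len(flame.decode("utf-8").encode("utf-16-le")) // 2

print(hex(code_point))  # 0x1f525
print(utf16_units)      # 2
```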
Re: UTF8Decode() buggy?
Hi,
The code behind that function is from 2001 or thereabouts (Delphi 7), so it may be unable to handle some of the later Unicode specifications.
AMC calls Delphi's function directly, so I have no control over what happens in case of an error.

Re: UTF8Decode() buggy?
Thanks.
So I have to use this dirty hack:
Code: Select all
// The original UTF8Decode() function has the bug that 4-byte UTF-8 chars
// break the decoding and the resulting string is empty.
// If this happens, this function replaces the 4-byte UTF-8 chars
// (not all, only F0000000 to F0FFFFFF) with ? and
// calls UTF8Decode() again.
function MyUTF8Decode(Text: String): String;
var
  Value: String;
begin
  Value := '';
  if Text <> '' then // Something to do?
  begin
    Value := UTF8Decode(Text);
    if Value = '' then
      // Replace most 4-byte UTF-8 chars with ? and decode again
      Value := UTF8Decode(RegExprSetReplace('\xF0...', Text, '?', False));
  end;
  Result := Value;
end;
It works for me at the moment.
Yeti
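The same idea can be sketched in Python with a slightly stricter pattern: any lead byte F0..F4 followed by exactly three continuation bytes, rather than F0 followed by any three characters. This is an illustration under assumptions, not a drop-in replacement; whether AMC's RegExprSetReplace accepts the same character-class syntax has not been checked, and `replace_four_byte_sequences` is a hypothetical name.

```python
import re

# Lead byte F0..F4 followed by exactly three continuation bytes (10xxxxxx).
FOUR_BYTE_SEQ = re.compile(rb"[\xF0-\xF4][\x80-\xBF]{3}")

def replace_four_byte_sequences(data: bytes, placeholder: bytes = b"?") -> bytes:
    """Replace every 4-byte UTF-8 sequence in data with a placeholder."""
    return FOUR_BYTE_SEQ.sub(placeholder, data)

sample = "Hot \U0001F525 movie".encode("utf-8")
cleaned = replace_four_byte_sequences(sample)
print(cleaned.decode("utf-8"))  # Hot ? movie
```

Shorter sequences are left untouched, so accented characters survive the cleanup.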
Re: UTF8Decode() buggy?
Thanks for the hack 
I know that some Windows-1252 characters also cause problems, for example in FilmUp or another one that I forgot...

Re: UTF8Decode() buggy?
FilmUP surely... but we found a way to correct the problem (at least with some chars, didn't we? Or is this a different problem?)
Code: Select all
function DecodePage(s: string): string;
begin // open with Notepad++, Encoding > Encode in ANSI
  s := StringReplace(s, '–', '-');
  s := StringReplace(s, '“', '"');
  s := StringReplace(s, '”', '"');
  s := StringReplace(s, '’', '’');
  s := StringReplace(s, 'l?', ('l' + apice));
  s := UTF8Decode(s);
  Result := s;
end;
where:
Apice = #39; // defined constant

Re: UTF8Decode() buggy?
The only problem is that it does not work for the auto-update, because the "update script" also relies on the UTF8Decode function.
But I have another idea for that, some day I'll test it.

Re: UTF8Decode() buggy?
fulvio53s03 wrote: ↑2018-08-14 14:10:24
FilmUP surely... but we found a way to correct the problem (... at least with some chars.... isn't it? or is this a different problem ?)
Not the same problem, but similar. The first 4 are 2-byte UTF-8 chars >127.
First: C2 96 = Windows-1252 150 (ALT 0150) = '–'
Second: C2 93 = Windows-1252 147 (ALT 0147) = '“'
Third: C2 94 = Windows-1252 148 (ALT 0148) = '”'
Fourth: C2 92 = Windows-1252 146 (ALT 0146) = '’'
UTF8Decode() can't interpret these and replaces them with ?. The resulting string is not empty.
But if you put the above chars #150#147#148#146 plus some fill chars directly in a string and use UTF8Decode(), then the result is an empty string.
The last one (l?) is something else. I don't know why UTF8Decode() has a problem with those chars.
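A short Python sketch of the two cases described above: the 2-byte sequences are valid UTF-8 for the code points U+0096, U+0093, U+0094 and U+0092, which only turn into punctuation when their values are reinterpreted as Windows-1252 bytes, while the bare bytes 150/147/148/146 are not valid UTF-8 at all. The helper name `is_valid_utf8` is made up for this illustration.

```python
# The four 2-byte sequences from the list above.
raw = bytes([0xC2, 0x96, 0xC2, 0x93, 0xC2, 0x94, 0xC2, 0x92])

# Strict UTF-8 decoding yields the code points U+0096, U+0093, U+0094, U+0092.
code_points = [ord(ch) for ch in raw.decode("utf-8")]

# Reinterpreting those values as Windows-1252 bytes gives the punctuation.
as_cp1252 = bytes(code_points).decode("cp1252")

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The bare Windows-1252 bytes are lone continuation bytes in UTF-8 terms.
bare = bytes([150, 147, 148, 146]) + b"abc"

print(as_cp1252)            # –“”’
print(is_valid_utf8(raw))   # True
print(is_valid_utf8(bare))  # False
```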
Re: UTF8Decode() buggy?
This problem also occurs in filmweb (PL) with Polish characters.
We are waiting for a solution to this problem. We hope your solution to our problem will be successful.

Re: UTF8Decode() buggy?
The problematic code (line 261 and up) in UPDATE_SCRIPTS is
Code: Select all
if CheckVersion(4,2,2) then
begin
  ScriptContents := UTF8Decode(ScriptContents);
end;
I don't know why this is necessary. Under normal circumstances the scripts don't contain any UTF-8 chars. But UTF8Decode() tries to decode all chars >127, Polish chars too. I think everything will work well if this code is removed.
Re: UTF8Decode() buggy?
I should test it first
It doesn't work, because the server delivers UTF-8 encoded scripts.
I'm thinking about that.

Re: UTF8Decode() buggy?
It is due to the new version of Indy that I used for TLS 1.3: I had to choose a default encoding, and it seems that it applies that encoding to any text regardless of the original encoding.
I don't know why the Polish characters cannot be decoded correctly though; on a system using that character set it should work the same way it does for West European accents on my system.
Here is the solution used for the FilmUp script:
viewtopic.php?p=43868#p43868
But in the case of the Polish script, the errors are less localized: they also occur in the texts of the header block, for example, not just for a few special characters.
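One possible explanation for the Polish case, offered as an assumption rather than a confirmed diagnosis: if the script bytes are in Windows-1250 but each byte is treated as a Latin-1 character and re-encoded as UTF-8 during download, a correct UTF-8 decode yields the Latin-1 character, which may have no equivalent in the Polish ANSI code page. A minimal Python sketch of that scenario for the letter ł:

```python
# "ł" (U+0142) is the single byte 0xB3 in a Windows-1250 encoded script.
original = "\u0142".encode("cp1250")

# Assumed transport step: each byte is read as Latin-1 and sent as UTF-8.
on_the_wire = original.decode("latin-1").encode("utf-8")

# A correct UTF-8 decode gives U+00B3 (superscript three), not the Polish letter.
decoded = on_the_wire.decode("utf-8")

# That character does not exist in Windows-1250, so converting it yields '?'.
to_polish_ansi = decoded.encode("cp1250", errors="replace")

# Undoing only the byte-level wrapping restores the original byte.
unwrapped = decoded.encode("latin-1")

print(on_the_wire.hex(" "))  # c2 b3
print(to_polish_ansi)        # b'?'
print(unwrapped == original) # True
```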
Re: UTF8Decode() buggy?
Here is a quick and dirty solution for all scripts.
Use this instead of UTF8Decode() in UPDATE_SCRIPTS:
Code: Select all
ScriptContents := UTF8ToAnsi(ScriptContents);
And add this function:
Code: Select all
function UTF8ToAnsi(Text: string): string;
var
  TmpChar1, TmpChar2, TmpChar3: Char;
  UTF8Pos, TmpInt1, TmpInt2, TmpInt3: Integer;
begin
  UTF8Pos := 1;
  while UTF8Pos > 0 do
  begin
    // Find the lead byte of a 2-byte sequence (#192..#195)
    UTF8Pos := AnsiPosEx2(#192, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#193, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#194, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#195, Text, False, False, 1);
    if UTF8Pos > 0 then
    begin
      TmpChar1 := Copy(Text, UTF8Pos, 1);
      TmpChar2 := Copy(Text, UTF8Pos+1, 1);
      TmpInt1 := ord(TmpChar1);
      TmpInt2 := ord(TmpChar2);
      TmpInt1 := TmpInt1 and 3;   // keep the 2 payload bits of the lead byte
      TmpInt1 := TmpInt1 shl 6;
      TmpInt2 := TmpInt2 and $3F; // keep the 6 payload bits of the second byte
      TmpInt3 := TmpInt1 or TmpInt2;
      TmpChar3 := Chr(TmpInt3);
      // Replace the 2-byte sequence with the single decoded char
      Delete(Text, UTF8Pos, 2);
      Insert(TmpChar3, Text, UTF8Pos);
    end;
  end;
  Result := Text;
end;
But attention! It works only with 2-byte UTF-8 codes and only with encoded ANSI chars (8-bit).
Tested with Filmweb (PL) and Allocine (FR).
HTH
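The bit arithmetic in that function can be checked on two byte pairs that appear in this thread. The following Python sketch is only a worked example of the masking and shifting; `decode_two_byte` is a hypothetical helper, not part of AMC.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Combine a C0..C3 lead byte and its second byte into one 8-bit value."""
    # The lead byte contributes its two lowest bits, the second byte six bits.
    return ((lead & 0x03) << 6) | (cont & 0x3F)

print(hex(decode_two_byte(0xC3, 0x80)))  # 0xc0
print(hex(decode_two_byte(0xC2, 0x96)))  # 0x96
```

Because only two bits of the lead byte are kept, the result always fits in one byte. A lead byte above C3 (for example C5 82 for U+0142) encodes a code point above 255, which is why the function is limited to 8-bit ANSI chars.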
Re: UTF8Decode() buggy?
Thanks, I will check that.
Normally all scripts are encoded with the local code page, so I suppose it should work in all cases.
Re: UTF8Decode() buggy?
Not at all. I tested all scripts in http://update.antp.be/amc/scripts/. Some contain the first byte of a UTF-8 sequence, but the second byte does not conform to UTF-8. Below is the updated function (if the second char isn't a UTF-8 continuation byte, we do nothing). It looks better too
Now the only difference between the browser download and the AMC download is a CRLF at the end of the file after downloading via AMC. Is it a problem with GetPage() or TStringList?

Code: Select all
function UTF8ToAnsi(Text: string): string;
var
  TmpChar1, TmpChar2, TmpChar3: Char;
  UTF8Pos, UTF8NextPos, TmpInt1, TmpInt2, TmpInt3, i: Integer;
begin
  // One pass per possible lead byte (#192..#195)
  for i := 192 to 195 do
  begin
    UTF8Pos := 1;
    UTF8NextPos := 1;
    while UTF8Pos > 0 do
    begin
      UTF8Pos := AnsiPosEx2(Chr(i), Text, False, False, UTF8NextPos);
      if UTF8Pos > 0 then
      begin
        TmpChar1 := Copy(Text, UTF8Pos, 1);
        TmpChar2 := Copy(Text, UTF8Pos+1, 1);
        TmpInt1 := ord(TmpChar1);
        TmpInt2 := ord(TmpChar2);
        // Convert only if the second byte is a continuation byte (10xxxxxx)
        if (TmpInt2 and $C0) = $80 then
        begin
          TmpInt1 := TmpInt1 and 3;
          TmpInt1 := TmpInt1 shl 6;
          TmpInt2 := TmpInt2 and $3F;
          TmpInt3 := TmpInt1 or TmpInt2;
          TmpChar3 := Chr(TmpInt3);
          Delete(Text, UTF8Pos, 2);
          Insert(TmpChar3, Text, UTF8Pos);
        end;
      end;
      // Continue the search after the current position
      UTF8NextPos := UTF8Pos + 1;
    end;
  end;
  Result := Text;
end;
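For comparison, a rough Python sketch of the same technique, written as a single left-to-right pass instead of four passes (that restructuring is a choice made for this illustration, not how the function above works). A lead byte C0..C3 is collapsed only when it is followed by a continuation byte; everything else is copied unchanged. The name `utf8_to_ansi` is hypothetical.

```python
def utf8_to_ansi(data: bytes) -> bytes:
    """Collapse 2-byte UTF-8 sequences with lead bytes C0..C3 into single bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]
        has_next = i + 1 < len(data)
        # Convert only if the next byte is a continuation byte (10xxxxxx).
        if 0xC0 <= lead <= 0xC3 and has_next and (data[i + 1] & 0xC0) == 0x80:
            out.append(((lead & 0x03) << 6) | (data[i + 1] & 0x3F))
            i += 2
        else:
            out.append(lead)
            i += 1
    return bytes(out)

print(utf8_to_ansi(b"Curiosit\xc3\xa0"))  # b'Curiosit\xe0'
print(utf8_to_ansi(b"\xc0:"))             # b'\xc0:'
```

The second call shows the case handled by the added check: C0 followed by 3A is left alone instead of being merged into one wrong character.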
Re: UTF8Decode() buggy?
Probably related to TStringList, but not a big deal then.
Which scripts have non-UTF-8 characters? (And which characters?)
Are these some of the Windows-1252 specific ones? (like what we tried to handle in the FilmUp script)
Re: UTF8Decode() buggy?
These ones:
hispashare (ES).ifs
dvdempire.ifs
IMDB.ifs
IMDB (Actor images).ifs
Moviecovers (FR).ifs
Comingsoon.it.ifs
e.g Comingsoon.it.ifs:
C0 3A -> Value := stringReplace(Value, '>CURIOSITÀ:<', '>Curiosità:<'); //quiFS2017
C0 3A -> Page := stringReplace(Page, '>CURIOSITÀ:<', '>Curiosità:<');
C0 looks like the start of a 2-byte UTF-8 sequence, but 3A is not a valid second byte. The other chars conform to UTF-8 and the script has converted them.
Either the file is not UTF-8 or something goes wrong with GetPage or the server. A file that mixes UTF-8 and Windows-1252 is not correct.
The À has to be encoded as C3 80.
Nevertheless, all scripts are now 1:1 identical to the ones manually downloaded (except for the CRLF).
Re: UTF8Decode() buggy?

When I download http://update.antp.be/amc/scripts/Comingsoon.it.ifs via GetPage and do not decode it, I get the correct C3 80 for the À (= "À" if viewed as ANSI).
What is the default language/character set on your system? Maybe there is a difference in behaviour
(on mine it is a mixture of English and French, with the West European character set as default)
Re: UTF8Decode() buggy?
Oops... My fault. Sorry. I think I tested this with already decoded files. Too many tests today
Just one more test with all files in http://update.antp.be/amc/scripts/ -> UTF-8 encoding OK. No wrongly encoded chars. Phew...
My system is German with the Latin1/Windows-1252 charset.
Good Night.
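The mix-up is easy to reproduce. This small Python sketch (an illustration only, assuming a Windows-1252 system as stated above) shows that À followed by a colon is C3 80 3A in the UTF-8 file, becomes C0 3A after one decode, and that the decoded bytes are no longer valid UTF-8 when checked a second time.

```python
# "À:" as delivered by the server (UTF-8).
utf8_bytes = "\u00c0:".encode("utf-8")

# After one correct decode, stored again as Windows-1252 (ANSI).
decoded_once = utf8_bytes.decode("utf-8").encode("cp1252")

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(utf8_bytes.hex(" "))            # c3 80 3a
print(decoded_once.hex(" "))          # c0 3a
print(decodes_as_utf8(utf8_bytes))    # True
print(decodes_as_utf8(decoded_once))  # False
```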

Re: UTF8Decode() buggy?
When downloading the filmweb (PL) script via Update Scripts, everything is OK. Polish characters are preserved. Thank you very much on behalf of Polish AMC enthusiasts.
Re: UTF8Decode() buggy?
@Yeti > Thanks for the test 
@Athe > You mean you tested with Yeti's fix of the "Update Script"?
