UTF8Decode() buggy?

Post by **yeti** » 2018-08-14 07:33:48

Hello,

While working on the filmstarts.de script, I sometimes run into weird problems with strings from web pages. The cause is UTF8Decode().
If a string contains a 4-byte UTF-8 character such as the flame emoji (0xF0 0x9F 0x94 0xA5 / https://www.compart.com/en/unicode/U+1F525), UTF8Decode() discards the whole string and returns ''.
I think this is a bug in AMC, not a feature. If a character can't be decoded, UTF8Decode() should replace it with ? or something similar.

Greets,
yeti
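
For reference, the four bytes quoted in the post above can be reproduced by hand. The following is a minimal Python sketch (Python is used only for illustration, it is not AMC script code, and `encode_4byte` is a hypothetical helper name) of how a code point above U+FFFF is split into a 4-byte UTF-8 sequence:

```python
def encode_4byte(code_point: int) -> bytes:
    """Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a 4-byte code point")
    return bytes([
        0xF0 | (code_point >> 18),           # lead byte: 11110xxx
        0x80 | ((code_point >> 12) & 0x3F),  # continuation byte: 10xxxxxx
        0x80 | ((code_point >> 6) & 0x3F),   # continuation byte: 10xxxxxx
        0x80 | (code_point & 0x3F),          # continuation byte: 10xxxxxx
    ])


flame = encode_4byte(0x1F525)
print(" ".join(f"{b:02X}" for b in flame))  # prints "F0 9F 94 A5"
```

Any character outside the Basic Multilingual Plane (emoji, for example) needs such a 4-byte sequence, so it is likely not only the flame that triggers the empty result.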

Post by **antp** » 2018-08-14 08:08:12

Hi,
The code behind that function dates from around 2001 (Delphi 7), so it may well be unable to handle some later Unicode specifications.
AMC calls Delphi's function directly, so I have no control over what happens in case of an error.

Post by **yeti** » 2018-08-14 10:00:27

Thanks.

So I have to use this dirty hack, which works for me at the moment:

Code:

// The original UTF8Decode() function has a bug: 4-byte UTF-8 characters
// break the decoding and the resulting string is empty.
// If that happens, this function replaces 4-byte UTF-8 characters
// (not all of them, only those with lead byte $F0, i.e. U+10000 to U+3FFFF)
// with ? and calls UTF8Decode() again.
function MyUTF8Decode(Text: String): String;
var
  Value: String;

begin
  Value := '';
  
  if Text <> '' then // Something to do?
  begin
    Value := UTF8Decode(Text);

    if Value = '' then
      // Replace most 4-Byte UTF-8 chars with ? and Decode again
      Value := UTF8Decode(RegExprSetReplace('\xF0...', Text, '?', False));
  end;
  
  Result := Value;
end;

Yeti

Post by **antp** » 2018-08-14 11:05:21

Thanks for the hack

I know that some Windows-1252 characters also cause problems, for example in FilmUp or another script that I forgot...

Post by **fulvio53s03** » 2018-08-14 14:10:24

antp wrote: 2018-08-14 11:05:21 Thanks for the hack
I know that also some Windows-1252 characters cause problems, for example in FilmUp or another one that I forgot...

FilmUP for sure... but we found a way to correct the problem (at least for some characters, didn't we? Or is this a different problem?)

function DecodePage(s: string): string;
begin // open with Notepad++: Encoding > Encode in ANSI
s := StringReplace(s, 'Â–', '-');
s := StringReplace(s, 'Â“', '"');
s := StringReplace(s, 'Â”', '"');
s := StringReplace(s, 'Â’', 'â€™');
s := StringReplace(s, 'l?', ('l' + apice));
s := UTF8Decode(s);
Result := s;
end;

where:

Apice = #39; //defined constant

Post by **antp** » 2018-08-14 16:21:57

Only problem is that it does not work for the auto-update because the "update script" also relies on the UTF8Decode function

But I have another idea for that, some day I'll test it.

Post by **yeti** » 2018-08-14 16:27:43

fulvio53s03 wrote: 2018-08-14 14:10:24 FilmUP surely... but we found a way to correct the problem (... at least with some chars.... isn't it? or is this a different problem ?)

Not the same problem, but similar. The first four are 2-byte UTF-8 sequences for characters >127.

First C2 96 = Windows-1252 150 (ALT 0150) = '–'
Second C2 93 = Windows-1252 147 (ALT 0147) = '“'
Third C2 94 = Windows-1252 148 (ALT 0148) = '”'
Fourth C2 92 = Windows-1252 146 (ALT 0146) = '’'

UTF8Decode() can't map these and replaces them with ?. The resulting string is not empty.
But if you put the above characters #150#147#148#146 plus some filler characters directly in a string and call UTF8Decode(), the result is an empty string.

The last one (l?) is something else. I don't know why UTF8Decode() has a problem with those characters.
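
The byte values discussed above can be checked outside AMC. A minimal Python sketch follows (illustrative only; Python's codecs are not Delphi's UTF8Decode(), so this shows the encodings involved rather than AMC's exact behaviour, and all variable names are made up for the example):

```python
# C2 96 is valid UTF-8, but it decodes to the control character U+0096,
# not to the en dash.
utf8_pair = bytes([0xC2, 0x96])
as_utf8 = utf8_pair.decode("utf-8")

# The single byte 0x96 (decimal 150) is an en dash only in Windows-1252.
cp1252_byte = bytes([0x96])
as_cp1252 = cp1252_byte.decode("cp1252")

# U+0096 has no Windows-1252 equivalent, so a lossy conversion yields '?'.
back_to_ansi = as_utf8.encode("cp1252", errors="replace")

# A bare 0x96 is a continuation byte without a lead byte: invalid UTF-8.
try:
    cp1252_byte.decode("utf-8")
    bare_byte_is_valid = True
except UnicodeDecodeError:
    bare_byte_is_valid = False

print(hex(ord(as_utf8)), hex(ord(as_cp1252)), back_to_ansi, bare_byte_is_valid)
```

This would be consistent with both observations: the C2 xx pairs are well-formed but land on characters the ANSI code page cannot represent, while raw Windows-1252 bytes are not well-formed UTF-8 at all.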

Post by **athe** » 2018-08-16 21:57:10

antp wrote: 2018-08-14 11:05:21 Thanks for the hack
I know that also some Windows-1252 characters cause problems, for example in FilmUp or another one that I forgot...

This problem occurs in filmweb(PL) with Polish characters.

antp wrote: 2018-08-14 16:21:57 Only problem is that it does not work for the auto-update because the "update script" also relies on the UTF8Decode function
But I have another idea for that, some day I'll test it.

We are waiting for a solution to this problem and hope that yours will be successful.

Post by **yeti** » 2018-08-17 05:07:04

The problematic code (line 261 and up) in UPDATE_SCRIPTS is:

Code:

	if CheckVersion(4,2,2) then
	begin
	  ScriptContents := UTF8Decode(ScriptContents);
	end;

I don't know why this is necessary. Under normal circumstances the scripts don't contain any UTF-8 characters. But UTF8Decode() tries to decode all characters >127, including Polish ones. I think everything will work if this code is removed.

Post by **yeti** » 2018-08-17 05:34:09

I should have tested it first.

It doesn't work, because the server delivers UTF-8 encoded scripts.

I'm thinking about it.

Post by **antp** » 2018-08-17 06:52:40

yeti wrote: 2018-08-17 05:34:09 Don't work, because the server delivers UTF-8 coded scripts.

It is due to the new version of Indy that I used for TLS 1.3: I had to choose a default encoding, and it seems to apply it to any text regardless of the original encoding.
I don't know why the Polish characters cannot be decoded correctly though; on a system using that character set it should work the way it works for West European accents on my system.

athe wrote: 2018-08-16 21:57:10 We are waiting for a solution to this problem. We wish you that your solution to our problem was successful.

Here is the solution used for the FilmUp script:
viewtopic.php?p=43868#p43868
But in the case of the Polish script, the errors are not confined to a single place: they also occur in texts in the header block, for example, and not just for a few special characters.

Post by **yeti** » 2018-08-17 11:01:20

Here is a quick and dirty solution for all scripts.

Use this instead of UTF8Decode() in UPDATE_SCRIPTS:

Code:

ScriptContents := UTF8ToAnsi(ScriptContents);

And add this function:

Code:

function UTF8ToAnsi(Text: string): string;
var
  TmpChar1, TmpChar2, TmpChar3: Char;
  UTF8Pos, TmpInt1, TmpInt2, TmpInt3: Integer;

begin
  UTF8Pos := 1;

  // Note: every pass searches from position 1 again, so a decoded
  // character that is itself in #192..#195 (e.g. À) is found and
  // decoded a second time.
  while UTF8Pos > 0 do
  begin
    // Find a lead byte of a 2-byte sequence (#192..#195)
    UTF8Pos := AnsiPosEx2(#192, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#193, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#194, Text, False, False, 1);
    if UTF8Pos = 0 then
      UTF8Pos := AnsiPosEx2(#195, Text, False, False, 1);

    if UTF8Pos > 0 then
    begin
      TmpChar1 := Copy(Text, UTF8Pos, 1);
      TmpChar2 := Copy(Text, UTF8Pos+1, 1);
      TmpInt1 := ord(TmpChar1);
      TmpInt2 := ord(TmpChar2);

      // Low 2 bits of the lead byte become the top 2 bits of the result,
      // low 6 bits of the second byte become the bottom 6 bits
      TmpInt1 := TmpInt1 and 3;
      TmpInt1 := TmpInt1 shl 6;
      TmpInt2 := TmpInt2 and $3F;
      TmpInt3 := TmpInt1 or TmpInt2;
      TmpChar3 := Chr(TmpInt3);
      // Replace the two bytes with the single decoded character
      Delete(Text, UTF8Pos, 2);
      Insert(TmpChar3, Text, UTF8Pos);
    end;
  end;
  Result := Text;
end;

But caution: it only handles 2-byte UTF-8 sequences, and only those that encode 8-bit ANSI characters.
Tested with Filmweb (PL) and Allocine (FR).

HTH

Post by **antp** » 2018-08-17 11:27:43

Thanks, I will check that.
Normally all scripts are encoded with the local code page, so I suppose it should work in all cases.

Post by **yeti** » 2018-08-17 12:43:55

Not quite. I tested all scripts in http://update.antp.be/amc/scripts/. Some contain a UTF-8 lead byte whose following byte isn't a valid UTF-8 continuation byte. Below is the updated function (if the second byte isn't a UTF-8 continuation byte, we do nothing). It looks better, too.

Now the only difference between the browser download and the AMC download is a CRLF at the end of the file after downloading via AMC. Is that a problem with GetPage() or TStringList?

Code:

function UTF8ToAnsi(Text: string): string;
var
  TmpChar1, TmpChar2, TmpChar3: Char;
  UTF8Pos, UTF8NextPos, TmpInt1, TmpInt2, TmpInt3, i: Integer;

begin
  // One pass over the text per possible lead byte (#192..#195)
  for i := 192 to 195 do
  begin
    UTF8Pos := 1;
    UTF8NextPos := 1;

    while UTF8Pos > 0 do
    begin
      UTF8Pos := AnsiPosEx2(Chr(i), Text, False, False, UTF8NextPos);

      if UTF8Pos > 0 then
      begin
        TmpChar1 := Copy(Text, UTF8Pos, 1);
        TmpChar2 := Copy(Text, UTF8Pos+1, 1);
        TmpInt1 := ord(TmpChar1);
        TmpInt2 := ord(TmpChar2);

        // Only decode if the second byte is a continuation byte (10xxxxxx)
        if (TmpInt2 and $C0) = $80 then
        begin
          // Low 2 bits of the lead byte plus low 6 bits of the second byte
          TmpInt1 := TmpInt1 and 3;
          TmpInt1 := TmpInt1 shl 6;
          TmpInt2 := TmpInt2 and $3F;
          TmpInt3 := TmpInt1 or TmpInt2;
          TmpChar3 := Chr(TmpInt3);
          // Replace the two bytes with the single decoded character
          Delete(Text, UTF8Pos, 2);
          Insert(TmpChar3, Text, UTF8Pos);
        end;
      end;
      // Continue after the current position so that a freshly decoded
      // character is not examined again in this pass
      UTF8NextPos := UTF8Pos + 1;
    end;
  end;
  Result := Text;
end;
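
The bit arithmetic in the function above can be checked in isolation. Below is a minimal Python sketch (illustration only, not AMC script code; `decode_2byte` is a hypothetical helper) that applies the same continuation-byte test and the same masks and shift to a single byte pair:

```python
def decode_2byte(lead: int, cont: int):
    """Return the 8-bit character code for a 2-byte UTF-8 pair, or None."""
    if not 0xC0 <= lead <= 0xC3:
        return None  # lead byte outside the range the function scans for
    if (cont & 0xC0) != 0x80:
        return None  # second byte is not 10xxxxxx: leave the text untouched
    # Low 2 bits of the lead byte, then low 6 bits of the continuation byte
    return ((lead & 0x03) << 6) | (cont & 0x3F)


print(hex(decode_2byte(0xC3, 0x80)))  # C3 80 is the UTF-8 pair for À
print(decode_2byte(0xC0, 0x3A))       # 3A is not a continuation byte
```

With valid input only the lead bytes C2 and C3 occur, and together they cover exactly the codes 0x80 to 0xFF, which is why the approach is limited to 8-bit characters.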

Post by **antp** » 2018-08-17 12:54:53

Probably related to TStringList, but not a big deal then.

Which scripts have non-UTF-8 characters? (And which characters?)
Are these some of the Windows-1252 specific ones? (like what we tried to handle in the FilmUp script)

Post by **yeti** » 2018-08-17 16:17:27

These ones:

hispashare (ES).ifs
dvdempire.ifs
IMDB.ifs
IMDB (Actor images).ifs
Moviecovers (FR).ifs
Comingsoon.it.ifs

e.g. Comingsoon.it.ifs:
C0 3A -> Value := stringReplace(Value, '>CURIOSITÀ:<', '>Curiosità:<'); //quiFS2017
C0 3A -> Page := stringReplace(Page, '>CURIOSITÀ:<', '>Curiosità:<');

C0 starts a UTF-8 sequence, but 3A is not a valid continuation byte. Other characters do conform to UTF-8, and the script has converted them.
Either the file is not UTF-8, or something goes wrong in GetPage or on the server. A file mixing UTF-8 and Windows-1252 is not correct.
The À has to be encoded as C3 80.
Nevertheless, all scripts are now identical 1:1 to the manually downloaded ones (except for the CRLF).

Post by **antp** » 2018-08-17 18:11:22

C0 3A is for "À:" in ANSI.
When I download http://update.antp.be/amc/scripts/Comingsoon.it.ifs via GetPage and do not decode it, I get the correct C3 80 for the À (= "Ã€" if viewed as ANSI).

What is the default language/character set on your system? Maybe there is a difference of behaviour
(on mine it is a mixture of English and French, with West-European character set as default then)
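
The two byte sequences in question can be reproduced outside AMC. A minimal Python sketch (illustrative only; it shows the encodings themselves, not the behaviour of GetPage, and the variable names are made up for the example):

```python
text = "\u00C0:"  # the two characters "À:"

utf8_bytes = text.encode("utf-8")    # what a UTF-8 encoded file contains
ansi_bytes = text.encode("cp1252")   # what an ANSI (Windows-1252) file contains

# UTF-8 bytes of À misread as Windows-1252 give the two-character mojibake
mojibake = "\u00C0".encode("utf-8").decode("cp1252")

print(" ".join(f"{b:02X}" for b in utf8_bytes))  # prints "C3 80 3A"
print(" ".join(f"{b:02X}" for b in ansi_bytes))  # prints "C0 3A"
print(mojibake)
```

Seeing C0 3A therefore suggests the text had already been converted to ANSI before it was inspected.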

Post by **yeti** » 2018-08-17 21:20:32

Oops... My fault, sorry. I think I tested this with files that were already decoded. Too many tests today.

Just one more test with all files in http://update.antp.be/amc/scripts/ -> UTF-8 encoding OK. No wrongly encoded characters. Phew...

My system is German with the Latin1/Windows-1252 charset.
Good Night.

Post by **athe** » 2018-08-17 23:27:38

When downloading the filmweb(PL) script via Update Scripts, everything is OK. Polish characters are preserved. Thank you very much on behalf of Polish AMC enthusiasts.

Post by **antp** » 2018-08-18 07:04:55

@Yeti > Thanks for the test

@Athe > You mean you tested with Yeti's fix of the "Update Script"?
