UTF8Decode does not accept accented ANSI characters

Post by **FinderX** » 2010-05-24 09:08:44

Hi,
I'm just reporting this for the record.
When the string passed to the function UTF8Decode(str: string): string contains accented ANSI (extended ASCII) characters like [á é ò ù ä ï ´ `], it returns an empty string ('').
My script uses it like this:

Code:

str := 'I’m seeing ChäoS;HEAd';
str := HTMLDecode(str); // 'I'm seeing ChäoS;HEAd'
str := UTF8Decode(str); // '' empty string

Post by **antp** » 2010-05-24 20:58:20

Hi,
This is not a problem with the program: the string that you provide to the function is not valid UTF8... What are you actually trying to do with that function?

Post by **FinderX** » 2010-05-25 04:06:18

HTMLDecode is a procedure, not a function, my bad...
However, what type of string does HTMLDecode return?
Or,
does that mean that if all the characters are in ANSI, UTF8Decode considers the string invalid?

Post by **antp** » 2010-05-26 20:42:12

HTMLDecode returns ANSI indeed, not Unicode.
UTF8Decode is for webpages encoded in UTF8, which is not the case when HTML entities are used, as here. If there are both UTF8 & HTML entities, you must first call UTF8Decode, and then HTMLDecode.
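
A minimal sketch of that order, for illustration only. It assumes the script file is saved as ANSI, that `Value` holds raw UTF8 bytes plus an HTML entity, and that `ShowMessage` is available in the script environment; the program and variable names are hypothetical.

```pascal
program DecodeOrder;
var
  Value: string;
begin
  // Hypothetical input: UTF8 bytes (shown here as they look in ANSI) plus an HTML entity
  Value := 'ChÃ¤oS &amp; WalkÃ¼re';
  // Step 1: convert the UTF8 bytes to ANSI while the string is still valid UTF8
  Value := UTF8Decode(Value);
  // Step 2: HTMLDecode is a procedure and converts the entities in place
  HTMLDecode(Value);
  ShowMessage(Value);
end.
```

Calling the two routines in the opposite order would hand ANSI characters to UTF8Decode, which is the case that returns an empty string.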

Post by **FinderX** » 2010-05-27 22:20:54

Sir, you are absolutely f... correct!

It's working: changing the order resolved 99% of the issues.

The only thing that still resists is when it finds one of these strings:

&hellip; // ...
&rsquo; // ´
&rdquo; // ”
&ldquo; // “

I don't know whether these are not UTF8 or not valid for HTMLDecode, but when they are parsed the output is an 'invisible' character (Notepad++ marks it as a US control character).

Thanks! Now I have to rebuild my script.
Stay tuned.

Post by **antp** » 2010-05-28 08:34:06

That one is a bug, indeed.

It was previously identified, but I haven't yet released a version that corrects HTMLDecode for these cases.
Meanwhile, you have to use the StringReplace function to replace them with the corresponding characters before calling HTMLDecode.
You can find them there:
http://en.wikipedia.org/wiki/Quotation_ ... _languages
"&hellip" is "…"
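
A rough sketch of that workaround, not a definitive fix. It assumes the entities involved are `&hellip;`, `&rsquo;`, `&ldquo;` and `&rdquo;` (the standard names for the ellipsis and curly quotes), that the script's StringReplace takes the string, the text to find and the replacement, and that `ShowMessage` is available. It substitutes plain ASCII stand-ins rather than the Windows-1252 characters, which is a choice made here for portability and not part of the advice above.

```pascal
program EntityWorkaround;
var
  Value: string;
begin
  // Hypothetical input containing the entities that HTMLDecode mishandles
  Value := '&ldquo;Wait&hellip;&rdquo; she said, &ldquo;it&rsquo;s here&rdquo;';
  // Replace the problem entities first, using ASCII stand-ins (assumption)
  Value := StringReplace(Value, '&hellip;', '...');
  Value := StringReplace(Value, '&rsquo;', '''');
  Value := StringReplace(Value, '&ldquo;', '"');
  Value := StringReplace(Value, '&rdquo;', '"');
  // The remaining entities can then be decoded in place
  HTMLDecode(Value);
  ShowMessage(Value);
end.
```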

Post by **fulvio53s03** » 2010-05-29 08:17:46

antp wrote: That one is a bug, indeed. It was previously identified, but I haven't yet released a version that corrects HTMLDecode for these cases.
Meanwhile, you have to use the StringReplace function to replace them with the corresponding characters before calling HTMLDecode.
You can find them there:
http://en.wikipedia.org/wiki/Quotation_ ... _languages
"&hellip" is "…"

Maybe we can find more information at: http://www.trans4mind.com/personal_deve ... acters.htm
Bye

Post by **antp** » 2010-05-29 15:41:40

There are lots of special characters, but I only have to handle those that exist in the Windows character set. For example, “ ” should be converted because they exist in the windows-1252 charset, but π (&pi on your page) is useless: it does not exist in the standard set, so I can't replace it with the actual character.

Post by **FinderX** » 2010-06-01 02:28:02

You'll be mad, but I have to report that it's not only HTMLDecode:

UTF8Decode has a bug.

I searched and did not find any report...
That's why I'm writing this, so please read on...

The idea is that the entire text is translated from UTF8 to Windows-1252, as the original author rightly said.
But what happens when the text mixes several encodings, such as UTF8 + Win-1252 + ...?
Sometimes it is impossible to know which encoding is in use.

The function UTF8Decode translates what is in UTF8, GOOD, and when something is not UTF8 it puts a "?", OK.
But when the text contains extended ASCII it does not cope, and instead of placing a "?" it returns an empty string! Wrong...
This is very serious. In my opinion, the function should translate or replace only the offending character, as StringReplace() does.
Putting a "?" would be acceptable, but destroying the whole string is not...
(well... 'destroy' is a manner of speaking...)
An example to illustrate:

Code:

str := 'ChÃ¤oS WalkÃ¼re SeisÃ´ Hen';
str := UTF8Decode(str); //'ChäoS Walküre Seisô Hen'
str := UTF8Decode(str); // ''

This little script makes it clear that UTF8Decode does not like extended ASCII in the range 128-255:

Code:

program NewScript;
var
  code,decode,lastDecode,dummy:string;
  totalErrors,totalDecodes:string;
  i,nError,nSuccess:integer;
begin
  PickListClear;
  PickTreeClear;
  PickTreeAdd('Success Decodes','');
  nError:=0;
  nSuccess:=0;
  
  for i:=1 to 255 do begin   // ASCII (extended) range
    code:='&#'+IntToStr(i)+';';  // HTML number for HTMLDecode
    decode:=Copy(code,1,length(code));
    
    HTMLDecode(decode);
    lastDecode:=Copy(decode,1,length(decode));
    decode:=UTF8Decode(decode);
    
    if (decode = '') then begin  // If is empty then UTF8Decode Fails
      nError:=nError+1;
      PickListAdd(IntToStr(nError)+') '+code+' = '+lastDecode);
    end else begin
      nSuccess:=nSuccess+1;
      PickTreeAdd(IntToStr(nSuccess)+') '+code+' = '+lastDecode+' -> '+decode,'dummy');
    end;
  end;
  PickListExec('Failure Decodes',dummy);
  PickTreeExec(dummy);
  PickTreeClear;
  PickListClear;
end.

Again, I'm not trying to be proven right or to nitpick, but for me this is a serious problem...
Because of it, scripts need extra code and extra checks around these calls.

I just hope this helps the authors of the program so that these problems are solved in future versions.

Ant Movie Catalog is GREAT, and I hope this helps somewhat.
cya

Post by **antp** » 2010-06-01 14:23:27

No, there is no bug in that function.

UTF8Decode returns an empty string when you ask it to decode an invalid string, that's all.
If you give it ANSI (extended ASCII), it is normal that it fails.

If the page is in UTF8 Unicode, you should decode all values ONCE (why call it twice on the same value?) or decode the whole page before processing it.
Otherwise, if the page is not UTF8, the function should not be used.
That's all.

UTF8 is a way of storing Unicode in the extended ASCII (128-255) positions, in the same way that all character sets store extra characters in these positions.

That function is built into Delphi and works that way; I won't change it for the mysterious things you are trying to do.

If you really want to be able to call it with anything, you can do this:

Code:

function TryUTF8Decode(S: string): string;
begin
  Result := UTF8Decode(S);
  // An empty result means S was not valid UTF8: return it unchanged
  if Result = '' then
    Result := S;
end;

That way, if the string is not valid UTF8, it won't be modified.
If you have a mix of UTF8 and other things in the string, there are some serious problems either in your code or on the site.
If you do not know what is in the string, that is strange: it should be the same everywhere on a page: either it is UTF8 or it is not.
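
For illustration, a short sketch of how such a wrapper might be used with the double-decode example from earlier in the thread. It assumes the script file is saved as ANSI and that `ShowMessage` is available in the script environment; the program name is hypothetical.

```pascal
program TryDecodeDemo;
var
  Value: string;

// Wrapper from the post above: falls back to the input when decoding fails
function TryUTF8Decode(S: string): string;
begin
  Result := UTF8Decode(S);
  if Result = '' then
    Result := S;
end;

begin
  Value := 'ChÃ¤oS WalkÃ¼re SeisÃ´ Hen';
  // First call: the input is valid UTF8, so it is decoded to ANSI
  Value := TryUTF8Decode(Value);
  // Second call: the ANSI text is not valid UTF8, so it comes back unchanged
  Value := TryUTF8Decode(Value);
  ShowMessage(Value);
end.
```

The wrapper only hides the empty result. It does not repair a string that genuinely mixes encodings.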