function HTMLdecode() in script

Post by **J** » 2013-07-07 18:28:27

hi,

the variable "Value" contains some text including different HTML entities.

When I'm using HTMLDecode(Value) and Value includes

–

then the rest of the text vanishes.

Is this an error in the function or am I doing something wrong?

Post by **antp** » 2013-07-07 19:22:50

There seems to be a bug indeed...
but I do not know why it occurs in AMC since the function HTMLDecode used outside AMC works fine (at least the version of that function that was in previous AMC versions, but it does not seem to have changed).
So soulsnake will have to check that himself in AMC's code (I do not have AMC's code ready to compile on my PC).

Post by **J** » 2013-07-07 19:53:11

Alright, I'll use a StringReplace for now.

thanks.

Post by **soulsnake** » 2013-07-08 18:38:10

Hi,

The "en dash" character doesn't exist in the ISO 8859-1 table, which is why it is replaced by an empty string.
Actually, all special characters that don't exist in the ISO 8859-1 table are replaced by an empty string.
But maybe we could substitute "en dash" with 2 short dashes and "em dash" with 3 short dashes by default?

Soulsnake.

Post by **antp** » 2013-07-08 20:04:57

Well, the bug is that everything that follows the dash is deleted.
Replacing it with an empty string wouldn't be so much of a problem if the rest of the variable wasn't lost.
If the function is improved, it could simply replace all the dash types with a "-".

Post by **Raoul_Volfoni** » 2013-07-08 20:05:12

And what about replacing them with &#150 ; (–) and &#151 ; (—)?

Code:

program N_dash_M_dash;
var
  Value: string;
begin
  Value := '–' + #13#10 + 'N dash' + #13#10#13#10 + '—' + #13#10 + 'M dash';
  Value := StringReplace(Value, '–', '&#150 ;');
  Value := StringReplace(Value, '—', '&#151 ;');
  HTMLDecode(Value);
  ShowMessage(Value);
end.

Don't forget to remove the space between &#150 and ; and between &#151 and ;

Post by **antp** » 2013-07-08 20:25:05

Well, in that case they could be included directly in the function.
These characters do not exist in iso-8859-1; they are part of windows-1252. Maybe HTMLDecode should be extended to handle all the extra characters of that charset, as the program itself uses windows-1252 anyway on every Windows system set to a Western European character set.

Post by **J** » 2013-07-09 10:38:12

hi,

replacing one or two characters before using HTMLDecode() is not the problem, but according to
http://en.wikipedia.org/wiki/List_of_XM ... references
there are dozens of special HTML character entities (the HTML 4 DTDs define 252 named entities).

As Antoine said, replacing the entities by an empty string wouldn't be so much a problem if the rest of the variable wasn't lost.

When parsing the HTML I have no clue which character set is used, and this problem might occur again with other special characters. I expected the function to work on all HTML character entities no matter which charset is used.

So at least better handling (a working replacement with a space, without cutting off the text), or an extension of HTMLDecode() to all the characters (even UTF-8 in my case), or maybe a second function doing this would be really nice to have.

thanks
J.

Post by **antp** » 2013-07-09 12:59:32

Available entities do not depend on the character set used, as far as I know.
Except for the few that exist in the windows-1252 charset and could be added, most of the missing ones can't be properly translated anyway, so either removing them or replacing them with "?" might be the best solution.
As AMC isn't Unicode, handling all the Unicode ones will not be very useful, even if they could be stored in UTF-8 (for use in something else when exported by AMC then?)

Post by **J** » 2013-07-09 13:37:15

You're right,

as far as I understand, the entities are defined in the HTML DTD, and the characters themselves are only partly available in the individual character sets.

Doing it as you suggested, including the win-1252 ones, should then work as expected for AMC. At first I was confused by the "cutting off" error, which is still not the normal result, is it?

If you don't want to change the existing HTMLDecode function, I'll write a little script function which first handles the remaining entities and then calls HTMLDecode, but that's the second-best solution.
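
A minimal sketch of such a wrapper in AMC script, in case it helps others hitting the same problem. It is not taken from AMC itself: the name SafeHTMLDecode is hypothetical, and the entity names (&ndash;, &mdash;, &hellip;) are assumptions, since the thread only shows the rendered characters. It relies only on StringReplace and HTMLDecode as they are used in the example above, and replaces the dashes with a plain "-" as suggested earlier.

```pascal
program SafeDecodeSketch;

// Hypothetical helper (not part of AMC): replaces entities that HTMLDecode
// is reported to truncate on, then lets HTMLDecode handle everything else.
function SafeHTMLDecode(Value: string): string;
begin
  // Assumed entity names; extend this list as other problem entities turn up.
  Value := StringReplace(Value, '&ndash;', '-');
  Value := StringReplace(Value, '&mdash;', '-');
  Value := StringReplace(Value, '&hellip;', '...');
  // HTMLDecode is called as a procedure that changes its argument in place,
  // as in the example earlier in the thread.
  HTMLDecode(Value);
  Result := Value;
end;

var
  S: string;
begin
  S := 'Part 1 &ndash; Part 2 &amp; more';
  ShowMessage(SafeHTMLDecode(S));
end.
```

The replacements run before HTMLDecode so that the function never sees the entities it cannot map; any entity missing from the list would presumably still trigger the truncation.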

Post by **antp** » 2013-07-09 14:10:17

J wrote: First I was confused by the "cutting off" error, which is still not the normal result, isn't it?

Indeed, it is not normal.
Extending it to windows-1252 is easy, as I think that all the missing characters use positions that are currently empty in the iso-8859-1 set anyway.
It will give weird results on systems using a different character set (Cyrillic, etc.), but that was already the case with the symbols already handled.

Post by **soulsnake** » 2013-07-09 16:36:17

Hi,

I will extend the HTMLDecode function to the windows-1252 charset in the next update of AMC 4.2 beta.

Soulsnake.

Post by **J** » 2013-07-13 17:46:56

Very good, thanks.

Just for the record:

…

will cause the script to enter an endless loop or to disappear into nirvana.