If the standard UTF8Decode (or similar functions and procedures) does not give you a valid CP1252 charset, this is how to build your own conversion procedure.
1. Remove any conversion function from your script and save the unchanged page content to a debug file (you can use DumpPage()).
2. Run the script, then open the dump file with a hex editor (I use the free and open source wxHexEditor 0.24 beta for Windows).
3. Look at all the untranslated characters (example: ÃÂ is malformed, BUT ' is a valid HTML character) and note their hex values.
4. You will end up with a list of all the untranslated characters, and you will notice that they share a few prefixes. Example:
C382C2B0, C382C2AD, C382C48D and so on. In this example, 2 of the 3 untranslated characters share the prefix "C382C2" and the third one has "C382C4", so these 2 are my "prefixes". A prefix can be very long (5 or more characters). You may have to test more pages to collect all the prefixes.
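The grouping in step 4 can also be done mechanically. The sketch below uses a hypothetical helper (the name `groupByPrefix` is mine, not part of any library). It assumes each sequence is an even-length hex string and treats everything except the last byte as the prefix:

```javascript
// Hypothetical helper: group hex byte sequences by everything except their last byte.
// Assumes each input is an even-length hex string such as "C382C2B0".
function groupByPrefix(hexSequences) {
  const groups = new Map();
  for (const seq of hexSequences) {
    const hex = seq.toUpperCase();
    const prefix = hex.slice(0, -2); // all bytes but the last
    const last = hex.slice(-2);      // the final byte
    if (!groups.has(prefix)) groups.set(prefix, []);
    groups.get(prefix).push(last);
  }
  return groups;
}

const groups = groupByPrefix(["C382C2B0", "C382C2AD", "C382C48D"]);
for (const [prefix, lastBytes] of groups) {
  console.log(prefix + " -> " + lastBytes.join(", "));
}
// C382C2 -> B0, AD
// C382C4 -> 8D
```

Run it with Node.js or paste it in the browser console, replacing the sample list with the sequences from your own dump.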
5. Find the wanted character by looking at the original site page in your browser. Example: … is shown as c3a2c280c2a6 in the dump.
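A sequence like c3a2c280c2a6 is what you get when the UTF-8 bytes of … are treated as single-byte characters and then encoded as UTF-8 a second time. This is my reading of the example, not something the dump proves. The sketch below assumes each original byte was mapped to the code point with the same value (Latin-1 style), and `doubleEncodeHex` is a hypothetical name:

```javascript
// Hypothetical helper: UTF-8 encode a string, treat each byte as a code point
// of the same value (Latin-1 style, an assumption), then UTF-8 encode again.
function doubleEncodeHex(text) {
  const encoder = new TextEncoder();
  const firstPass = encoder.encode(text);            // e2 80 a6 for "\u2026"
  const misread = String.fromCharCode(...firstPass); // bytes read as characters
  const secondPass = encoder.encode(misread);
  return Array.from(secondPass, (b) => b.toString(16).padStart(2, "0")).join("");
}

console.log(doubleEncodeHex("\u2026")); // c3a2c280c2a6
```

If a site uses a different single-byte charset for the wrong step, the bytes will differ, which is why collecting the real sequences from the dump remains the safer route.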
6. This is my conversion function in JavaScript. Save it as an HTML file on your PC.
Code:
<html>
<head>
<meta charset="utf-8">
<title>Weird conversion function</title>
</head>
<body>
<script>
// Character whose UTF-8 byte sequence we want to inspect
const char = "…";
// Encode it to UTF-8 bytes and build a zero-padded hex string
const utf8Bytes = new TextEncoder().encode(char);
let hex = "";
for (const byte of utf8Bytes) {
  hex += byte.toString(16).padStart(2, "0");
}
document.writeln(`The hexadecimal sequence of "${char}" in UTF-8 is: ${hex}<br>`);
// Everything except the last byte is the shared prefix
const hexPrefix = hex.substring(0, hex.length - 2);
const decoder = new TextDecoder("utf-8");
// Build the prefix part of the expression, e.g. "chr(226) + chr(128) + "
let str = "";
for (let i = 0; i < hexPrefix.length; i += 2) {
  str += 'chr(' + parseInt(hexPrefix.substring(i, i + 2), 16) + ') + ';
}
// Try every possible UTF-8 continuation byte (0x80..0xBF) as the last byte
for (let i = 128; i <= 191; i++) {
  const hexSequence = hexPrefix + i.toString(16);
  const byteSequence = hexSequence.match(/.{1,2}/g).map((byte) => parseInt(byte, 16));
  const char2 = decoder.decode(new Uint8Array(byteSequence));
  document.writeln(`str := StringReplace(str, (${str}chr(${i})), '${char2}');<br>`);
}
</script>
</body>
</html>
7. Open the HTML file in your browser and look at the output. You can find the line where … is:
Code:
str := StringReplace(str, (chr(226) + chr(128) + chr(166)), '…');
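8. To turn the malformed bytes from the dump (c3a2c280c2a6) into that valid UTF-8 sequence, first replace the malformed prefix with the correct one, using an assignment like this:
Code: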
str := StringReplace(str, (chr(195) + chr(162) + chr(194) + chr(128) + chr(194)), (chr(226) + chr(128)));
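To check what this assignment does at byte level, here is a small JavaScript sketch. `replaceBytes` is a hypothetical helper written only for this illustration. It replaces every occurrence of one byte pattern with another, and the result is then decoded as UTF-8:

```javascript
// Hypothetical helper: replace every occurrence of `from` with `to` in a byte array.
// Assumes `from` is not empty.
function replaceBytes(bytes, from, to) {
  const out = [];
  let i = 0;
  while (i < bytes.length) {
    const matches = from.every((b, j) => bytes[i + j] === b);
    if (matches) {
      out.push(...to);
      i += from.length;
    } else {
      out.push(bytes[i]);
      i += 1;
    }
  }
  return out;
}

// Malformed bytes from the dump: c3 a2 c2 80 c2 a6
const malformed = [0xc3, 0xa2, 0xc2, 0x80, 0xc2, 0xa6];
// Same prefixes as in the StringReplace call above
const fixedBytes = replaceBytes(malformed, [195, 162, 194, 128, 194], [226, 128]);
const fixedText = new TextDecoder("utf-8").decode(new Uint8Array(fixedBytes));
console.log(fixedBytes.map((b) => b.toString(16)).join(" ")); // e2 80 a6
console.log(fixedText === "\u2026"); // true
```

The last byte (a6) is untouched by the prefix replacement, which is why one assignment per prefix covers all 64 characters that share it.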
9. Now make sure that all characters display well in CP1252. CP1252 is a single-byte charset with at most 256 characters, so many UTF-8 characters have no equivalent. When a character is missing or does not display well, substitute something close: for example, I use ... in place of …. Apply this kind of replacement to every line you got from the JavaScript script. If you can't find a good replacement for a UTF-8 character, you can use an empty string so that the character is not shown at all (I prefer this to ? or invalid characters, but it's up to you).
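The substitution idea in step 9 can be sketched as a lookup table. Everything here is hypothetical: the table contents, the snowman used as an example of a character with no good substitute, and the name `applyFallbacks`:

```javascript
// Hypothetical fallback table: keys are characters to replace, values are
// their substitutes. An empty string drops the character entirely.
const fallbacks = new Map([
  ["\u2026", "..."], // horizontal ellipsis
  ["\u2603", ""],    // example of a character with no good substitute
]);

function applyFallbacks(text, table) {
  let result = "";
  for (const ch of text) { // iterates by code point
    result += table.has(ch) ? table.get(ch) : ch;
  }
  return result;
}

console.log(applyFallbacks("Wait\u2026 snow\u2603!", fallbacks)); // Wait... snow!
```

In the real script each table entry becomes the last argument of the matching StringReplace line.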
10. Test the script with different pages. If you still get invalid characters, repeat from step 5 and add the new output to the previous lines. Note: you can also reuse valid, tested functions for known charset conversions, so you can mix your special conversion functions with ready-to-use ones.
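As an example of mixing in a ready-to-use conversion, the sketch below tries to undo a double UTF-8 encoding in one pass before you fall back to the prefix replacements. It only helps under the assumption that the wrong step was Latin-1 style (each byte became the code point with the same value). `undoDoubleEncoding` is a hypothetical name, and it returns null when that assumption does not hold:

```javascript
// Sketch: undo a Latin-1-style double UTF-8 encoding in one pass.
// Returns null when a decoded code point does not fit in one byte,
// which means the assumption does not hold for this input.
function undoDoubleEncoding(bytes) {
  const decoder = new TextDecoder("utf-8");
  const once = decoder.decode(new Uint8Array(bytes));
  const inner = [];
  for (const ch of once) {
    const code = ch.codePointAt(0);
    if (code > 0xff) return null;
    inner.push(code);
  }
  return decoder.decode(new Uint8Array(inner));
}

console.log(undoDoubleEncoding([0xc3, 0xa2, 0xc2, 0x80, 0xc2, 0xa6]) === "\u2026"); // true
console.log(undoDoubleEncoding([0xc3, 0x82, 0xc4, 0x8d])); // null
```

The second call uses the C382C48D sequence from step 4. It is rejected because C48D decodes to a code point above 255, so sequences like that one still need the hand-built prefix replacements.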