My weird anything to CP1252 how-to
Posted: 2023-05-23 18:35:45
This is weird, but it worked for my scripts where UTF8Decode or other functions fail. I used it for the ComicsBox ifs script.
If the standard UTF8Decode or similar functions and procedures don't give you valid CP1252 text, this is how to build your own conversion procedure.
1. Remove any conversion function from your script and save the unchanged page content to a debug file (you can use DumpPage()).
2. Run the script, then open the dump file with a hex editor (I use the free and open source wxHexEditor 0.24 beta for Windows).
3. Look at all untranslated characters (example: ÃÂ is malformed, BUT ' is a valid HTML character) and note their hex bytes.
4. You will end up with a list of all untranslated characters, and you will notice that they share a few prefixes. Example:
C382C2B0, C382C2AD, C382C48D and so on. In this example, "C382C2" is common to 2 of the 3 untranslated characters and "C382C4" belongs to the other one, so these 2 are my "prefixes". A prefix can be very long (5 or more bytes). You may have to test more pages to collect all the prefixes.
5. Find the character you want by looking at the original site page in your browser. Example: … shows up in the dump as c3a2c280c2a6.
6. This is my conversion function in JavaScript. Save it as an HTML file on your PC.
Code:
<html>
<head>
<!-- Assumes this file is saved as UTF-8, so the literal character below is read correctly -->
<meta charset="utf-8">
<title>Weird conversion function</title>
</head>
<body>
<script>
// Character whose UTF-8 byte sequence we want to inspect.
const char = "…";
// Encode it to UTF-8 bytes and print them as zero-padded hex.
const utf8Bytes = new TextEncoder().encode(char);
let hex = "";
for (const b of utf8Bytes) {
  hex += b.toString(16).padStart(2, "0");
}
document.writeln(`The hexadecimal sequence of "${char}" in UTF-8 is: ${hex}<br>`);
// Everything except the last byte is the shared prefix.
const hexPrefix = hex.substring(0, hex.length - 2);
const decoder = new TextDecoder("utf-8");
// Build the "chr(n) + chr(n) + " part for the prefix bytes.
let str = "";
for (let i = 0; i < hexPrefix.length; i += 2) {
  str += 'chr(' + parseInt(hexPrefix.substring(i, i + 2), 16) + ') + ';
}
// Emit one StringReplace line per possible last byte (0x80..0xBF).
for (let i = 128; i <= 191; i++) {
  const hexSequence = hexPrefix + i.toString(16);
  const byteSequence = hexSequence.match(/.{1,2}/g).map((byte) => parseInt(byte, 16));
  const char2 = decoder.decode(new Uint8Array(byteSequence));
  document.writeln(`str := StringReplace(str, (${str}chr(${i})), '${char2}');<br>`);
}
</script>
</body>
</html>
7. Open the HTML file in your browser and look at the output. You can see where … is:
Code:
str := StringReplace(str, (chr(226) + chr(128) + chr(166)), '…');
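8. Look again at the original site page and at how … appears in your hex editor. If it is not e280a6 (that is, 226 128 166 in decimal), you need a further conversion. For example, if the dump contains c3a2c280c2a6, convert the prefix 'c3a2c280c2' to 'e280' with this assignment, placed before the lines you got from the JavaScript script:
Code: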
str := StringReplace(str, (chr(195) + chr(162) + chr(194) + chr(128) + chr(194)), (chr(226) + chr(128)));
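The assignment above works because c3a2c280c2a6 is the UTF-8 sequence e280a6 encoded a second time: each original byte (e2, 80, a6) was read as a single character and then encoded to UTF-8 again. As a rough JavaScript sketch, one such round can be undone to check which bytes a dumped sequence should become. `undoDoubleEncoding` is a hypothetical helper name, not part of the original script, and it assumes every decoded character fits in one byte:

```javascript
// Hypothetical helper: undo one round of UTF-8 double encoding.
// Assumes the input decodes to characters in the range U+0000..U+00FF,
// i.e. the original bytes were read one per character and encoded again.
function undoDoubleEncoding(hex) {
  const bytes = Uint8Array.from(hex.match(/../g), (h) => parseInt(h, 16));
  // First decode: yields one character per original byte.
  const once = new TextDecoder("utf-8").decode(bytes);
  const original = [];
  for (const ch of once) {
    const cp = ch.codePointAt(0);
    if (cp > 0xff) return null; // not a simple double encoding
    original.push(cp);
  }
  return original.map((b) => b.toString(16).padStart(2, "0")).join("");
}

console.log(undoDoubleEncoding("c3a2c280c2a6")); // e280a6
console.log(undoDoubleEncoding("c382c2b0"));     // c2b0
```

A `null` result (for example with c382c48d) means the sequence is not a plain double encoding, so its prefix still has to be worked out by hand in the hex editor.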
9. Now make sure all characters display properly in CP1252. CP1252 is a single-byte encoding with at most 256 characters, so most Unicode characters have no equivalent. If a character such as … does not display, you can use ... instead. Apply this kind of replacement to all the lines you got from the JavaScript script. If you can't find a good replacement for a UTF-8 character, you can use an empty string so that the character is not shown at all (I prefer this to ? or invalid characters, but it's up to you).
10. Test the script with different pages. If you still get invalid characters, repeat from step 5 and add the new output to the previous lines. Note: you can also reuse tested functions for known charset conversions, so you can mix your special conversion functions with ready-to-use ones.
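The fallback idea from step 9 can also be sketched in JavaScript. This is only an illustration: `fallbacks` and `toDisplayable` are hypothetical names, and the table entries are examples. As a simplification, it keeps only ASCII and the code points U+00A0..U+00FF (which match CP1252 bytes 0xA0..0xFF) and treats everything else as needing a replacement, including the characters CP1252 stores in 0x80..0x9F.

```javascript
// Hypothetical fallback table: Unicode character -> plain replacement.
// The entries are examples only; extend the table for your own pages.
const fallbacks = new Map([
  ["\u2026", "..."], // horizontal ellipsis
  ["\u201c", '"'],   // left double quotation mark
  ["\u201d", '"'],   // right double quotation mark
  ["\u2013", "-"],   // en dash
]);

// Keep ASCII and U+00A0..U+00FF; replace everything else via the table,
// or drop it when the table has no entry.
function toDisplayable(text, table = fallbacks) {
  let out = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (cp < 0x80 || (cp >= 0xa0 && cp <= 0xff)) {
      out += ch;
    } else {
      out += table.has(ch) ? table.get(ch) : ""; // empty string = not shown
    }
  }
  return out;
}

console.log(toDisplayable("Wait\u2026 \u201ccaf\u00e9\u201d \u263a!")); // Wait... "café" !
```

Dropping unknown characters instead of printing ? follows the preference stated in step 9; change the empty string if you would rather see a marker.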