My weird anything to CP1252 how-to

Posted: 2023-05-23 18:35:45
by MrObama2022
This is weird, but it worked for my scripts where UTF8Decode or other functions fail. I used it for the ComicsBox ifs script.

If you can't get valid CP1252 text using the standard UTF8Decode or similar functions or procedures, this is how to build your own conversion procedure.

1. Remove any conversion function from your script and save the unchanged page content to a debug file (you can use DumpPage()).

2. Test the script, then open the dump file with a hex editor (I use the free and open source wxHexEditor 0.24 beta for Windows).

3. Look at all untranslated characters (example: à is malformed, BUT &#39 is a valid HTML character) and note their hex bytes.

4. In the end you will have a list of all untranslated characters, and you will notice that they share a few prefixes. Example:
C382C2B0, C382C2AD, C382C48D and so on. In this example, two of the three sequences share "C382C2" and the third one starts with "C382C4", so these two are my "prefixes". A prefix can be long (5 or more bytes). You may have to test more pages to find all the prefixes.

5. Find the wanted character by looking at the original site page in your browser. Example: … shows up as c3a2c280c2a6.

6. This is my conversion function in JavaScript. Save it as an HTML file on your PC.

Code:

<html>
<head>
<meta charset="utf-8">
<title>Weird conversion function</title>
</head>
<body>
<script>
// The character you want to end up with (copied from the original page).
const char = "…";

// UTF-8 bytes of that character, written as two-digit hex values.
const utf8Bytes = new TextEncoder().encode(char);
let hex = "";
for (const byte of utf8Bytes) {
  hex += byte.toString(16).padStart(2, "0");
}

document.writeln(`The hexadecimal sequence of "${char}" in UTF-8 is: ${hex}<br>`);

// Everything except the last byte is the prefix.
const hexPrefix = hex.substring(0, hex.length - 2);

const decoder = new TextDecoder("utf-8");

// Pascal expression for the prefix bytes, e.g. "chr(226) + chr(128) + ".
let str = "";
for (let i = 0; i < hexPrefix.length; i += 2) {
  str += 'chr(' + parseInt(hexPrefix.substring(i, i + 2), 16).toString(10) + ') + ';
}

// Try every continuation byte (0x80..0xBF) as the last byte and
// print one StringReplace line per resulting character.
for (let i = 128; i <= 191; i++) {
  const hexSequence = hexPrefix + i.toString(16).padStart(2, "0");
  const byteSequence = hexSequence.match(/.{1,2}/g).map((byte) => parseInt(byte, 16));
  const char2 = decoder.decode(new Uint8Array(byteSequence));
  document.writeln(`str := StringReplace(str, (${str}chr(${i})), '${char2}');<br>`);
}

</script>
</body>
</html>

7. Open the HTML file in your browser and look at the output. Find the line for …:

Code:

str := StringReplace(str, (chr(226) + chr(128) + chr(166)), '…');

8. Look again at the original site page and at how … appears in your hex editor. If it is not e280a6 (that is, 226 128 166 in decimal), you need one more conversion. So if your script gets c3a2c280c2a6, you have to convert the prefix 'c3a2c280c2' to 'e280', which you can do with this assignment:

Code:

str := StringReplace(str, (chr(195) + chr(162) + chr(194) + chr(128) + chr(194)), (chr(226) + chr(128)));

Put this line before the lines you got from the JavaScript page.

9. Now make sure all characters display correctly in CP1252. CP1252 is a single-byte charset with at most 256 characters, so many UTF-8 characters have no equivalent and need a substitute; for example, I use ... in place of … (strictly speaking, CP1252 does define … as byte 0x85, so keeping it is also possible). Apply such replacements to all the lines you got from the JavaScript page. If you can't find a good replacement for a UTF-8 character, you can use an empty string so that the character is not shown at all (I prefer this to ? or invalid characters, but it's up to you).

10. Test the script with different pages. If you still get invalid characters, repeat from step 5 and add the new output to the previous lines. Note: you can also reuse tested functions for known charset conversions, so you can mix your special conversion lines with ready-made ones.
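
Steps 4 and 8 can also be sketched as small JavaScript helpers. This is only an illustration under assumptions: the names hexToBytes, groupByPrefix and undoDoubleUtf8 are made up for this example, and undoDoubleUtf8 assumes the text was UTF-8 that got read as single bytes and then encoded as UTF-8 a second time. That matches the c3a2c280c2a6 example, but not every sequence you may find in a dump.

```javascript
// Sketch only: helper names are hypothetical, not part of any scripting API.

// "c3a2c280c2a6" -> Uint8Array [0xc3, 0xa2, 0xc2, 0x80, 0xc2, 0xa6]
function hexToBytes(hex) {
  const pairs = hex.match(/.{2}/g) || [];
  return Uint8Array.from(pairs, (pair) => parseInt(pair, 16));
}

// Step 4: group dumped sequences by everything except their last byte.
function groupByPrefix(hexSequences) {
  const groups = new Map();
  for (const seq of hexSequences) {
    const hex = seq.toLowerCase();
    const prefix = hex.slice(0, -2);
    if (!groups.has(prefix)) groups.set(prefix, []);
    groups.get(prefix).push(hex);
  }
  return groups;
}

// Step 8: try to undo one extra layer of UTF-8 encoding.
// Returns the repaired text, or null when the sequence is not plain
// double-encoded UTF-8 and needs a hand-written replacement instead.
function undoDoubleUtf8(hex) {
  const strict = new TextDecoder("utf-8", { fatal: true });
  try {
    const outer = strict.decode(hexToBytes(hex));
    const innerBytes = [];
    for (const ch of outer) {
      const code = ch.codePointAt(0);
      if (code > 0xff) return null; // not a single byte, so not double-encoded
      innerBytes.push(code);
    }
    return strict.decode(Uint8Array.from(innerBytes));
  } catch (err) {
    return null; // invalid UTF-8 at either layer
  }
}
```

With the sequences from step 4, groupByPrefix returns two groups (c382c2 and c382c4). undoDoubleUtf8("c3a2c280c2a6") gives …, while undoDoubleUtf8("C382C48D") returns null, which is the kind of sequence that still needs its own StringReplace line.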

Re: My weird anything to CP1252 how-to

Posted: 2023-05-27 14:58:32
by fulvio53s03
I'm sorry, but I can't follow your instructions.
I get stuck at the point of finding the special characters to convert.
If you run my ComicsBox script on https://www.comicsbox.it/albo/NATHNEVER_207 you will get

L'??obbiettivo
which should become
L'obbiettivo
- -- - - - - - - -- - - - --
contro "la Compagnia?"
which should become
contro "la Compagnia"

Thanks for your patience.

Re: My weird anything to CP1252 how-to

Posted: 2023-05-30 21:07:23
by MrObama2022
Hi fulvio,

Sorry for the delay. Those ?? can't be corrected because they are really in the page. My decoder can't tell a valid "?" from an invalid one. It translates characters from one charset to another, but it is not a text corrector.

If I open the given URL in Chrome, I see ??, so my decoder takes them as valid characters. This is how Chrome shows the page [ https://www.comicsbox.it/albo/NATHNEVER_207 ] to me:

Nathan Never
Controllo totale
Stefano Piani (script) / Andrea Cascioli (art)
Continua la lotta senza quartiere del solitario Joel Baldwin contro "la Compagnia?", un po' meno solo ora che ha trovato nell'Agente speciale Alfa Nathan Never un potenziale alleato. Nathan viene così a scoprire quanto sia pericoloso il microchip su cui sta indagando, l'I.C.-4. In particolare alcuni esemplari, prototipi che sono stati modificati illegalmente a scopo sperimentale e che violano costantemente la privacy dei loro possessori. Ma questo è solo l'??inizio, un piccolo esperimento su scala ridotta... L'??obbiettivo è il controllo totale!

94 pagine


As you can see, the ?? are inside the original HTML file. This is the HTML code of that page:

Code:

<div class="sinossi">Continua la lotta senza quartiere del solitario Joel Baldwin contro "la Compagnia?", un po' meno solo ora che ha trovato nell'Agente speciale Alfa Nathan Never un potenziale alleato. Nathan viene così a scoprire quanto sia pericoloso il microchip su cui sta indagando, l'I.C.-4. In particolare alcuni esemplari, prototipi che sono stati modificati illegalmente a scopo sperimentale e che violano costantemente la privacy dei loro possessori. Ma questo è solo l'??inizio, un piccolo esperimento su scala ridotta... L'??obbiettivo è il controllo totale!</div>

So the ?? are """valid""" question marks added by the author of the comic description, and you have to remove them manually. If this "problem" happens often, you can apply a StringReplace that turns " '?? " into " ' ". Add this StringReplace AFTER the charset decoder, not BEFORE.
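
A minimal JavaScript sketch of that clean-up, for illustration only. The function name removeLiteralQuestionMarks is made up, and in the real script this would be a StringReplace call placed after the charset conversion lines. It only handles the '?? pattern mentioned here, so the single ? in "la Compagnia?" is left alone.

```javascript
// Sketch only: hypothetical helper, not part of the ComicsBox script.
// Run it on text that has already been through the charset conversion.
function removeLiteralQuestionMarks(text) {
  // Replace every occurrence of '?? with a plain apostrophe.
  return text.split("'??").join("'");
}
```

For example, removeLiteralQuestionMarks("L'??obbiettivo") returns "L'obbiettivo".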

Re: My weird anything to CP1252 how-to

Posted: 2023-05-31 04:24:54
by fulvio53s03
MrObama2022 wrote: 2023-05-30 21:07:23 ... Those ?? can't be correct because they are in the page...
?? are """valid""" ? added from comic description author so you have to remove it manually.
If this "problem" happens often you can apply a stringreplace to translate " '?? " in " ' ".
Add this stringreplace AFTER the charset decoder and not BEFORE.
I understand.
Thank you.
:grinking: