IMDB script problem

bad4u
Posts: 1148
Joined: 2006-12-11 22:54:46

Post by bad4u »

antp wrote:AMC is not Unicode, so it is unable to display foreign character sets (e.g. on my French-configured Windows I am unable to have Russian characters in AMC).
It is no problem to copy/paste Russian characters into my English-configured AMC on a German-configured Windows system, but I did not test importing straight from a Russian website via an AMC script.


Btw, I have been playing around with a 'RemoveAccents' option (it seems to work, but is not finished yet), and I think there are a lot more special characters to add, e.g. capital letters like ÁÉÚÀÈÙ etc. Correct?
antp
Site Admin
Posts: 9639
Joined: 2002-05-30 10:13:07
Location: Brussels
Contact:

Post by antp »

Indeed.
See http://en.wikipedia.org/wiki/Iso-8859-1 and its "code table" section.
All the characters in the second half of the table are specific to that charset and should be replaced by generic ones, i.e. converted to the basic ASCII charset.

¡ (delete; used in front of a sentence that ends with ! in Spanish)
¢ = cent
£ = pound
¥ = yen
¦ = |
© = (c)
ª = a
« = "
® = (r)
° = deg. (or if you have a better replacement idea...)
± = +/-
² = ^2
³ = ^3
´ = '
µ = mu
· = -
º = o
» = "
¼ = 1/4
½ = 1/2
¾ = 3/4
¿ (delete; used in front of a sentence that ends with ? in Spanish)
À, Á, Â, Ã, Ä, Å = A
Æ = AE
Ç = C
È, É, Ê, Ë = E
Ì, Í, Î, Ï = I
Ð = DH
Ñ = N
Ò, Ó, Ô, Õ, Ö = O
× = x
Ø = O
Ù, Ú, Û, Ü = U
Ý = Y
Þ = TH
ß = ss
à, á, â, ã, ä, å = a
æ = ae
ç = c
è, é, ê, ë = e
ì, í, î, ï = i
ð = dh
ñ = n
ò, ó, ô, õ, ö = o
÷ = :
ø = o
ù, ú, û, ü = u
ý = y
þ = th
ÿ = y

I skipped the few that will probably never occur in texts.
Not sure what we should do with ¹, §, ¶, ¤ and the like.

I am not sure about German characters: in some other languages it may look strange if these are replaced with the "letter+e" version.

In addition to the above, some sites use Windows-1252 characters that should also be replaced.
http://en.wikipedia.org/wiki/Windows-1252
If you are using Windows, you can type them directly in the script (if you are running AMC in Wine on Linux, I am not sure how it will work):
€ = euro
„ = "
… = ...
‰ = /1000
Š = S (or SH?)
Œ = OE
Ž = Z
‘ = '
’ = '
“ = "
” = "
• = -
– = -
— = -
™ = (tm)
š = s (or sh?)
œ = oe
ž = z
Ÿ = Y
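
The two lists above amount to a lookup table from one extended character to an ASCII replacement string. Below is a minimal sketch of that idea, written in Python rather than AMC's Pascal scripting language and covering only a subset of the rows. The names `GROUPS`, `SINGLES` and `to_basic_ascii` are made up for illustration and are not part of AMC or StringUtils1.

```python
# Hypothetical sketch: a subset of the replacement rules listed in the post above.
# Each key in GROUPS is a run of characters that share one replacement.
GROUPS = {
    "ÀÁÂÃÄÅ": "A", "àáâãäå": "a",
    "ÈÉÊË": "E", "èéêë": "e",
    "ÌÍÎÏ": "I", "ìíîï": "i",
    "ÒÓÔÕÖØ": "O", "òóôõöø": "o",
    "ÙÚÛÜ": "U", "ùúûü": "u",
}

# Characters with their own replacement; an empty string means "delete".
SINGLES = {
    "Æ": "AE", "æ": "ae", "Ç": "C", "ç": "c", "Ñ": "N", "ñ": "n",
    "ß": "ss", "©": "(c)", "®": "(r)", "½": "1/2",
    "«": '"', "»": '"', "¡": "", "¿": "",
    "‘": "'", "’": "'", "“": '"', "”": '"',
    "…": "...", "€": "euro", "™": "(tm)",
}

# Build one table keyed by code point, the form str.translate expects.
TABLE = {}
for chars, replacement in GROUPS.items():
    for ch in chars:
        TABLE[ord(ch)] = replacement
for ch, replacement in SINGLES.items():
    TABLE[ord(ch)] = replacement


def to_basic_ascii(text: str) -> str:
    """Replace every character found in TABLE; leave all others untouched."""
    return text.translate(TABLE)


print(to_basic_ascii("¿Qué pasó?"))
print(to_basic_ascii("Amélie © ½"))
```

A single table keeps the rules in one place, so changing a rule (for example "ö" to "oe" instead of "o") means editing one entry rather than one call per character.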

Post by bad4u »

That will be a rather long list of StringReplace calls ;)

Would it make sense to put it into an external .pas file so that other scripts can use it too, instead of adding it to (and bloating) the IMDB script? Or even integrate it into stringutils1.pas?

Post by antp »

Yes, integrate it in StringUtils1, and use its version number to check if you can use the function or not ;) (i.e. if the version is recent enough)
In the future I should maybe include such functions into AMC directly?

Post by bad4u »

Sorry, I have not found time to finish the "RemoveAccent" function yet, but I'll do it soon.
antp wrote:In the future I should maybe include such functions into AMC directly?
It may make sense to integrate functions like "TextBetween" or "FindLine" into AMC directly (if it does not take too much time and the functions are then documented in the help file), but I don't think something like "RemoveAccents" should be integrated, because then you lose flexibility, e.g. if someone prefers the German substitutions "oe, ue, ae" instead of "o, u, a" for the special characters "ö, ü, ä".

The best solution for later versions of AMC (in a very, even extremely, distant future ;)) would be an option "Substitute special characters" in the AMC preferences plus some kind of .ini file with the substitutions, or even an integrated editor. The substitution would then be done whenever a script uses "SetField", so that it works for all scripts without modifying them, which is much better than calling a function from a script.
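
To make the .ini idea concrete, here is a rough sketch in Python (not AMC's Pascal scripting language). AMC has no such option; the section name `substitutions`, the file layout, and both function names are assumptions invented for this example.

```python
import configparser

# Hypothetical user-editable file: one "character = replacement" rule per line.
INI_TEXT = """
[substitutions]
ö = oe
ü = ue
ä = ae
ß = ss
"""


def load_substitutions(ini_text: str) -> dict:
    """Parse the rules; keep key case so that 'Ö' and 'ö' could differ."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # the default would lowercase every key
    parser.read_string(ini_text)
    return dict(parser["substitutions"])


def apply_substitutions(text: str, rules: dict) -> str:
    """Apply every rule in turn, as a SetField-level hook might do."""
    for source, replacement in rules.items():
        text = text.replace(source, replacement)
    return text


rules = load_substitutions(INI_TEXT)
print(apply_substitutions("Schön grüßen", rules))
```

With the rules in a file, a user who prefers "o, u, a" only edits the right-hand sides; no script has to change.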

Post by antp »

Well, the really best solution would be to use Unicode to get rid of the accent problems :D (though I am not sure that it is supported by the script engine, and it would add many other problems)

The replacement at the "SetField" level, as you suggested, cannot be done for all scripts. It is not possible to distinguish an ISO-8859-1 text from a Russian text without knowing what the web page is supposed to contain.
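
The ambiguity can be shown with a few bytes. The sketch below uses Python and its standard `cp1251` (Cyrillic) and `cp1252` (Western European) codecs; the sample word and the tiny `ACCENTS` table are chosen only for illustration.

```python
# The same four bytes are valid text in both single-byte code pages.
raw = "кино".encode("cp1251")        # Russian word, Windows-1251 bytes

as_russian = raw.decode("cp1251")    # what the Russian page meant
as_western = raw.decode("cp1252")    # what a Western European system sees

# Neither decode raises an error, so the bytes alone cannot tell us
# which reading is the intended one.
print(as_russian, as_western)

# A blanket accent-stripping step would therefore damage the Russian text.
ACCENTS = {"ê": "e", "è": "e", "í": "i", "î": "i"}  # tiny illustrative table
damaged = "".join(ACCENTS.get(ch, ch) for ch in as_western)
print(damaged)
```

Only the script author knows which charset the site uses, which is why the conversion has to be requested per script.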

Post by bad4u »

I had a quick look at a Russian script and now I know what you mean. I thought differently because I could copy/paste Russian text into AMC fields without problems, but it seems it is not possible to import Russian characters through a script when you're using Western European Windows language settings, is it?

Btw, why is copy/paste possible?

Sorry, but I have never dealt with different charsets before, so it's a little bit confusing ;)

Post by antp »

bad4u wrote:it seems not possible to import russian symbols through a script when you're using western european Windows language settings, is it ?
Indeed.
bad4u wrote:Btw. why is copy/paste possible ?
:??: It should not be possible.
Maybe it seems to work when you paste it, but when you switch to another movie and come back you should no longer see the Russian characters.

Post by bad4u »

You're right. I did not check this before ;)

Post by bad4u »

I'm just editing the StringUtils unit and there are some more questions:

- What should the function be named?
"RemoveAccents" (but it's not only accents being changed)
"ReplaceSpecialCharacters / ReplaceSpecChar" (this is closest to what it does)
"ChangeToASCII", "CharsetToASCII", "BasicCharset"?
Better ideas?

- I'm not sure about the ‘ ’ ´ signs, as I think they cannot be replaced by ' (because that is used as the string delimiter in the scripting language). Is the ` (code 60, hexadecimal, from http://en.wikipedia.org/wiki/Iso-8859-1 ) available for the replacement?

- The recent StringUtils shows version 4, but only v.3 is listed in the version history. Shall I set the new version to 5 then?

Post by bad4u »

Here are updated versions of the IMDB script (3.20) and StringUtils1.pas (v.5). You will need to update both files to use the "RemoveAccents" function in the IMDB script. Remember to set the option "RemoveAccents" to "1" (it is "0" by default).

http://www.bad4u.741.com/IMDB.ifs
http://www.bad4u.741.com/StringUtils1.pas (copy links into a NEW browser window)

These files are forum releases only! Please do not upload them until the name of the function has been adjusted and the script has been tested by some people running Russian or similar Windows localizations.

Please test and report whether you have any problems or not! Thanks.

Post by antp »

It could be called "ConvertToASCII", since these basic characters are actually ASCII, i.e. 128 characters (the extended 256-character charsets have other names). Or "WestEuropeanToASCII" (as here we only made rules for the West European charset), or "ISO88591ToASCII", although it includes some Windows-1252 characters...

I probably forgot to add v4 to the history.

Accents:
´ -> extended character, replace by '
` -> is part of the basic charset so it can stay
Quotes/Apostrophes:
’ -> closing single quote / apostrophe -> replace by '
‘ -> opening single quote -> I said to replace by ' but it could indeed be replaced by ` (the 2nd accent above)

Post by bad4u »

I'll name it "ConvertToASCII" in the final version then; that seems obvious and easy to understand for everyone.

But how do I replace ’ or ´ by a '? I think it's not possible to put a ' between two ' ' in the script. Or is there any code for this I could use? (At the moment it is replaced by the ` from the basic charset) ;)

Post by antp »

Use two ' for that:

Code: Select all

s := '''';

That is four ': one for the beginning, one for the end, and two for the character itself, so the string will actually contain only one.
SFX
Posts: 5
Joined: 2007-07-30 19:41:18

Post by SFX »

bad4u wrote:Here are updated versions of IMDB script (3.20) and StringUtils1.pas (v.5). You will need to update both files to use the "RemoveAccents" function on IMDB script. Remember to set option "RemoveAccents" to "1" (it is "0" by default).

http://www.bad4u.741.com/IMDB.ifs
http://www.bad4u.741.com/StringUtils1.pas (copy links into a NEW browser window)

These files are forum releases only ! Please do not upload them until the name of the function has been adjusted and the script has been tested by some people running russian or similar Windows localizations.

Please test and report if you have any problems or not ! Thanks.
I've just tested it and it works well. Thank you so much, bad4u and antp! You had to do much more work than I expected. :)

Post by bad4u »

antp wrote:Not sure of what we should do with ¹, §, ¶, ¤ and stuff like that
I don't think these symbols will appear (often) on IMDB, but this is what I replaced them with:

§ = [section]
¶ = [paragraph]
¤ = [currency]
¹ = ^1

@Antoine: Not a very elegant solution, but it seemed practical to me. And if they should appear somewhere, the user can edit the text later as he likes or needs. If you have better ideas, feel free to change them ;)

Btw, I don't think it makes sense to change the StringUtils1 version number again if you change any symbols later.

If no problems are reported within a few days, you can upload the scripts (there were no further changes to IMDB), but please download the latest version first (I updated the "forum release" with some minor changes, like the '''' replacement).

Post by antp »

bad4u wrote: Btw.. I don't think it makes sense to change StringUtils1 version number again, if you should change any symbols later.
Indeed. The version number is there for checks that prevent scripts from using non-existing functions, so that the user gets a message like "download new version of stringutils1.pas" instead of "identifier... not declared".
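
The version gate described here can be sketched as follows. This is Python, not the actual AMC script code; `REQUIRED_VERSION` reflects the v.5 release mentioned earlier in the thread, while the function name and the message wording are assumptions for illustration.

```python
# Version of the helper unit in which the conversion function first appeared
# (v.5 according to the thread).
REQUIRED_VERSION = 5


def check_library(found_version: int, required: int = REQUIRED_VERSION) -> str:
    """Return an empty string if the unit is recent enough, else a hint for the user."""
    if found_version >= required:
        return ""
    return (
        "Please download a newer version of StringUtils1.pas "
        f"(found v{found_version}, need v{required} or later)."
    )


# A script would run this check before calling the new function.
print(check_library(4))
print(repr(check_library(5)))
```

Because the check compares with "at least", later changes to individual replacement rules do not require bumping the number again.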

Post by SFX »

@bad4u: I opened StringUtils1.pas and looked at the code you added:

Code: Select all

function ConvertToASCII(Value: string): string;
begin
  HTMLDecode(Value);
  Value := StringReplace(Value, 'а', 'a');
  Value := StringReplace(Value, 'б', 'a');
  Value := StringReplace(Value, 'в', 'a');
  Value := StringReplace(Value, 'г', 'a');
  Value := StringReplace(Value, 'д', 'ae');
  Value := StringReplace(Value, 'е', 'a');
  Value := StringReplace(Value, 'ж', 'a');
  Value := StringReplace(Value, 'з', 'c');
  Value := StringReplace(Value, 'и', 'e');
  Value := StringReplace(Value, 'й', 'e');
  Value := StringReplace(Value, 'к', 'e');
  Value := StringReplace(Value, 'л', 'e');
  Value := StringReplace(Value, 'м', 'i');
  Value := StringReplace(Value, 'н', 'i');
  Value := StringReplace(Value, 'о', 'i');
  Value := StringReplace(Value, 'п', 'i');
  Value := StringReplace(Value, 'р', 'dh');
  Value := StringReplace(Value, 'с', 'n');
  Value := StringReplace(Value, 'т', 'o');
  Value := StringReplace(Value, 'у', 'o');
  Value := StringReplace(Value, 'ф', 'o');
  Value := StringReplace(Value, 'х', 'o');
  Value := StringReplace(Value, 'ц', 'o');
  Value := StringReplace(Value, 'ш', 'o');
  Value := StringReplace(Value, 'щ', 'u');
  Value := StringReplace(Value, 'ъ', 'u');
  Value := StringReplace(Value, 'ы', 'u');
  Value := StringReplace(Value, 'ь', 'u');
  Value := StringReplace(Value, 'э', 'y');
  Value := StringReplace(Value, 'ю', 'th');
  Value := StringReplace(Value, 'я', 'y');
  Value := StringReplace(Value, 'Я', 'ss');
  Value := StringReplace(Value, 'Ў', '');
  Value := StringReplace(Value, 'ў', 'cent');
  Value := StringReplace(Value, 'Ј', 'pound');
  Value := StringReplace(Value, 'Ґ', 'yen');
  Value := StringReplace(Value, 'Ђ', 'euro');
  Value := StringReplace(Value, '©', '(c)');
  Value := StringReplace(Value, 'Є', 'a');
  Value := StringReplace(Value, '«', '"');
  Value := StringReplace(Value, '®', '(r)');
  Value := StringReplace(Value, '°', 'deg.');
  Value := StringReplace(Value, '±', '+/-');
  Value := StringReplace(Value, 'І', '^2');
  Value := StringReplace(Value, 'і', '^3');
  Value := StringReplace(Value, 'ґ', '''');
  Value := StringReplace(Value, 'µ', 'micro');
  Value := StringReplace(Value, '·', '-');
  Value := StringReplace(Value, 'є', 'o');
  Value := StringReplace(Value, '»', '"');
  Value := StringReplace(Value, 'ј', '1/4');
  Value := StringReplace(Value, 'Ѕ', '1/2');
  Value := StringReplace(Value, 'ѕ', '3/4');
  Value := StringReplace(Value, 'ї', '');
  Value := StringReplace(Value, 'А', 'A');
  Value := StringReplace(Value, 'Б', 'A');
  Value := StringReplace(Value, 'В', 'A');
  Value := StringReplace(Value, 'Г', 'A');
  Value := StringReplace(Value, 'Д', 'A');
  Value := StringReplace(Value, 'Е', 'A');
  Value := StringReplace(Value, 'Ж', 'AE');
  Value := StringReplace(Value, 'З', 'C');
  Value := StringReplace(Value, 'И', 'E');
  Value := StringReplace(Value, 'Й', 'E');
  Value := StringReplace(Value, 'К', 'E');
  Value := StringReplace(Value, 'Л', 'E');
  Value := StringReplace(Value, 'М', 'I');
  Value := StringReplace(Value, 'Н', 'I');
  Value := StringReplace(Value, 'О', 'I');
  Value := StringReplace(Value, 'П', 'I');
  Value := StringReplace(Value, 'Р', 'DH');
  Value := StringReplace(Value, 'С', 'N');
  Value := StringReplace(Value, 'Т', 'O');
  Value := StringReplace(Value, 'У', 'O');
  Value := StringReplace(Value, 'Ф', 'O');
  Value := StringReplace(Value, 'Х', 'O');
  Value := StringReplace(Value, 'Ц', 'O');
  Value := StringReplace(Value, 'Ч', 'x');
  Value := StringReplace(Value, 'Ш', 'O');
  Value := StringReplace(Value, 'Щ', 'U');
  Value := StringReplace(Value, 'Ъ', 'U');
  Value := StringReplace(Value, 'Ы', 'U');
  Value := StringReplace(Value, 'Ь', 'U');
  Value := StringReplace(Value, 'Э', 'Y');
  Value := StringReplace(Value, 'Ю', 'TH');
  Value := StringReplace(Value, '№', '^1');
  Value := StringReplace(Value, '§', '[section]');
  Value := StringReplace(Value, '¶', '[paragraph]');
  Value := StringReplace(Value, '¤', '[currency]');
  Value := StringReplace(Value, '„', '"');
  Value := StringReplace(Value, '…', '...');
  Value := StringReplace(Value, '‰', '/1000');
  Value := StringReplace(Value, 'Љ', 'S');
  Value := StringReplace(Value, 'Њ', 'OE');
  Value := StringReplace(Value, 'Ћ', 'Z');
  Value := StringReplace(Value, '‘', '''');
  Value := StringReplace(Value, '’', '''');
  Value := StringReplace(Value, '“', '"');
  Value := StringReplace(Value, '”', '"');
  Value := StringReplace(Value, '•', '-');
  Value := StringReplace(Value, '–', '-');
  Value := StringReplace(Value, '—', '-');
  Value := StringReplace(Value, '™', '(tm)');
  Value := StringReplace(Value, 'љ', 's');
  Value := StringReplace(Value, 'њ', 'oe');
  Value := StringReplace(Value, 'ћ', 'z');
  Value := StringReplace(Value, 'џ', 'Y');
  Result := Value;
end;
So on a Russian or similar Windows it works like this:
1. Unicode characters downloaded from IMDB are changed to wrong ASCII characters.
2. The wrong ASCII characters are changed to the right ASCII ones.
Am I right? Did you know or expect this?
So the code in StringUtils1.pas looks different on different Windows localizations, but still works fine?
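
One plausible explanation, assuming AMC and its script engine handle text as single-byte strings (AMC is described above as not Unicode): the bytes in the .pas file and the bytes in the downloaded page are the same on every system, and only the way they are displayed changes with the Windows code page. A small Python sketch of that assumption, with a sample title chosen for illustration:

```python
# Bytes as saved by the script author on a Western European system.
search = "é".encode("cp1252")         # one byte: 0xE9
page = "Amélie".encode("cp1252")      # bytes of a downloaded title

# How a Russian-configured editor would display the same search byte.
shown_on_russian_windows = search.decode("cp1251")

# The replacement compares raw bytes, so the display code page is irrelevant.
cleaned = page.replace(search, b"e").decode("ascii")

print(shown_on_russian_windows, cleaned)
```

Under this reading there is no conversion from Unicode to "wrong" characters at all; the Cyrillic letters in the listing are just another rendering of the original accented letters.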

And another very small problem: when I put the writer in the producer field, it adds "..., (more)":
Zack Snyder (screenplay) &, Kurt Johnstad (screenplay) ..., (more)

Post by antp »

Do you have an example of an IMDB page with Unicode characters? As far as I know they only use ISO-8859-1. And anyway, AMC does not handle Unicode characters :??:
Viridarium
Posts: 7
Joined: 2006-08-02 21:06:20
Contact:

Post by Viridarium »

I have a problem with the update of the IMDB script. The import of the actors does not work as well as in the old version. Where can I get the old IMDB script? :??: