Posted: 2007-09-27 06:29:31
by dwayne2005
I had errors with your beta script and might give up, and just make do.

I was thinking Google had a wildcard for search criteria, but the asterisk only works for whole words. I thought maybe you could have a "Batman (*)" but that doesn't seem to work. Wonder if you can force IMDB to include parentheses? And I was just thinking "IMDB > Batman" might be better search criteria than /maindetails; it seems more accurate without /maindetails, but on comment pages the title is like "IMDB > Batman > Comments" and still registers. It's probably not even detecting the > bit in the search criteria.

I don't know why Google keeps things `minimal'.

Posted: 2007-09-27 08:46:42
by dwayne2005
Just been looking into more of Google's search features, came up with nothing.

Wildcards do work on urls, though, but just once it seems. A Google search like this:

site:www.imdb.com/title/*/combined batman

Gets rid of that problem with the findings of Batman Returns and Superman: Man of Steel using site:www.imdb.com/title/*/maindetails (and problems with the "comment on this" selections as well) and also gets rid of the unexpected page branch offs ... the side effect being that it's the complete credits listing.

I tried this without success:
site:www.imdb.com/title/*/ batman -site:www.imdb.com/title/*/*

I couldn't make sense of it, so I guess wildcard twice won't work although the negative seemed to do something. You can enter in each of the URLs specifically -site:www.imdb.com/title/*/quotes but there are at least a dozen such pages. You can do just the -/externalreviews bit but not for all URLs (at least won't work with -/quotes & -/combined & others).

(You know, personally I wouldn't mind downloading the full credits into the actors section :))

Posted: 2007-09-27 10:20:02
by bad4u
dwayne2005 wrote:site:www.imdb.com/title/*/combined batman

Gets rid of that problem with the findings of Batman Returns and Superman: Man of Steel using site:www.imdb.com/title/*/maindetails (and problems with the "comment on this" selections as well) and also gets rid of the unexpected page branch offs ... the side effect being that it's the complete credits listing.
The /combined or /maindetails branch is no problem at all; you do not need to care about that (with IMDB_319_mod.ifs only!). The script builds the results list from Google's results page and strips everything after the title number (.../ttxxxxxxx/) from the URL, so it always uses the standard IMDB address of the film. So if the search parameters above work better, you can simply change the two lines containing the Google search at the end of the script (I can make the changes tomorrow if you are not sure about it).
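To make the URL clean-up described above concrete, here is a minimal sketch in Python rather than in the script's own Pascal. The function name `canonical_title_url` and the placeholder ID `tt0000000` are assumptions for illustration only; the actual script does this with its own string helpers.

```python
import re

# Matches an IMDB title link and captures the title number (ttxxxxxxx).
# Anything after the number (/combined, /maindetails, /quotes ...) is ignored.
TITLE_RE = re.compile(r"https?://(?:www\.)?imdb\.com/title/(tt\d+)")


def canonical_title_url(url):
    """Return the standard film address for a Google result link, or None."""
    match = TITLE_RE.match(url)
    if match is None:
        return None  # not a title page (e.g. a search or name page)
    return "http://www.imdb.com/title/" + match.group(1) + "/"


if __name__ == "__main__":
    # tt0000000 is a placeholder ID, not a real film.
    print(canonical_title_url("http://www.imdb.com/title/tt0000000/combined"))
```

With this kind of normalisation it no longer matters which sub-page Google returns, as long as the link contains the title number.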

Posted: 2007-09-27 10:32:15
by dwayne2005
AnalyzeGooglesResultsPage('http://www.google.com/search?num=30&q=s ... */combined+' + UrlEncode(MovieName))
AnalyzeGooglesResultsPage('http://www.google.com/search?num=30&q=s ... mbined+%22' + UrlEncode(MovieName) + '%22')


That's what I changed, but I got `could not find results'. :/ Am I missing a code for the asterisk?

EDIT: made a little adjustment. 404 not found? :/ Same with %2A
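For reference, this is a small Python sketch of what the search string from earlier in the thread looks like once every reserved character is percent-encoded (the asterisk becomes %2A). The helper name `google_search_url` is made up for this example, and the sketch says nothing about why the script reported 404; it only shows the encoding.

```python
from urllib.parse import quote_plus


def google_search_url(movie_name, exact=False, num=30):
    """Build a Google search URL restricted to IMDB /combined pages."""
    site = "site:www.imdb.com/title/*/combined"
    # Wrap the title in double quotes for an exact-phrase search.
    name = '"' + movie_name + '"' if exact else movie_name
    # quote_plus encodes ':' '/' '*' '"' and turns spaces into '+'.
    return "http://www.google.com/search?num=%d&q=%s" % (
        num, quote_plus(site + " " + name))


if __name__ == "__main__":
    print(google_search_url("batman"))
    print(google_search_url("batman", exact=True))
```

Whether Google treats an encoded asterisk inside site: the same way as a literal one is not something this sketch can confirm.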

Posted: 2007-09-27 11:00:35
by bad4u
dwayne2005 wrote:AnalyzeGooglesResultsPage('http://www.google.com/search?num=30&q=s ... */combined+' + UrlEncode(MovieName))
AnalyzeGooglesResultsPage('http://www.google.com/search?num=30&q=s ... mbined+%22' + UrlEncode(MovieName) + '%22')
It looks correct to me, but I cannot test it now. Are you sure you made the changes to IMDB_319_mod.ifs for your tests? Versions 3.20 and 3.21 will NOT work for this, because they do not yet contain the modifications for deleting /combined from the URL.

Btw. are you sure every film has a /combined page ? ;)

Posted: 2007-09-27 11:03:55
by dwayne2005
I am not sure, but will give it a more thorough look into. :wink:

You're right, I appended to 3.20. I didn't try with your adjusted one because I had problems with that, but I'll go give it a shot. :)

EDIT: No luck. :/ Same error.

Posted: 2007-09-27 11:17:03
by bad4u
Are you testing in batch mode? Sorry, in that case there's one more thing to change if you use the modified script in batch mode:

        AnalyzeResultsPage(TextBetween(Value, '<a href="', 'maindetails"'));
must be (not tested yet)

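        begin
          // Take the title number (ttxxxxxxx) that follows /title/ in the Google result link
          Address := TextBetween(Value, '<a href="http://www.imdb.com/title/', '/');
          // Rebuild the standard IMDB address, dropping /combined, /maindetails etc.
          Address := 'http://www.imdb.com/title/' + Address + '/';
          AnalyzeResultsPage(Address);
        end;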
If you want to use 3.20 or 3.21 with the modifications just copy/paste the complete procedure AnalyzeGooglesResultsPage(GoogleAddress: string); part to the new script (and delete the original part) - or wait until tonight ;)

Posted: 2007-09-27 11:21:52
by dwayne2005
I'll wait until you're ready with a version, or tomorrow when I can work on it again. Thanks once again for your time!!

(BTW, wasn't in batch but I'd like to try out batch with your mods. No wait, it was in batch... but 1 time wasn't. :/)

Posted: 2007-09-27 21:15:09
by bad4u
Here's the modified script: http://www.bad4u.741.com/beta/IMDB_321_mod.ifs (copy link into new browser window)

I used the current IMDB version 3.21 and added all the Google modifications mentioned before (selecting the correct film URL and your /combined search). I did some short tests with and without batch mode and it seems to work pretty well (at least for the films I tested, including superman/returns, batman/begins and some more). I'll test more when I find some time for this, and if I don't find new problems - or maybe you - I'll change the Google search parameters in the official IMDB version on the next update, too.

Thanks so far, let me know about your test results too, please :)

Posted: 2007-09-28 06:08:38
by dwayne2005
I took a random assortment of 208 films and cancelled partway through the search, by which point it had reached 140. 5 of those titles I should have adapted from the IMDB titles, so I just omitted them.

135 films
13 Major or confusing errors

Major/confusing errors
-A Patch of Blue [Repulsion; not found in results]
-Dopperugengâ (Doppelganger) [ReAnimator ... "Re-Animator" works so I think this is fine, I type it that way anyway I just wanted to remove the dash to see the results and it was confusing so here it is]
-error ["The Queen of Sheba" (1952) finds a 2008 tbr page which causes an error and halts the processing; not found in the lists]
-Uchûjin Tôkyô ni arawaru (Space Men Appear in Tokyo) [Solar Crisis, not found in results, obscure film]
-Quills [found a featurette instead of the film; film not found in the non-batched results :/]
-error ["The Witches" found a 2008 tbr page, again halting the searching with an ugly error box; film from 1990 is not even to be found :/]
-Meat Loaf: To Hell and Back [Should be To Hell and Back (1955) with Audie Murphy, easy to identify but despite it being non-confusing "To Hell and Back" (1955) was not found among the results]
-The Fog of War: Eleven Lessons from the Life of Robert S. McNamara [search "Fog" as in "The Fog" (1980) ... this found with the search criteria "The Fog", I just used "Fog" for variety in the results, I would consider this pretty minor although confusing]
-"Star Trek" [should have been "Star!" (1968), the search criteria was "Star" but with or without the exclamation mark, it appeared low in the list, this would have had me puzzled]
-Hyde and Hare [searched "Dr Jekyll and Mr Hyde 1941"; not sure why it finds the 1931 version but not this one, also very interesting is that neither the 1931 nor the 1941 title appears in the non-batched list when the year is left out]
-The Arena [searched "Amazons and Gladiators"; not found in the results. :/]
-Conspirators of Pleasure [no movie found]
-Charlotte's Web [wrong version; interesting, the original version doesn't appear in the list, maybe Google search omits multiple results that share a similar name?]

This next batch of errors is just to be comprehensive, but I don't think these can be held against the Google Search...

Easy to identify/minor errors
-"South Park" [TV series, not movie]
-Rambo III [searching for "Rambo 3", found video game, not movie; "Rambo III" works]
-Othello [wrong year version, but to be expected]
-"The X Files" [instead of movie]
-The Guru [wrong version]
-Grudge [wrong version]
-Too Hot to Handle [wrong version]
-Christmas Vacation 2 [search "Christmas Vacation", correct film appears second in results; if you had entered "National Lampoon's" in the title as well, it would have been correct. Shouldn't be too hard to spot this mistake, assuming you don't have the turkey that is the second "Christmas Vacation" movie]
-The Chaplin Revue [should have been the film "Chaplin", but that's an understandable mistake, although it appears pretty far down the list]
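As a quick tally of the two lists above, the following Python sketch works out the error rates from the counts given in the post (135 films, 13 major errors, 9 minor errors as listed). The variable names are just for this example.

```python
TOTAL_FILMS = 135   # films actually compared, from the post
MAJOR_ERRORS = 13   # "Major/confusing errors" entries
MINOR_ERRORS = 9    # "Easy to identify/minor errors" entries


def rate(errors, total=TOTAL_FILMS):
    """Error rate as a percentage of the films tested."""
    return 100.0 * errors / total


if __name__ == "__main__":
    print("major: %.1f%%" % rate(MAJOR_ERRORS))
    print("major + minor: %.1f%%" % rate(MAJOR_ERRORS + MINOR_ERRORS))
```

So roughly one film in ten hit a major or confusing error with the Google-based search in this run.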

I'd like to batch IMDB as well using the same search titles, for a full comparison.

Posted: 2007-09-28 06:48:16
by bad4u
Films that are not found or appear low in the list depend on Google's search engine and result rankings - the script should deliver the same results as a manual search on Google's website. This can only be influenced by finding better parameters / keywords for the search.

About the "2008 tbr" thing - it should be easy to handle this like the "movie not found" on IMDB search to avoid error messages.

Posted: 2007-09-28 07:28:09
by dwayne2005
You're right, too; this at best will not replace the IMDb results. IMDb makes at most half of the errors.

O (Othello) ["Othello"; wrong version]
Guru [wrong version]
The Marrying Man (Too Hot To Handle) ["Too Hot to Handle"]
House of Sand and Fog ["Fog"]
Star Trek: The Wrath of Khan ["Wraith"]
Dîner de cons, Le (The Dinner Game) ["Diner"]
...one or two other mistaken versions... very minor.

I'll have to spend more time looking into refining it, though. Perhaps the best way of remedying the confusing results is simply having an option to place the Original Title search phrase in one of the unused fields.

Posted: 2007-09-28 08:12:38
by dwayne2005
The */combined bit is a bit of a problem.

I can detect more of them (The Witches, ReAnimator, but not To Hell and Back) using inurl:combined . Maybe negating inurl values will also yield better results, To Hell and Back for instance registers as an ordinary page, if you can negate the branch offs with inurl you might have the best criteria.

This Google query brought up an interesting Google message (on the 5th page of results), so I don't think that's going to work...
.. your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can't process your request right now.
site:www.imdb.com/title/* to hell and back -inurl:parentalguide -inurl:literature -inurl:goofs -inurl:synopsis -inurl:faq -inurl:alternateversions -inurl:movieconnections -inurl:episodes -inurl:trivia -inurl:taglines -inurl:miscsites -inurl:externalreviews -inurl:recommendations -inurl:quotes -inurl:combined -inurl:companycredits -inurl:dvd -inurl:business -inurl:locations -inurl:technical -inurl:keywords -inurl:usercomments -inurl:news -inurl:plotsummary -inurl:amazon -inurl:maindetails

btw, in the above, site:www.imdb.com/title/* without the asterisk also picks up some other pages formatted like this: www.imdb.com/title?...

The above search criteria I was able to append to your script, and the errors I got were:

The Queen of Sheba's Pearls ["Queen of Sheba"]
Rakvickarna ["Conspirators of Pleasure" (same director)]
Guru [wrong version]
Grudge [wrong version]
Star [wrong movie, same title ("Star!")]
Some Like It Hot [wrong movie; the correct one doesn't appear]
Monsters, Inc. (VG)

...very good results. There was a stall somewhere; I'll try to find out which film it happened on. Got to find out why Conspirators of Pleasure and Some Like It Hot fail, but otherwise... Maybe the `combined' and `maindetails' exclusions can come off? That will get you Monsters, Inc. and Some Like It Hot.

Posted: 2007-09-28 16:20:15
by dwayne2005
site:imdb.com/title
Some film titles are found under imdb.com/title? rather than www.imdb.com/title/, this encompasses those. With that URL, you should get all of the films I failed to find.

-inurl:pro
This omits all pro entries in the below searches. Since Google processes just the first 32 words in their search, I decided to use this url setting and count how many occurrences of the below there were in Google's cache. So this would be my suggested order (I hope I didn't miss any :hum:).

-intitle:IMDb (814,000) [this is search listings and some user comments]
-inurl:locations (1,480,000)
-inurl:keywords (988,000)
-inurl:usercomments (783,000)
-inurl:faq (768,000)
-inurl:dvd (675,000)
-inurl:news (623,000)
-inurl:business (577,000)
-inurl:goofs (513,000)
-inurl:taglines (498,000)
-inurl:quotes (481,000)
-inurl:fullcredits (472,000)
-inurl:trivia (392,000)
-inurl:ratings (360,000)
-inurl:recommendations (355,000)
-inurl:amazon (288,000)
-inurl:awards (273,000)
-inurl:synopsis (238,000)
-inurl:companycredits (213,000)
-inurl:soundtrack (206,000)
-inurl:newsgroupreviews (186,000)
-inurl:plotsummary (182,000)
-inurl:releaseinfo (157,000)
-inurl:technical (147,000)
-inurl:parentalguide (144,000)
-inurl:alternateversions (109,000)
-inurl:miscsites (87,300)
-inurl:sales (84,100)
-inurl:movieconnections (70,700)
-inurl:externalreviews (58,100)
-inurl:literature (48,500)
-inurl:crazycredits (32,800)
-inurl:episodes (21,000)

Does this sound tempting to you? I'm yet to put it together, but will run it through the same test as the others as soon as I get things sorted out.

site:imdb.com/title -inurl:pro -intitle:IMDb -inurl:locations -inurl:keywords -inurl:usercomments -inurl:faq -inurl:dvd -inurl:news -inurl:business -inurl:goofs -inurl:taglines -inurl:quotes -inurl:fullcredits -inurl:trivia -inurl:ratings -inurl:recommendations -inurl:amazon -inurl:awards -inurl:synopsis -inurl:companycredits -inurl:soundtrack -inurl:newsgroupreviews -inurl:plotsummary -inurl:releaseinfo -inurl:technical -inurl:parentalguide -inurl:alternateversions -inurl:miscsites -inurl:sales -inurl:movieconnections -inurl:externalreviews -inurl:literature -inurl:crazycredits -inurl:episodes
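The query above has 35 terms before any film title is added, which is over the 32-word limit mentioned in the post. The Python sketch below assembles the query from the listed exclusions, most frequent first, and stops at 32 terms. It assumes Google counts each operator as one word, which is a guess; the function name `build_query` is invented for this example.

```python
# Exclusion terms with the hit counts quoted in the post above.
EXCLUSIONS = """
-intitle:IMDb 814000
-inurl:locations 1480000
-inurl:keywords 988000
-inurl:usercomments 783000
-inurl:faq 768000
-inurl:dvd 675000
-inurl:news 623000
-inurl:business 577000
-inurl:goofs 513000
-inurl:taglines 498000
-inurl:quotes 481000
-inurl:fullcredits 472000
-inurl:trivia 392000
-inurl:ratings 360000
-inurl:recommendations 355000
-inurl:amazon 288000
-inurl:awards 273000
-inurl:synopsis 238000
-inurl:companycredits 213000
-inurl:soundtrack 206000
-inurl:newsgroupreviews 186000
-inurl:plotsummary 182000
-inurl:releaseinfo 157000
-inurl:technical 147000
-inurl:parentalguide 144000
-inurl:alternateversions 109000
-inurl:miscsites 87300
-inurl:sales 84100
-inurl:movieconnections 70700
-inurl:externalreviews 58100
-inurl:literature 48500
-inurl:crazycredits 32800
-inurl:episodes 21000
"""

MAX_WORDS = 32  # limit cited in the post; how operators are counted is an assumption


def build_query(title, site="site:imdb.com/title", always=("-inurl:pro",)):
    """Return a query of at most MAX_WORDS terms, biggest exclusions first."""
    pairs = [line.split() for line in EXCLUSIONS.strip().splitlines()]
    ranked = sorted(pairs, key=lambda pair: int(pair[1]), reverse=True)
    words = [site] + title.split() + list(always)
    budget = max(MAX_WORDS - len(words), 0)  # room left for exclusions
    words.extend(term for term, _count in ranked[:budget])
    return " ".join(words)


if __name__ == "__main__":
    query = build_query("to hell and back")
    print(len(query.split()), "terms")
    print(query)
```

Longer film titles leave less room, so the least common sub-pages are the first exclusions to be dropped.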

...those question mark findings won't work (or more precisely, ones without www.). :badidea:

EDIT: Won't even detect Invasion of the Body Snatchers (1978) with or without the www.

EDIT 2: Invasion of the Body Snatchers (1978) turns up with an extra asterisk after the title. :hum: This is getting a bit weird.

Posted: 2007-11-10 16:09:14
by nevr
bad4u wrote:Here are updated versions of IMDB script (3.20) and StringUtils1.pas (v.5). You will need to update both files to use the "RemoveAccents" function on IMDB script. Remember to set option "RemoveAccents" to "1" (it is "0" by default).

http://www.bad4u.741.com/IMDB.ifs
http://www.bad4u.741.com/StringUtils1.pas (copy links into a NEW browser window)
Is this version of the script available somewhere? The link http://www.bad4u.741.com/IMDB.ifs now serves version 3.22, and there is no ConvertToASCII option in it. Or maybe I just can't find it?

Posted: 2007-11-10 17:44:08
by bad4u
nevr wrote:Is this version of the script available somewhere? The link http://www.bad4u.741.com/IMDB.ifs now serves version 3.22, and there is no ConvertToASCII option in it. Or maybe I just can't find it?
You're right. It has been deleted when uploading another update. :D

I opened a new topic for those modifications, where you can find a new version of the modified script: viewtopic.php?p=24487#24487

Posted: 2007-11-10 19:05:31
by nevr
Thank you. It's tiresome enough to correct all the wrong letters by hand :)

Posted: 2007-11-29 11:22:08
by zivija
Today I found a problem with the IMDB script. I always use "1" in the Description Selection option, but now, if I choose 1 or 2, I get a blank description field. When I choose 0 (short summary from main page), everything is OK.

Can anyone fix this?

Thanx!!!

Posted: 2007-11-29 12:15:42
by bad4u
zivija wrote:Today I found a problem with the IMDB script. I always use "1" in the Description Selection option, but now, if I choose 1 or 2, I get a blank description field. When I choose 0 (short summary from main page), everything is OK.
If this is the only problem, here's a new version of IMDB script (v.3.23) :

http://www.bad4u.741.com/full/IMDB.ifs (copy link into a NEW browser window!)

@antp: I have no time to do more tests with the script now, so I'll do this tonight. If there is positive feedback from other users it could be uploaded, I think.

Posted: 2007-11-29 20:53:23
by bad4u
New IMDB script version 3.23 can be uploaded, I did not find more problems.
They only changed the address of the plotsummary page and the way they link to this page, but not the structure of the page itself. :)