Hashing Files for Auto-Importing

Post by **mjs7231** » 2009-05-12 21:30:52

The issue I have been pondering more and more lately is what a pain in the butt it is to actually get your movie collection into something like ANT, XBMC, or BOXEE. Everyone has a different method for auto-parsing file names to work out which IMDB page to go to.

I question the legality of a project like this, but MusicBrainz is doing exactly what I want, only for music: create a public / social database that maps the hash of a file to its IMDB page. It would be dead simple to write, and once the project gets going it would theoretically be self-sufficient. What are your thoughts and ideas?

Post by **antp** » 2009-05-15 16:43:47

It could be a good idea indeed... but who would manage that database? I guess I would have to host one? Or is there already a public database for that, like FreeDB for Audio CDs?
And computing the hash of large video files takes some time, no?

Post by **mjs7231** » 2009-05-22 03:19:49

Hashing large files does take some time, but I bet you can get away with hashing only the first few megabytes instead of the entire file. We're not going for data integrity, just a quick and easy way to say that this file relates to this row in the database.
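A minimal sketch of that partial-hash idea in Python. The function name `quick_hash`, the 4 MiB default, and the choice of SHA-1 are all assumptions made for illustration; nothing in the thread settles on a particular algorithm or prefix size.

```python
import hashlib

READ_SIZE = 64 * 1024  # read the prefix in 64 KiB pieces


def quick_hash(path, max_bytes=4 * 1024 * 1024):
    """Return a hex digest of at most the first max_bytes bytes of a file.

    Hypothetical helper: the prefix size and SHA-1 are assumed values.
    """
    digest = hashlib.sha1()
    remaining = max_bytes
    with open(path, "rb") as fh:
        while remaining > 0:
            block = fh.read(min(READ_SIZE, remaining))
            if not block:  # file is shorter than max_bytes
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()
```

Because only the prefix is read, the cost stays roughly constant regardless of file size. The trade-off is that two different files sharing the same first few megabytes would get the same hash, which is acceptable here only because the goal is identification rather than integrity.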

A database would need to be hosted someplace. I wouldn't mind setting one up. I was thinking of making it as open as possible, maybe something like openVideoDB.info or something silly like that.

Write a quick and dirty API to access the hashes along with the IMDB links, and some mechanism to make sure no one trashes it. Probably backing up once a day would be good enough. Perhaps a voting system where people can flag bad matches. And then stop, keep it simple.

Post by **bonienl** » 2009-05-22 07:59:55

This sounds like an excellent idea to me, if it can be extended to other online movie databases as well.

Not everyone uses IMDB; some people use a localized site instead. E.g. I am using 'moviemeter.nl'.

Not sure if I am asking too much now

Post by **antp** » 2009-05-22 11:29:38

Sounds like a good idea... if only I had time to make all that

Post by **mjs7231** » 2009-05-23 05:05:30

Yea,

It'll obviously need some way to extend to whatever service someone is using, and that's where it gets complicated. However, it might stay simple if you treat all the other sites as additions.

For example:
Table 1 has columns: hash, movie_index, votes
Table 2 has columns: movie_index, title, year, sites
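
The two tables above could be sketched with Python's built-in `sqlite3` module. The table names `hashes` and `movies`, the column types, and storing `sites` as a JSON text column are assumptions for illustration; only the column names come from the post.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS movies (
    movie_index INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    year        INTEGER,
    sites       TEXT  -- assumed: JSON object mapping site name to URL
);
CREATE TABLE IF NOT EXISTS hashes (
    hash        TEXT NOT NULL,
    movie_index INTEGER NOT NULL REFERENCES movies(movie_index),
    votes       INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (hash, movie_index)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keying `hashes` on the pair (hash, movie_index) lets several competing matches for the same file coexist, each with its own vote count.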

So the scenario would be as follows:
1.) I load some client app that hashes my movie and looks it up in the database.
2.) If it's found, we're done. We know what it is.
3.) If it's not found, I use whatever script ANT has to pull the movie information; the key point is that we only care about the final link and the name of the site it was pulled from.
4.) If there's more than one result, the user must choose (as they do now in ANT).
5.) The user selects the movie, the hash entry is added to Table 1, and the movie info is added to Table 2.

The votes column can be used for validation: say we need at least 5 people to agree that a file relates to a movie before we call the match accurate.
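
The lookup, submit, and voting steps could look roughly like the sketch below, using an in-memory SQLite database. The function names `lookup` and `submit`, the table names, matching movies by title and year, and storing `sites` as JSON are all assumptions for illustration; the threshold of 5 is the figure suggested above.

```python
import json
import sqlite3

MIN_VOTES = 5  # agreement threshold suggested in the post


def open_db():
    """Create an in-memory database with the two tables (names assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies (movie_index INTEGER PRIMARY KEY,
                             title TEXT, year INTEGER, sites TEXT);
        CREATE TABLE hashes (hash TEXT, movie_index INTEGER, votes INTEGER,
                             PRIMARY KEY (hash, movie_index));
    """)
    return conn


def lookup(conn, file_hash, min_votes=MIN_VOTES):
    """Return (title, year, sites) for the best trusted match, or None."""
    row = conn.execute(
        "SELECT m.title, m.year, m.sites FROM hashes h "
        "JOIN movies m ON m.movie_index = h.movie_index "
        "WHERE h.hash = ? AND h.votes >= ? "
        "ORDER BY h.votes DESC LIMIT 1",
        (file_hash, min_votes),
    ).fetchone()
    if row is None:
        return None
    title, year, sites = row
    return title, year, json.loads(sites)


def submit(conn, file_hash, title, year, site, url):
    """Record one user's match; a repeated match adds a vote."""
    row = conn.execute(
        "SELECT movie_index, sites FROM movies WHERE title = ? AND year = ?",
        (title, year),
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO movies (title, year, sites) VALUES (?, ?, ?)",
            (title, year, json.dumps({site: url})),
        )
        movie_index = cur.lastrowid
    else:
        movie_index, sites_json = row
        sites = json.loads(sites_json)
        sites.setdefault(site, url)  # remember links from additional sites
        conn.execute(
            "UPDATE movies SET sites = ? WHERE movie_index = ?",
            (json.dumps(sites), movie_index),
        )
    updated = conn.execute(
        "UPDATE hashes SET votes = votes + 1 WHERE hash = ? AND movie_index = ?",
        (file_hash, movie_index),
    ).rowcount
    if updated == 0:  # first time this hash is linked to this movie
        conn.execute(
            "INSERT INTO hashes (hash, movie_index, votes) VALUES (?, ?, 1)",
            (file_hash, movie_index),
        )
    conn.commit()
```

With this sketch, `lookup` keeps returning `None` for a hash until `submit` has been called for the same hash and movie `MIN_VOTES` times, so a client would fall back to the normal script search until enough users agree.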

Maybe I'll start a Google Code project to get this launched. It all seems so simple, but I know we'll hit snags along the way.