Amazon.ca screen-scraper shell script

Post by **canuck** » 2013-07-05 03:15:05

Hi folks,

This afternoon I spent a couple of hours throwing together a *nix shell script that takes an input file (barcodes.txt) with one barcode/UPC per line and scrapes Amazon.ca for information on each title (including the cover picture, if one exists).
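
Because the script expects exactly one barcode per line, a quick check of the input file can catch scanner glitches before any pages are fetched. This is only a minimal sketch: the function name check_barcodes is made up for illustration, and it assumes every code is a 12-digit UPC-A or a 13-digit EAN-13.

```shell
# check_barcodes FILE: print "line number: content" for every line that is
# not a 12- or 13-digit code (assumption: only UPC-A and EAN-13 are expected).
check_barcodes() {
    awk '
        { sub(/\r$/, "") }    # ignore carriage returns from Windows editors
        !/^[0-9]+$/ || (length($0) != 12 && length($0) != 13) { print NR ": " $0 }
    ' "$1"
}
```

Running `check_barcodes barcodes.txt` prints nothing when every line looks like a valid code.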

The information is written to a semicolon-delimited file (output.csv) that can then be imported into AMC. The script is neither perfect nor particularly elegant, and the information Amazon gives is not always "clean" (particularly for older titles), but it does the job. It's also not a turn-key solution, since it's not an AMC script (I am not a software developer and have not invested the time to learn the AMC script syntax... sorry).

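It can be adapted as necessary for your needs. You just need a *nix system (or Cygwin on Windows), as the script relies on tools such as wget, grep, awk, sed, and cut. Some of the options used, such as cut's --complement, are GNU extensions, so the GNU versions of these tools are the safest bet.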
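
To confirm the required tools are installed before a long run, something like the following could be used. It is a sketch only: missing_tools is a hypothetical helper name, and it checks that each command is on the PATH, not that it is the GNU version.

```shell
# missing_tools NAME...: print each named command that is not on the PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling `missing_tools wget grep awk sed cut` prints nothing if everything the script uses is available.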

Code:

#!/bin/sh
# Reads one barcode/UPC per line from barcodes.txt, looks each one up on
# Amazon.ca, and appends one semicolon-separated record per title to output.csv.

# Start with empty log and output files
: > log.txt
: > output.csv

while IFS= read -r line || [ -n "$line" ]; do
    line=$(printf '%s' "$line" | tr -d '\r')   # drop carriage returns from Windows-edited files
    [ -n "$line" ] || continue                 # skip blank lines

    # Search Amazon.ca for the barcode
    wget -q -O "searchresult$line" "http://www.amazon.ca/s/ref=nb_sb_noss/?url=search-alias%3Daps&field-keywords=$line"
    # Title of the first result: strip HTML tags, keep the text before any "~", trim whitespace
    title=$(grep -m 1 "<div class=\"productTitle\">" "searchresult$line" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | awk 'BEGIN{FS="~"};{print $1}' | sed 's/^[ \t]*//;s/[ \t]*$//')
    if [ -n "$title" ]; then # $title is not empty, so a search result was found
        echo "$title"
        # Link to the first result's detail page, with the surrounding quotes removed
        url=$(grep -m 1 "div class=\"productTitle\">" "searchresult$line" | sed -n "/href=/s/.*href=\([^>]*\).*/\1/p" | sed 's/.\(.*\)./\1/')
        wget -q -O "infopage$line" "$url"
        # Each detail sits on a line labelled like <b>Actors:</b>: strip the tags, drop the label, trim
        actors=$(grep "<b>Actors:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1,2 | sed 's/^[ \t]*//;s/[ \t]*$//')
        length=$(grep "<b>Run Time:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $3}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        lang=$(grep "<b>Language:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1 | sed 's/^[ \t]*//;s/[ \t]*$//')
        discs=$(grep "<b>Number of discs:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $4}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        directors=$(grep "<b>Directors:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1,2 | sed 's/^[ \t]*//;s/[ \t]*$//')
        year=$(grep "<b>Release Date:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $5}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        imgurl=$(grep -B 1 "original-main-image" "infopage$line" | grep "src" | sed -n "/src=/s/.*src=\([^>]*\).*/\1/p" | sed 's/.\(.*\)./\1/')
        # Sometimes there is no cover art available; the picture field is then left empty
        cover=""
        if [ -n "$imgurl" ]; then wget -q -O "$line.jpg" "$imgurl" && cover="$line.jpg"; fi

        # Output information to file
        printf '%s;%s;%s;%s;%s;%s;%s;%s\n' "$title" "$actors" "$length" "$lang" "$discs" "$directors" "$year" "$cover" >> output.csv
        rm "infopage$line"
        echo "$line found as $title" >> log.txt
    else
        echo "$line not found at Amazon" >> log.txt
    fi
    rm "searchresult$line"
done < barcodes.txt
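
To see what the field pipelines in the script do, it helps to run one on a single saved line of HTML instead of a live page. The sketch below is illustrative only: strip_and_trim is a made-up helper name, and the sample line is invented, so it is just a guess at what the markup looked like.

```shell
# strip_and_trim: remove HTML tags from stdin, then trim leading/trailing blanks
strip_and_trim() {
    sed 's/<[^>]*>//g' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

# Hypothetical detail-page line (not captured from Amazon)
sample='<li> <b>Run Time:</b> 142 minutes </li>'

# After stripping, the text is "Run Time: 142 minutes";
# awk's third whitespace-separated field is the number of minutes.
printf '%s\n' "$sample" | strip_and_trim | awk '{print $3}'
```

This prints 142. If Amazon changes its markup, testing a saved line this way is a quick way to work out which field number or label needs adjusting.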

Since I have over 250 titles (which is still far fewer than some friends of mine have), using a barcode reader and this script saved me a LOT of time, even though I now have to go back and clean up a few entries. Still much better than entering a few hundred manually! I hope that others in a similar situation will find it helpful.
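
For the clean-up pass, the two files the script writes can be used to find the entries that need attention. This is a rough sketch with hypothetical helper names (incomplete_records, not_found); it relies only on the semicolon-separated layout of output.csv and the "not found at Amazon" message in log.txt.

```shell
# incomplete_records FILE: print records that have at least one empty field
incomplete_records() {
    awk -F';' '{ for (i = 1; i <= NF; i++) if ($i == "") { print; next } }' "$1"
}

# not_found LOG: print the barcodes that returned no search result
not_found() {
    sed -n 's/ not found at Amazon$//p' "$1"
}
```

`incomplete_records output.csv` lists the titles with missing details, and `not_found log.txt` lists the barcodes that still have to be entered by hand.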

a canuck