Amazon.ca screen-scraper shell script
Posted: 2013-07-05 03:15:05
Hi folks,
This afternoon I spent a couple of hours throwing together a *nix shell script that will take an input file with one barcode/UPC per line, and scrape Amazon.ca's site for information on the title (including cover picture, if one exists).
The information gets written out to a file that can then be imported into AMC. The script is by no means perfect or particularly elegant (and the information Amazon returns is not always "clean", particularly for older titles), but it does the job. It's also not a turn-key solution since it's not an AMC script (I am not a software developer and have not invested the time to learn the AMC script syntax... sorry).
It can be adapted as necessary for your needs. You just need a *nix system (or Cygwin on Windows) as this script relies on tools such as wget, grep, awk, and sed.
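Because the script leans on several external tools, a quick pre-flight check can save a confusing failure halfway through a long barcode list. The following is only a minimal sketch, assuming a POSIX shell; `check_tools` is a hypothetical helper name and is not part of the posted script. Note that the script also calls `cut` with `--complement`, which is a GNU extension, so merely having a `cut` on the PATH does not guarantee that option works.

```shell
#!/bin/sh
# check_tools (hypothetical helper): report every named tool that is not
# on the PATH and return non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools the scraper calls
if check_tools wget grep awk sed cut; then
    echo "all tools found"
fi
```

Running this once before the scraper tells you up front whether anything needs installing.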
Since I have over 250 titles (which is still far less than some friends of mine), using a barcode reader and this script saved me a LOT of time even though I now have to go back and clean up a few entries. Still much better than entering a few hundred manually! I hope that others in a similar situation to mine will find it helpful.
a canuck
Code:
#!/bin/sh
# Start with empty log and output files (": >" truncates without leaving a blank first line)
: > log.txt
: > output.csv
while IFS= read -r line; do
    [ -n "$line" ] || continue # skip blank lines in barcodes.txt
    # Search Amazon.ca for the barcode
    wget -q -O "searchresult$line" "http://www.amazon.ca/s/ref=nb_sb_noss/?url=search-alias%3Daps&field-keywords=$line"
    # Title of the first hit: strip HTML tags, then trim leading/trailing whitespace
    title=$(grep '<div class="productTitle">' "searchresult$line" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | awk 'BEGIN{FS="~"};{print $1}' | sed 's/^[ \t]*//;s/[ \t]*$//' | head -n 1)
    if [ -n "$title" ]; then # $title is not null so a search result was found
        echo "$title"
        # Link to the product page of the first hit (the last sed drops the surrounding quotes)
        url=$(grep 'div class="productTitle">' "searchresult$line" | sed -n "/href=/s/.*href=\([^>]*\).*/\1/p" | sed 's/.\(.*\)./\1/' | head -n 1)
        wget -q -O "infopage$line" "$url"
        # Each field: find its label, strip the tags, drop the label words, trim whitespace
        actors=$(grep "<b>Actors:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1,2 | sed 's/^[ \t]*//;s/[ \t]*$//')
        length=$(grep "<b>Run Time:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $3}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        lang=$(grep "<b>Language:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1 | sed 's/^[ \t]*//;s/[ \t]*$//')
        discs=$(grep "<b>Number of discs:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $4}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        directors=$(grep "<b>Directors:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | cut -d' ' --complement -f1,2 | sed 's/^[ \t]*//;s/[ \t]*$//')
        year=$(grep "<b>Release Date:</b>" "infopage$line" | sed 's/<[^>]\+>//g' | awk '{print $5}' | sed 's/^[ \t]*//;s/[ \t]*$//')
        imgurl=$(grep -B 1 "original-main-image" "infopage$line" | grep "src" | sed -n "/src=/s/.*src=\([^>]*\).*/\1/p" | sed 's/.\(.*\)./\1/')
        # sometimes there is no cover art picture available
        if [ -n "$imgurl" ]; then wget -q -O "$line.jpg" "$imgurl"; fi
        # Output information to file (semicolon-separated, one title per row)
        printf '%s;%s;%s;%s;%s;%s;%s;%s\n' "$title" "$actors" "$length" "$lang" "$discs" "$directors" "$year" "$line.jpg" >> output.csv
        rm -f "infopage$line"
        echo "$line found as $title" >> log.txt
    else
        echo "$line not found at Amazon" >> log.txt
    fi
    rm -f "searchresult$line"
done < barcodes.txt
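
Most of the script is the same pattern repeated: grep for a label, strip the HTML tags with sed, then pick out the value. The sketch below shows that pattern in isolation so it is easier to adapt. It is not the posted method: `extract_field` is a hypothetical helper, it removes the label by name instead of counting words with cut/awk, and the sample markup is made up purely to show the mechanics (real Amazon pages may look different).

```shell
#!/bin/sh
# extract_field (hypothetical helper): print the text that follows
# "<b>LABEL</b>" on the first matching line of FILE, with HTML tags
# stripped and surrounding whitespace trimmed.
# Assumes LABEL contains no characters that are special to sed (such as "/").
extract_field() {
    label=$1
    file=$2
    grep -F "<b>$label</b>" "$file" \
        | head -n 1 \
        | sed -e 's/<[^>]*>//g' \
              -e "s/^.*$label//" \
              -e 's/^[[:space:]]*//' \
              -e 's/[[:space:]]*$//'
}

# Made-up sample lines, only for demonstration
sample=$(mktemp)
printf '%s\n' \
    '<li> <b>Run Time:</b> 120 minutes </li>' \
    '<li> <b>Actors:</b> Jane Doe, John Roe </li>' > "$sample"

extract_field "Run Time:" "$sample"   # prints: 120 minutes
extract_field "Actors:" "$sample"     # prints: Jane Doe, John Roe
rm -f "$sample"
```

If only the number is wanted, the result can still be piped through something like `awk '{print $1}'`. Removing the label by name rather than by word position is a design choice: it keeps working when a label has a different number of words.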