Since Ian was kind enough to post instructions that finally got me a working lxml (I had never managed to compile it before), I thought I'd write a quick scraper by mashing lxml together with eventlet.
The result is a thing of beauty:
from os import path
import sys

from eventlet import coros
from eventlet import httpc
from eventlet import util

from lxml import html

## Make httpc work -- I'll make it work without this soon
util.wrap_socket_with_coroutine_socket()

def get(linknum, url):
    # each download runs in its own coroutine; httpc.get yields
    # to the others whenever it's waiting on the network
    print "[%s] downloading %s" % (linknum, url)
    file(path.basename(url), 'wb').write(httpc.get(url))

def scrape(url):
    root = html.parse(url).getroot()
    # at most 8 downloads in flight at once
    pool = coros.CoroutinePool(max_size=8)
    linknum = 0
    for link in root.cssselect('a'):
        url = link.get('href', '')  # assumes the hrefs are absolute URLs
        if url.endswith('.mp3'):
            linknum += 1
            pool.execute(get, linknum, url)
    pool.wait_all()

if __name__ == '__main__':
    if len(sys.argv) == 2:
        scrape(sys.argv[1])
    else:
        print "usage: %s url" % (sys.argv[0], )
This script manages to max out my bandwidth -- 800KB/sec at home and 2.5MB/sec at work -- without breaking a sweat. It oscillates between about 10% and 20% CPU on my MacBook Pro. Nice!
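If you're on a newer eventlet where coros and httpc have been reshuffled, the same pattern should translate to GreenPool plus a monkey-patched urllib2. A rough sketch, untested against any particular eventlet release:

import sys
from os import path

import eventlet
eventlet.monkey_patch()  # make stdlib sockets cooperative

# after monkey_patch, plain urllib2 yields to other coroutines on I/O
import urllib2

from lxml import html

def get(linknum, url):
    print "[%s] downloading %s" % (linknum, url)
    open(path.basename(url), 'wb').write(urllib2.urlopen(url).read())

def scrape(url):
    root = html.parse(url).getroot()
    pool = eventlet.GreenPool(8)  # at most 8 downloads in flight
    linknum = 0
    for link in root.cssselect('a'):
        href = link.get('href', '')
        if href.endswith('.mp3'):
            linknum += 1
            pool.spawn(get, linknum, href)
    pool.waitall()

if __name__ == '__main__':
    scrape(sys.argv[1])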
1 comment:
Hey Donovan,
If you're doing what I think you're doing, also have a look at barbipes:
http://code.google.com/p/barbipes/
It's an mp3 spider with some smarts: it avoids downloading the same songs over and over after you've deleted them, among a few other things. It's one of my main sources of new music these days.
It doesn't use lxml yet (just throwing out the admittedly ugly regexes didn't merit a new dependency, in my case), and it's not exactly elegant, but it maxes out *my* glass fiber connection, and it does so without hitting any individual site too hard.