12/19/2008

lxml + eventlet mashup

Since Ian was kind enough to send me instructions that finally got me a working lxml (I had never been able to compile it before), I thought I'd write a quick scraper by mashing lxml together with eventlet.



The result is a thing of beauty:




from os import path
import sys

from eventlet import coros
from eventlet import httpc
from eventlet import util

from lxml import html

## Make httpc work -- I'll make it work without this soon
util.wrap_socket_with_coroutine_socket()

def get(linknum, url):
    # Fetch one file and save it under its basename in the current directory
    print "[%s] downloading %s" % (linknum, url)
    file(path.basename(url), 'wb').write(httpc.get(url))

def scrape(url):
    root = html.parse(url).getroot()
    # At most 8 downloads run at the same time
    pool = coros.CoroutinePool(max_size=8)
    linknum = 0
    for link in root.cssselect('a'):
        href = link.get('href', '')
        if href.endswith('.mp3'):
            linknum += 1
            pool.execute(get, linknum, href)
    # Block until every queued download has finished
    pool.wait_all()

if __name__ == '__main__':
    if len(sys.argv) == 2:
        scrape(sys.argv[1])
    else:
        print "usage: %s url" % (sys.argv[0], )


This script manages to max out my bandwidth -- 800KB/sec at home and 2.5MB/sec at work -- without breaking a sweat. It oscillates between about 10% and 20% CPU on my MacBook Pro. Nice!
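For anyone who wants to try the same pattern on Python 3 without eventlet's httpc and coros modules, here is a rough sketch using only the standard library. It is not the script above: it swaps the coroutine pool for a thread pool of 8 workers and lxml for html.parser, and the names LinkCollector, mp3_links, fetch_to_disk and download_all are all made up for illustration. Unlike the original, it also resolves relative links against the page URL.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen
import os


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def mp3_links(page_html, base_url):
    """Return absolute URLs for all links whose href ends in .mp3."""
    collector = LinkCollector()
    collector.feed(page_html)
    return [urljoin(base_url, h) for h in collector.hrefs if h.endswith(".mp3")]


def fetch_to_disk(url):
    """Download url into the current directory and return the filename."""
    filename = os.path.basename(urlsplit(url).path)
    with urlopen(url) as response, open(filename, "wb") as out:
        out.write(response.read())
    return filename


def download_all(urls, fetch=fetch_to_disk, max_workers=8):
    """Run fetch over urls with at most max_workers downloads in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

To use it, fetch the page yourself (for example with urlopen), pass the decoded HTML and the page URL to mp3_links, and hand the result to download_all. The fetch parameter exists so the pool can be exercised without touching the network.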



1 comment:

eric casteleijn said...

Hey Donovan,
If you're doing what I think you're doing, also have a look at barbipes:
http://code.google.com/p/barbipes/
It's an mp3 spider that has some smarts to avoid downloading the same songs over and over after you delete them, plus a few other things. It is one of my main sources of new music these days.
It doesn't use lxml yet (just throwing out the admittedly ugly regexes didn't merit a new dependency, in my case), and it's not exactly elegant, but it maxes out *my* glass fiber connection, and it does so without hitting any individual site too hard.