6/28/2003

Apache, ProxyPass and Twisted

One of the ongoing problems with writing web applications in Woven, my Twisted web app server and templating system, is creating links to other web pages in the system without creating bugs in those links.

Twisted stores the 'host' header sent by the browser in the request. It also stores the URL path segments that were used to locate the current Resource object, as a list of strings, in request.prepath. request.postpath is a list of path segments that have not yet been handled, but this is usually not terribly interesting.
So, it is possible, for example, to construct sibling URLs, parent URLs, and child URLs by making a copy of request.prepath and performing list operations on it:
new = request.prepath[:]
new[-1] = 'someSibling.html'
Now that we have a list of path segments leading to a sibling Resource, we can construct a URL:
url = 'http://%s/%s' % (request.getHeader('host'), '/'.join(new))
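The two steps above can be wrapped in a pair of small helpers. This is only a rough sketch, not anything that exists in Twisted: the function names relative_segments and build_url, the example prepath, and the host example.com are all made up for illustration, and the functions take a plain list and string rather than a real Request so they can be run on their own.

```python
def relative_segments(prepath, kind, name):
    """Return a new segment list for a 'sibling' or 'child' of prepath.

    Hypothetical helper: prepath is a list of strings shaped like
    request.prepath. The input list is never modified.
    """
    new = list(prepath)  # copy, same idea as request.prepath[:]
    if kind == "sibling":
        if not new:
            raise ValueError("the root has no siblings")
        new[-1] = name  # swap out the last segment
    elif kind == "child":
        new.append(name)  # go one level deeper
    else:
        raise ValueError("unknown kind: %r" % (kind,))
    return new


def build_url(host, segments, scheme="http"):
    """Join a host and a list of path segments into a URL string."""
    return "%s://%s/%s" % (scheme, host, "/".join(segments))


# Assumed example values, standing in for request.prepath and the host header.
prepath = ["app", "users", "index.html"]
sibling = relative_segments(prepath, "sibling", "someSibling.html")
print(build_url("example.com", sibling))
```

Note that the scheme still has to be passed in by hand, and the host is taken on faith.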
While this works for simple cases, it breaks down when things get more complex. For example, the browser could be talking to the server over https, and while it is possible, it is non-trivial to detect this and construct a URL with the proper scheme. Duplicating this code throughout user-level code leads to many buggy implementations, with subtle problems that show up only under very specific configurations.
For example, it is often desirable to run twisted.web behind a reverse proxy, such as Apache with the ProxyPass directive. (Someone please correct me if this is not the correct terminology.) In this case, the user browses to a URL served by the Apache server, and Apache forwards the request to the port upon which Twisted Web is listening. However, while it is doing this, it replaces the "host" header originally sent by the browser with the address of the backend it is forwarding to, as named in the ProxyPass directive. This means that if we rely on the host header to construct our new URLs, they will reference the private Twisted server rather than the public Apache server.
When I first began working on the Twisted project, one of the classes I wrote was called PathReferenceAcqusitionContext. It had this long name for historical reasons, and I shall simply refer to it as PathRef. One of the abilities of this object, which you could construct by calling the pathRef method on the Request, was to generate other PathRef objects by calling methods such as child, sibling, parent, etc. Then, once you had navigated to the conceptual URL location you wanted, you could convert it to a URL string.
This implementation of PathRef was never documented, and had some other, unfortunate, unrelated properties which made it a perpetual pain in the ass. Glyph, finally fed up with the unused and problem-causing PathRef, removed it recently, and it is no longer in Twisted 1.0.6. Which is a good thing.
Glyph, however, does see the need for a way to talk about URLs abstractly in a secure and convenient way, and has kindly written the very good (in my opinion) Twisted/twisted/sandbox/paths.py, an implementation of all that was good about PathRef. I hope this gets moved into Twisted fairly soon, and that a method for generating a URLPath representing the current Resource is added to the Request.
This method, though, needs to understand that the 'host' header is not necessarily the address the outside world uses to reach us. I decided to do a little experiment to see what Apache does when performing a ProxyPass, and whether we could determine where the outside world considers our application to live. With Apache running on port 8081 on my machine and set up to forward requests to Twisted Web, I observed that Apache had inserted an additional header in the process of forwarding the request:
'x-forwarded-host': 'localhost:8081'
The factory method on the request which generates a URLPath object should check to see if 'x-forwarded-host' has been set, and compensate accordingly. This should be a good solution to a long-standing problem with using Twisted Web.
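The check described above might look something like the following. Again, this is a hedged sketch rather than the actual factory method: public_host is a hypothetical name, the headers are passed as a plain dict with lowercase keys instead of a real Request, and the backend port 8080 is an assumption (only Apache's port 8081 comes from my experiment).

```python
def public_host(headers):
    """Pick the host the outside world used to reach us.

    Hypothetical helper: headers is a dict with lowercase header names.
    Prefers 'x-forwarded-host' when a proxy has set it, and falls back
    to the plain 'host' header otherwise.
    """
    forwarded = headers.get("x-forwarded-host")
    if forwarded:
        # With several proxies chained, the header can hold a
        # comma-separated list; the first entry is taken here as the
        # host the browser originally asked for.
        return forwarded.split(",")[0].strip()
    return headers.get("host")


# Behind Apache: the host header names the backend (port assumed),
# while x-forwarded-host carries the public address.
proxied = {"host": "localhost:8080", "x-forwarded-host": "localhost:8081"}
print(public_host(proxied))

# Talking to Twisted Web directly: no forwarded header, so use host.
direct = {"host": "localhost:8080"}
print(public_host(direct))
```

One caveat with this approach: a browser that can reach Twisted Web directly can send any 'x-forwarded-host' it likes, so the header is only trustworthy when all traffic really does come through the proxy.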