the web server
You can have a second computer once you’ve shown you know how to use the first one.
—Paul Barham1
This document tells how a web server is used to provide willshake.net to people over the internet.
1 motivation
A web server is a program that handles the business end of an HTTP conversation. The other end is the client.
In one respect, a web server is a way to deliver a big thing in little pieces. If the mountain won’t come to us, then we must go to the mountain.
So, one needs a web server to reach people.
The particulars of the web server are not important. I don’t care which web server is used. I’ve changed it before, and I fully expect to change it again.[2]
The objectives here are to
- set up and maintain a production server
- establish a lightweight workflow for development
- support the modification of server configuration by other features
2 HTTP
This section is a work in progress.
I’ve already talked some about the web as a platform, which is really focused on what you can do with the web now—or at least, the subset that willshake takes part in. In that sense, the “platform” is the part closest to the user.
This is more about how things are done, which is further from the user. And yet, the thing that makes the web work, a protocol called “HTTP”, is somewhat user-facing. Surely you’ve heard of HTTP? It’s that little incantation at the beginning of all web addresses.
What problem did HTTP solve?
In 1990, although “personal computing” was becoming common, most people still used computers through institutions—particularly government agencies and universities. Those institutions would have a central computer (a mainframe), which was connected to a number of “dumb terminals.” Any of the users could sit at any of the dumb terminals, log into their account, and access their files.
Such systems conventionally provided each user a “home directory,” where personal files could be stored.
So the Globe Theater Company’s computer would have accounts for the company’s managers (which included Shakespeare) and maybe the actors as well. Shakespeare’s “home directory” would probably be called ~wshaksper.
But what if Mr. Shaksper wanted to access his files from his home in Stratford? The “dumb terminals” were all directly wired to the Globe’s mainframe. But there was a thing called “the Internet,” which connected a growing number of computers all across the world. Essentially, HTTP was just a way of publishing documents so that anyone on the internet could access your home directory (or some directory on your system).
Insert stuff about a global addressing scheme.
A web server, then, was just an Internet application that listened for requests and responded with files.
As you probably know, that’s not the end of the story. Many web requests are still served from files.
But at some point, someone realized that the server didn’t have to return files from the system. It could in fact do whatever it wanted to. It could just make something up and return that.
(To be continued…)
On the division between “filesystem” and “webspace”, see https://httpd.apache.org/docs/2.4/sections.html#file-and-web
The filesystem is the view of your disks as seen by your operating system.
In contrast, the webspace is the view of your site as delivered by the web server and seen by the client… The webspace need not map directly to the filesystem, since webpages may be generated dynamically from databases or other locations.
It’s this division that made “Web 2.0” possible. Web servers were able to become something fundamentally different than mere file servers—with no change to the client. In other words, a browser made in 1993, before the existence of “dynamic” webpages, could be used in 199x when many servers were serving dynamic pages, and you wouldn’t know the difference.
3 scenarios
This is about modes of operation.
There are several ways that you can serve willshake over the web. All of them have been used at some point.
- static files only
- this is the most robust way to serve a web site, and it
works, as long as you’re able to render all of the possible pages in the
site. In fact, it’s a motivator to avoid any “unbounded vector” of pages
(like search queries, or combinatorial paths) for as long as possible.
It’s easy to set up and uses minimal compute and memory resources. But it
requires a lengthier preprocess and requires the transfer and storage of
many more bytes.
In any case, you always have to serve some static files, so this is a baseline level.
- dynamic pages
- by running programs on the server, you can create pages on demand. There are many advantages to this: you can ship a very small amount of code to serve a large number of pages, and you can handle sets of pages that can’t be enumerated ahead of time (such as the set of all search queries). The downside is that it’s more complicated to set up, it uses more compute and memory resources, and there are more ways for things to go wrong.
- hybrid
- A hybrid setup is just a dynamic site with some or all of its pages pre-rendered and served as static files. This would allow you to handle the most common pages with little overhead, while still being able to deal with the “long tail” of pages that you can’t or don’t want to pre-render.
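The hybrid scenario can be sketched in a few lines of Python. This is only an illustration of the idea, not how willshake does it: the function name `serve`, the `render` callback, and the returned labels are all made up for the example.

```python
import os


def serve(request_path, document_root, render):
    """Answer a request the "hybrid" way.

    If a pre-rendered file exists under `document_root`, return its bytes.
    Otherwise fall back to `render`, a hypothetical function that builds
    the page on demand. Returns a (kind, body) pair.
    """
    root = os.path.abspath(document_root)
    candidate = os.path.abspath(os.path.join(root, request_path.lstrip("/")))
    # Only serve files that really live inside the document root.
    inside_root = os.path.commonpath([root, candidate]) == root
    if inside_root and os.path.isfile(candidate):
        with open(candidate, "rb") as static_file:
            return "static", static_file.read()
    return "dynamic", render(request_path)
```

With `render` left out of the picture (say, a function that always reports "not found"), the same code describes the static-only scenario.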
4 server-free
This is really about an alternative to http called the file protocol. The bottom line is that you can’t run willshake at all using the file protocol, although the barrier to doing it is fairly thin.
You can almost run willshake without a server. Almost.
Why can’t you? Some web sites can work over the local file system. Documentation is commonly shipped this way. You just open an .html file, and you can navigate a whole collection of documents. Willshake goes out of its way to be deployable as static files, so you might expect this to work.
What happens if you just open up the site/index.html in a browser? You’ll see willshake’s “home page,” but only as a plain HTML document. It will look “plain” because the stylesheets won’t be located (since they use rooted paths). And even though willshake can function as a “single-page app,” it can only do that if the scripts are loaded, which will fail for the same reason as the stylesheets. And the links won’t work, for the same reason. In short, willshake can’t be run over the file:// protocol, because it uses rooted paths all over the place. It expects to be served from the root of a domain, and right now that’s a hard requirement.
So I only mention this for completeness, since this would be the simplest scenario. There is currently no “server-free” willshake, thanks to those rooted paths. Sure, that could probably be “fixed” without too much trouble, but I can’t say what the benefit would be, other than to remove that Cleopatric “almost.”
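The trouble with rooted paths can be seen by resolving them the way a browser would. The sketch below uses Python’s `urllib.parse.urljoin`; the directory `/home/user/willshake/site` and the stylesheet path are invented for the example and are not willshake’s actual layout.

```python
from urllib.parse import urljoin

# Hypothetical location of the built site on a local disk.
LOCAL_PAGE = "file:///home/user/willshake/site/index.html"
SERVED_PAGE = "http://localhost:8000/index.html"


def resolve(reference, page):
    """Resolve a link or asset reference against the page that contains it."""
    return urljoin(page, reference)


# Over HTTP, a rooted path lands at the root of the site, as intended.
print(resolve("/static/style.css", SERVED_PAGE))
# http://localhost:8000/static/style.css

# Over file://, the same rooted path lands at the root of the filesystem.
print(resolve("/static/style.css", LOCAL_PAGE))
# file:///static/style.css

# A relative path would have stayed inside the site directory.
print(resolve("static/style.css", LOCAL_PAGE))
# file:///home/user/willshake/site/static/style.css
```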
5 a static file server
The simplest way to run willshake is with a static file server.[3] Any one will do.
Once the site directory is built, you can use Python’s built-in server, for example.
cd site
python3 -m http.server
Uh, right, except it’s not, because of the “coherence.” That breaks this case.
That serves on port 8000 by default, so you should be able to browse to
http://localhost:8000
and see the site.
Because willshake works as a “single-page app,” you can use the whole thing that way. It looks exactly like the site online, only without any network latency.
But there’s a catch. You can only enter through the entryway (i.e., the home page). That’s the “single page.” You can’t go directly to an inner page. For example, if you wanted to go directly to the second scene of Hamlet, which would be at http://localhost:8000/plays/Ham/1.2, it won’t know what you’re talking about, because that’s not a file. You can get to it, by starting at the beginning and following links, but you can’t “drop in” there. Same if you hit “refresh” on any (inner) page.
That may sound like a showstopper, but in fact that’s exactly how the app works. You always begin at the beginning. I hear it’s a very good place to start.
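To see the catch for yourself, here is a small sketch that starts the same built-in Python server from code and asks it for two addresses. The helper names `start_static_server` and `status_of` are my own, and the directory you pass in is whatever holds your built site.

```python
import functools
import http.client
import http.server
import threading


def start_static_server(directory, port=0):
    """Serve `directory` on localhost. Port 0 lets the system pick a free port."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def status_of(port, path):
    """Return the HTTP status code of a GET request for `path`."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", path)
        return connection.getresponse().status
    finally:
        connection.close()
```

Given a directory containing an index.html, `status_of(port, "/")` comes back as 200, while `status_of(port, "/plays/Ham/1.2")` comes back as 404, because no such file exists on disk.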
5.1 other possibilities
Actually, you can still serve the site (if not “run” it) with just a static file server, by creating all of the pages in advance, and serving them as static files. That’s exactly what’s done with the current production site, and it’s covered in the section pre-rendering the site of web performance.
One final note before I let go of the static server. You can still support inner pages without server-side rendering, with a couple of very small modifications. First, you’d need to direct all inner requests back to the home page, but with the original request attached (say, as a hash or a query). Then you’d need a startup script to go to that address, which you can do using getflow.go(to_path). I haven’t done this, though.
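Since this hasn’t been done, the following is only a guess at what the server half of that idea might look like, using the hash variant. The function names are hypothetical; the client half (a startup script that reads the hash and hands it to getflow) is not shown.

```python
from urllib.parse import quote, unquote


def entry_redirect(request_path):
    """Build a redirect target that sends an inner request to the home page,
    carrying the original path along as a hash."""
    return "/#" + quote(request_path, safe="/")


def path_from_hash(fragment):
    """Recover the original path from the hash (what a startup script would
    pass on to the client-side router)."""
    return unquote(fragment.lstrip("#"))


print(entry_redirect("/plays/Ham/1.2"))    # /#/plays/Ham/1.2
print(path_from_hash("#/plays/Ham/1.2"))   # /plays/Ham/1.2
```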
6 dynamic
At some point, though, static files aren’t going to cut it. The whole idea of a “dynamic” web site is that you can go directly to /plays/Ham/1.2 (or any other page within the site) and the server will create that page on demand. This is often called “server-side rendering.”
Willshake was made with this in mind. It uses a purpose-built framework called getflow that makes both the client and server modes first-class. You can read more about it there.
Running willshake is just a matter of running getflow. Getflow is very small, and most of what it does is purely a function of XML. Getflow is 99% agnostic of anything that has to do with serving web pages. But at some point, it has to serve web pages.
7 WSGI
In the Python world, WSGI (Web Server Gateway Interface) is the go-to option for talking to web servers. Or at least, that was the intention when it was proposed in 2003. The problem then was that, although there was an abundance of frameworks, they were generally married to a specific server. That was seen as limiting the freedom of developers, who don’t like the idea of being stuck with something forever, especially when that thing is just the baggage of the thing they wanted in the first place. The idea of WSGI was that, if everyone (both servers and frameworks) got on board and followed this specification, then people could mix and match servers and frameworks, and everyone would be happy. The plan succeeded.[4]
One of WSGI’s design goals was that it “must be easy to implement” for both servers and frameworks. Since getflow is kind of a do-nothing framework (as far as HTTP is concerned), it’s a pretty good marriage: getflow’s WSGI integration is about thirty lines. The whole thing—including getflow itself—is bundled into one file. Deploying it is just a matter of copying to the server.
: $(ROOT)/getflow/<wsgi> \
|> ^ ship %b ^ cp %<wsgi> %o \
|> $(ROOT)/server/getflow/getflow.wsgi
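For a sense of how little WSGI asks of the application side, here is a minimal sketch of a WSGI application. It is not getflow’s integration: `render_page` is a hypothetical stand-in for whatever the framework does to turn a path into a page.

```python
import html


def render_page(path):
    """Hypothetical stand-in for the framework: turn a request path into HTML."""
    safe_path = html.escape(path)
    return "<!DOCTYPE html><title>{0}</title><p>You asked for {0}</p>".format(safe_path)


def application(environ, start_response):
    """The whole WSGI contract: take the request environment and a callback,
    report a status and headers, and return the body as an iterable of bytes."""
    path = environ.get("PATH_INFO", "/")
    body = render_page(path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The name `application` is what mod_wsgi looks for by default. For a quick local check, the standard library’s `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` will serve the same callable without Apache.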
But that’s only one side of it. WSGI has to plug into a server, too. Which one? Well, any one that supports WSGI, which is pretty much all of them. That’s the cost of freedom. You’re not finished making choices yet.
Well, in the Linux world, Apache is the go-to web server. And while I’ve tried to make willshake creative and original in some ways, this won’t be one of them.
But the choices just don’t stop. And for connecting WSGI to Apache, the choice is not 100% go-to.
You have
1. mod_wsgi, an Apache module
2. mod_wsgi-express, a simpler version of #1.
Both were created by Graham Dumpleton, and they’re both ultimately the same module. The “express” extension is supposed to be easier to use, and it is. Graham Dumpleton himself makes the plain-old mod_wsgi setup sound rather terrifying, and frankly sounds ready to deprecate it in favor of express.[5]
I use both methods. The first one is covered here. The second one is covered in the development server.
To install mod_wsgi for the first method, just
sudo apt-get install libapache2-mod-wsgi
It’s very quick.
7.1 configuration
The absolute bare minimum way to use mod_wsgi is the WSGIScriptAlias directive. For willshake, that would theoretically mean mapping the whole site (/) to getflow.
WSGIScriptAlias / ${SITE_ROOT}/server/getflow/getflow.wsgi
But that is no good! That would indiscriminately send everything to getflow. And this web site, as discussed earlier, is a mix of static and dynamic files.
But I have no idea—or rather, I want to have no idea—which locations are dynamic and which ones are static. So instead of mapping getflow to specific locations, I’ll just map it to a handler.
WSGIHandlerScript getflow ${SITE_ROOT}/server/getflow/getflow.wsgi
That doesn’t actually give getflow any work to do. It just makes a handler named getflow available. And unlike the WSGIScriptAlias approach, Apache lets you set handlers conditionally.
# Use getflow if the file doesn't exist
<If "! -f %{REQUEST_FILENAME}">
SetHandler getflow
</If>
So actual files will be handled exactly as before, without getflow ever knowing about it. Everything else is getflow’s problem.[6]
If you try that, you may get a warning about running in “embedded mode.” The alternative to embedded mode is “daemon mode.” This has to do with whether the process running your application is owned by Apache, or by a separate “daemon.” It’s an easy choice, since the creator of mod_wsgi recommends against embedded mode so strongly that it’s not clear it will remain a supported feature. I have no reason to object.[7]
WSGIDaemonProcess ${DOMAIN} home='${DOCUMENT_ROOT}'
WSGIProcessGroup ${DOMAIN}
WSGIApplicationGroup %{GLOBAL}
There are a bunch of other options for WSGIDaemonProcess, but inasmuch as I don’t know anything about them, I leave them unset for now. These are settings from the configuration that’s generated by mod_wsgi-express.
processes=1 \
threads=5 \
display-name=%{GROUP} \
user='gavin' \
group='gavin' \
maximum-requests=0 \
It would be worth comparing the generated config from live to the generated config from local. It includes a lot that I’m pretty sure doesn’t apply to me. But one setting can make a big difference.
That’ll do it. But I wrap the whole thing inside of a conditional. That way, if you want to turn off WSGI for some reason, you can just Define NO_WSGI in the site’s main configuration. You’d have to do that yourself, since it goes into the protected Apache configuration area.
<IfDefine !NO_WSGI>
<<wsgi configuration>>
</IfDefine>
I had to add this (to the protected config) to get it working on local. I expect that the path would be different on remote, if it’s necessary at all.
LoadModule wsgi_module '/usr/local/lib/python2.7/dist-packages/mod_wsgi-4.4.22-py2.7-linux-x86_64.egg/mod_wsgi/server/mod_wsgi-py27.so'
You only have to add this once per server (I think). That is (I think) adding this for one site will mean it’s loaded for another.
8 Apache
To use Apache, you have to of course… have Apache. I’m not going to talk about installing Apache because there are many good ways to do it, which change over time and vary from system to system. If you can do any of this, you can figure out how to install Apache.
The end result will be some configurations.
In order to work in different environments (development, staging, production), a few variables are assumed by the “common” configuration files. SITE_ROOT and DOMAIN are set in the protected configuration.
Define DOMAIN willshake.net
Define SITE_ROOT /var/www/${DOMAIN}
These are the parts that may differ from one web site to the next. I try to keep that to a minimum because it’s the part that has to be manually maintained.
Once you have Apache, you can create a web site by adding configuration files.
Usually, Apache configuration will be under /etc/apache2. To create a site, you put a conf file into the sites-available folder, then enable the site. I can’t do either of those things for you through the build process, since they require root access and are outside of the folder. But this is what it would look like.
Include ${SITE_ROOT}/server/static.conf
That would go into /etc/apache2/sites-available. And you can enable the site using the Apache command a2ensite.
sudo a2ensite willshake.net
Of course, then you’ll have to either start or restart Apache.
sudo service apache2 restart
Now, that “configuration” really doesn’t do anything but refer to another configuration file. That’s because I’m trying to minimize the amount of server-specific configuration and out-of-band steps in deployment. By pointing to a configuration file that’s controlled by a build (or deployment) process, most changes can still be automatic.
Thanks to the variables defined in the referring configuration, this can be written without any knowledge of the specific location of the site.
<VirtualHost *:80>
<<essential configuration>>
Include ${SITE_ROOT}/server/httpd.conf
</VirtualHost>
I know, I know, another layer of indirection. Why?! I promise, it’ll be clear one day. (The real reason is that I’ll have to create another VirtualHost to support https, and I don’t want to have to repeat all that configuration).
Further to the “environment variables” set above, DOCUMENT_ROOT is set here, based on SITE_ROOT.
Don’t be confused by the fact that mod_rewrite defines a special variable called %{DOCUMENT_ROOT}. It is not available outside of that module.
Define DOCUMENT_ROOT ${SITE_ROOT}/public
# This is the only thing that's always needed.
DocumentRoot ${DOCUMENT_ROOT}
# If you're going to run multiple sites from the same server, you need to bind
# this virtual host with a name.
ServerName ${DOMAIN}
Any additional configuration is considered boilerplate. In fact, even ServerName may be considered boilerplate, for reasons I won’t say right now.
Turning this off too. In practice, I have not actually used logs. Why would I? It’s just more work. And this doesn’t work with express. I could do it conditionally if the directory exists, I guess. Or set a specific flag for it. Or just get rid of it.
So I’ll even use these little snippets for basic stuff like logging.
#ErrorLog ${SITE_ROOT}/log/error.log
#LogLevel info
This requires write permission on the directory.
9 support more configuration
You might have all sorts of reasons for modifying Apache configuration.
As for that Apache configuration, we’ll use the same “additive” approach that we use elsewhere. That is, we’ll set a location where you can write configuration rules, then we’ll bundle them together into the file that gets used.
: $(ROOT)/server/httpd.conf.d/<all> \
|> ^o bundle Apache configuration ^ \
cat `echo %<all>` > %o \
|> $(ROOT)/server/httpd.conf
The order of the rules should not matter.
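The build rule above does the bundling with cat. For readers who don’t use that build tool, roughly the same thing in Python might look like this. It assumes the rules are plain files sitting in one directory, which is a simplification of how the build actually collects them, and the function name is mine.

```python
import glob
import os


def bundle_configuration(fragment_dir, output_path):
    """Concatenate every file in `fragment_dir` into `output_path`.

    Returns the list of fragment paths in the order they were written.
    """
    # Sorting isn't needed for correctness (the rules should not depend on
    # order), but it makes the bundled file reproducible from run to run.
    fragments = sorted(
        path
        for path in glob.glob(os.path.join(fragment_dir, "*"))
        if os.path.isfile(path)
    )
    with open(output_path, "w", encoding="utf-8") as bundle:
        for path in fragments:
            with open(path, encoding="utf-8") as fragment:
                bundle.write(fragment.read())
    return fragments
```

Like cat, this adds nothing between fragments, so each rule file should end with a newline.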
Depending on what directives are used in the configuration, you may have to enable Apache modules. For example, I had to enable the “headers” module:
sudo a2enmod headers
But since I do not use the rewrite module, I disabled it.
sudo a2dismod rewrite
9.1 content types
In order to enable download of static files with the correct Content-Type header, the server has to know which files are which type. This is usually determined by the file’s extension. Apache already knows about the most common types, so not much custom configuration is needed.
For XSLT, though, Apache defaults to application/xslt+xml. I don’t know what the “official” value should be. Many people cite RFC 3023[8] as the source of application/xslt+xml, but there was no such MIME type at that time. Meanwhile, see issue 231395 at Bugzilla[9]; it was necessary at one point to use text/xsl for IE, and even now (in 2016) Chrome complains, saying
Resource interpreted as Stylesheet but transferred with MIME type application/xslt+xml
As of 2014, a StackOverflow answer to the question is barely consistent with itself.
So I use text/xml as the most widely accepted value.
AddType application/xml .xsl
UPDATE: Actually Chrome complains about both text/xml and application/xml. It seems the only thing it doesn’t complain about is text/xsl. But text/xsl doesn’t work at all in Firefox. I don’t have IE handy for testing. At any rate, I’m going to have to ignore the Chrome warnings.
Following are some of the file types used by willshake, along with what I last believed to be their correct MIME types.
extension | MIME type |
woff | application/font-woff |
ogg | audio/ogg |
json | application/json |
svg | image/svg+xml |
map | application/json |
map is for JavaScript source maps, which are only needed for development, but are benign in production.
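If you’re serving files from something other than Apache, the same table can be expressed with Python’s standard mimetypes module. This is a sketch of the lookup only; the dictionary simply restates the table above, and the helper names are mine.

```python
import mimetypes

# Extension-to-type pairs restated from the table above.
WILLSHAKE_TYPES = {
    ".woff": "application/font-woff",
    ".ogg": "audio/ogg",
    ".json": "application/json",
    ".svg": "image/svg+xml",
    ".map": "application/json",
}


def build_type_table(overrides=WILLSHAKE_TYPES):
    """Return a private MimeTypes table with the overrides applied, so the
    process-wide defaults are left alone."""
    table = mimetypes.MimeTypes()
    for extension, mime_type in overrides.items():
        table.add_type(mime_type, extension)
    return table


def content_type_for(filename, table=None):
    """Look up the Content-Type for a file name by its extension."""
    table = table or build_type_table()
    mime_type, _encoding = table.guess_type(filename)
    return mime_type
```

Anything not in the table falls through to the module’s built-in defaults, much as Apache falls back on its own list.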
9.1.1 TODO Apache mismatches
Apache mostly matches the above table (which was based on my earlier research).
- woff
- is being returned as application/x-font-woff
- map
- is being returned as application/javascript
- vtt
- not sure because I don’t have any right now (and won’t matter because I don’t plan to start using them again).
9.1.2 charset (encoding)
This tells Apache to send the necessary charset declaration in the Content-Type headers.
AddDefaultCharset utf-8
It somehow “does the right thing,” which is to say, includes the declaration only on HTML responses.
9.2 SVG objects crash IE
This doesn’t really belong here, but where would I put it? Diagrams?
Loading SVG, at least with the <object> tag, can cause IE to crash. I have a 100% reproducible case right now: go to the about page (which has the program graph), then go back to the home page. Crash. And crash again, when you try to reload the page. Every time.
This can’t be prevented with script, because when a “bad” page loads, it’s already too late. And IE stopped recognizing conditional comments with version 10, so there’s no declarative way on the client side to prevent the SVG from being loaded.
I can’t have willshake crashing people’s browsers. That’s unpleasant. So I’m just not going to serve SVG images to IE (except the logo, which is loaded as a background style and works fine).
<If "%{HTTP_USER_AGENT} =~ /Trident|MSIE/">
RedirectMatch 415 static/images/[^/]+\.svg$
</If>
What’s the appropriate HTTP status code for this situation? I don’t know. I’m using 415 Unsupported Media Type, even though that suggests that the server doesn’t support the media type. Well. It depends who’s asking.
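The rule is small enough to restate in Python, which is handy for checking which requests it would catch. The two patterns are copied from the Apache configuration above; the function name and the example user-agent strings are only illustrations.

```python
import re

# Same patterns as the Apache rule above.
LEGACY_IE = re.compile(r"Trident|MSIE")
SVG_IMAGE = re.compile(r"static/images/[^/]+\.svg$")


def refusal_status(user_agent, request_path):
    """Return 415 if this request would be refused, or None if it is served
    normally."""
    if LEGACY_IE.search(user_agent) and SVG_IMAGE.search(request_path):
        return 415
    return None


# Example user-agent strings (illustrative, not an exhaustive list).
IE11 = "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"

print(refusal_status(IE11, "/static/images/graph.svg"))     # 415
print(refusal_status(FIREFOX, "/static/images/graph.svg"))  # None
```

Note that the path pattern only matches SVG files directly inside static/images, not in subfolders.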
10 roadmap
10.1 custom error pages
Right now, willshake doesn’t have any custom error handling. At the very least, there should be a friendly 404 page.
Footnotes:
1. This is the epigraph to “Service-Disoriented Architecture” (http://bravenewgeek.com/service-disoriented-architecture/). I’m not the only one looking for a source for this quote (https://twitter.com/derekcollison/status/606956724071899137). We’re talking about this oft-cited Paul Barham: https://scholar.google.com/citations?user=q34R9psAAAAJ&hl=en
2. From IIS, if you must know.
3. A static file server is just a program that turns a directory into a web site by answering HTTP requests with files.
4. P.J. Eby, “PEP 3333 – Python Web Server Gateway Interface v1.0.1”, a “Python Enhancement Proposal” created 2010. PEP 3333 is the Python 3 update of PEP 333, which was posted in 2003 and oddly enough has nothing to do with Python 3.
5. Graham Dumpleton, “Introducing mod_wsgi-express”, Thursday, April 2, 2015
6. See the note in WSGI integration about special measures taken to ensure support for this.
7. Graham Dumpleton, “Why are you using embedded mode of mod_wsgi?”, October 13, 2012. See also the March 2009 discussion “Should embedded mode be disabled by default in mod_wsgi 3.0?”, on the modwsgi Google Group.