the web server

You can have a second computer once you’ve shown you know how to use the first one.

—Paul Barham¹

This document tells how a web server is used to provide willshake.net to people over the internet.

1 motivation

A web server is a program that handles the business end of an HTTP conversation; the other end is the client.

In one respect, a web server is a way to deliver a big thing in little pieces. If the mountain won’t come to us, then we must go to the mountain.

So, one needs a web server to reach people.

The particulars of the web server are not important. I don’t care which web server is used. I’ve changed it before, and I fully expect to change it again.²

The objectives here are to

set up and maintain a production server
establish a lightweight workflow for development
support the modification of server configuration by other features

I’ve already talked some about the web as a platform, which is really focused on what you can do with the web now—or at least, the subset that willshake takes part in. In that sense, the “platform” is the part closest to the user.

This is more about how things are done, which is further from the user. And yet, the thing that makes the web work, a protocol called “HTTP”, is somewhat user-facing. Surely you’ve heard of HTTP? It’s that little incantation at the beginning of all web addresses.

What problem did HTTP solve?

In 1990, although “personal computing” was becoming common, most people still used computers through institutions—particularly government agencies and universities. Those institutions would have a central computer (a mainframe), which was connected to a number of “dumb terminals.” Any of the users could sit at any of the dumb terminals, log into their account, and access their files.

Such systems conventionally provided each user a “home directory,” where personal files could be stored.

So the Globe Theater Company’s computer would have accounts for the company’s managers (which included Shakespeare) and maybe the actors as well. Shakespeare’s “home directory” would probably be called ~wshaksper.

But what if Mr. Shaksper wanted to access his files from his home in Stratford? The “dumb terminals” were all directly wired to the Globe’s mainframe. But there was a thing called “the Internet,” which connected a growing number of computers all across the world. Essentially, HTTP was just a way of publishing documents so that anyone on the internet could access your home directory (or some directory on your system).

Insert stuff about a global addressing scheme.

A web server, then, was just an Internet application that listened for requests and responded with files.

As you probably know, that’s not the end of the story. Many web requests are still served from files.

But at some point, someone realized that the server didn’t have to return files from the system. It could in fact do whatever it wanted to. It could just make something up and return that.

(To be continued…)

On the division between “filesystem” and “webspace”, see https://httpd.apache.org/docs/2.4/sections.html#file-and-web

The filesystem is the view of your disks as seen by your operating system.

In contrast, the webspace is the view of your site as delivered by the web server and seen by the client… The webspace need not map directly to the filesystem, since webpages may be generated dynamically from databases or other locations.

It’s this division that made “Web 2.0” possible. Web servers were able to become something fundamentally different than mere file servers—with no change to the client. In other words, a browser made in 1993, before the existence of “dynamic” webpages, could be used in 199x when many servers were serving dynamic pages, and you wouldn’t know the difference.

3 scenarios

This is about modes of operation.

There are several ways that you can serve willshake over the web. All of them have been used at some point.

static files only: this is the most robust way to serve a web site, and it works, as long as you’re able to render all of the possible pages in the site. In fact, it’s a motivator to avoid any “unbounded vector” of pages (like search queries, or combinatorial paths) for as long as possible. It’s easy to set up and uses minimal compute and memory resources. But it requires a lengthier preprocess and requires the transfer and storage of many more bytes.
In any case, you always have to serve some static files, so this is a baseline level.
dynamic pages: by running programs on the server, you can create pages on-demand. There are many advantages to this: you can ship a very small amount of code to serve a large number of pages, and you can handle sets of pages that can’t be enumerated ahead of time (such as the set of all search queries). The downside is that it’s more complicated to set up, it uses more compute and memory resources, and there are more ways for things to go wrong.
hybrid: A hybrid setup is just a dynamic site with some or all of its pages pre-rendered and served as static files. This would allow you to handle the most common pages with little overhead, while still being able to deal with the “long tail” of pages that you can’t or don’t want to pre-render.
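As a rough sketch of the hybrid idea (not willshake's actual code), the decision can be written as one function: return the pre-rendered file when there is one, and fall back to a renderer otherwise. The names serve_hybrid and render are made up for illustration, and the layout of the static directory is an assumption.

```python
from pathlib import Path
from typing import Callable


def serve_hybrid(static_root: Path, url_path: str,
                 render: Callable[[str], bytes]) -> bytes:
    """Return a pre-rendered file if one exists, else render on demand."""
    root = static_root.resolve()
    # Map the URL path onto the static directory (hypothetical layout).
    candidate = (root / url_path.lstrip("/")).resolve()
    # Ignore paths that escape the static root, such as "/../secret".
    inside = candidate == root or root in candidate.parents
    if inside and candidate.is_file():
        return candidate.read_bytes()  # static: no program has to run
    return render(url_path)            # dynamic: the "long tail"
```

Common pages cost one file read, and anything that was never pre-rendered still gets an answer from the renderer.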

4 server-free

This is really about an alternative to HTTP called the file protocol. The bottom line is that you can’t run willshake at all using the file protocol, although the barrier to doing it is fairly thin.

You can almost run willshake without a server. Almost.

Why can’t you? Some web sites can work over the local file system. Documentation is commonly shipped this way. You just open an .html file, and you can navigate a whole collection of documents. Willshake goes out of its way to be deployable as static files, so you might expect this to work.

What happens if you just open up the site/index.html in a browser? You’ll see willshake’s “home page,” but only as a plain HTML document. It will look “plain” because the stylesheets won’t be located (since they use rooted paths). And even though willshake can function as a “single-page app,” it can only do that if the scripts are loaded, which will fail for the same reason as the stylesheets. And the links won’t work, for the same reason. In short, willshake can’t be run over the file:// protocol, because it uses rooted paths all over the place. It expects to be served from the root of a domain, and right now that’s a hard requirement.
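The effect of rooted paths can be seen with Python's standard URL resolver, which follows the same rules a browser does. The directory and stylesheet names below are invented for illustration; only the leading slash matters.

```python
from urllib.parse import urljoin

# Hypothetical locations; the real directory and file names may differ.
page_on_server = "http://localhost:8000/index.html"
page_on_disk = "file:///home/me/willshake/site/index.html"

rooted = "/static/style.css"     # leading "/": resolved from the root
relative = "static/style.css"    # no leading "/": resolved from the page

print(urljoin(page_on_server, rooted))
# http://localhost:8000/static/style.css
print(urljoin(page_on_disk, rooted))
# file:///static/style.css  (the root of the disk, not of the site)
print(urljoin(page_on_disk, relative))
# file:///home/me/willshake/site/static/style.css
```

Over HTTP the root is the root of the domain, so the rooted path lands inside the site. Over file:// the root is the root of the filesystem, where nothing of willshake lives.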

So I only mention this for completeness, since this would be the simplest scenario. There is currently no “server-free” willshake, thanks to those rooted paths. Sure, that could probably be “fixed” without too much trouble, but I can’t say what the benefit would be, other than to remove that Cleopatric “almost.”

5 a static file server

The simplest way to run willshake is with a static file server.³ Any one will do.

Once the site directory is built, you can use Python’s built-in server, for example.

sh

cd site
python3 -m http.server

Uh, right, except it’s not, because of the “coherence.” That breaks this case.

That serves on port 8000 by default, so you should be able to browse to http://localhost:8000 and see the site.

Because willshake works as a “single-page app,” you can use the whole thing that way. It looks exactly like the site online, only without any network latency.

But there’s a catch. You can only enter through the entryway (i.e., the home page). That’s the “single page.” You can’t go directly to an inner page. For example, if you wanted to go directly to the second scene of Hamlet, which would be at http://localhost:8000/plays/Ham/1.2, it won’t know what you’re talking about, because that’s not a file. You can get to it, by starting at the beginning and following links, but you can’t “drop in” there. Same if you hit “refresh” on any (inner) page.

That may sound like a showstopper, but in fact that’s exactly how the app works. You always begin at the beginning. I hear it’s a very good place to start.

5.1 other possibilities

Actually, you can still serve the site (if not “run” it) with just a static file server, by creating all of the pages in advance, and serving them as static files. That’s exactly what’s done with the current production site, and it’s covered in the section “pre-rendering the site” of “web performance.”

One final note before I let go of the static server. You can still support inner pages without server-side rendering, with a couple of very small modifications. First, you’d need to direct all inner requests back to the home page, but with the original request attached (say, as a hash or a query). Then you’d need a startup script to go to that address, which you can do using getflow.go(to_path). I haven’t done this, though.
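The server half of that idea might look something like the following, built on Python's own static server. This is an untested-in-production sketch: the class and function names are made up, the choice of a 302 redirect and a hash is an assumption, and the client-side startup script is not shown.

```python
import functools
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote, urlsplit


class EntrywayRedirectHandler(SimpleHTTPRequestHandler):
    """Serve real files; bounce everything else to the home page."""

    def do_GET(self):
        url_path = urlsplit(self.path).path
        # translate_path maps the URL onto this handler's directory.
        if url_path == "/" or os.path.exists(self.translate_path(url_path)):
            return super().do_GET()
        # Not a file: send the client to "/" with the request in the hash.
        self.send_response(302)
        self.send_header("Location", "/#" + quote(url_path))
        self.send_header("Content-Length", "0")
        self.end_headers()


def make_server(directory, port=0):
    """Build (but do not start) a server for the given site directory."""
    handler = functools.partial(EntrywayRedirectHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server("site", 8000).serve_forever() would serve the site directory, and a request for /plays/Ham/1.2 would be answered with a redirect to /#/plays/Ham/1.2.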

6 dynamic

At some point, though, static files aren’t going to cut it. The whole idea of a “dynamic” web site is that you can go directly to /plays/Ham/1.2—or any other page within the site—and the server will create that page on-demand. This is often called “server-side rendering.”

Willshake was made with this in mind. It uses a purpose-built framework called getflow that makes both the client and server modes first-class. You can read more about it there.

Running willshake is just a matter of running getflow. Getflow is very small, and most of what it does is purely a function of XML. Getflow is 99% agnostic of anything that has to do with serving web pages. But at some point, it has to serve web pages.

7 WSGI

In the Python world, WSGI (Web Server Gateway Interface) is the go-to option for talking to web servers. Or at least, that was the intention when it was proposed in 2003. The problem then was that, although there was an abundance of frameworks, they were generally married to a specific server. That was seen as limiting the freedom of developers, who don’t like the idea of being stuck with something forever—especially when that thing is just the baggage of the thing they wanted in the first place. The idea of WSGI was that, if everyone—both servers and frameworks—got on board and followed this specification, then people could mix-and-match servers and frameworks, and everyone would be happy. The plan succeeded.⁴
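For a sense of how little the specification asks of a framework, here is a minimal WSGI application. It is not getflow's integration, just an illustration of the interface from PEP 3333: a callable that takes the request environment and a start_response function, and returns the body as an iterable of bytes. The call helper is a made-up convenience for trying it without a server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A do-almost-nothing WSGI app: report the requested path."""
    path = environ.get("PATH_INFO", "/")
    body = f"You asked for {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(app, path):
    """Invoke a WSGI app directly, the way a server would."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the other required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call(application, "/plays/Ham/1.2"))
# ('200 OK', b'You asked for /plays/Ham/1.2\n')
```

The name application matters to mod_wsgi, which by default looks for a callable of that name in the script file.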

One of WSGI’s design goals was that it “must be easy to implement” for both servers and frameworks. Since getflow is kind of a do-nothing framework (as far as HTTP is concerned), it’s a pretty good marriage: getflow’s WSGI integration is about thirty lines. The whole thing—including getflow itself—is bundled into one file. Deploying it is just a matter of copying to the server.

tup

: $(ROOT)/getflow/<wsgi> \
|> ^ ship %b ^ cp %<wsgi> %o \
|> $(ROOT)/server/getflow/getflow.wsgi

But that’s only one side of it. WSGI has to plug into a server, too. Which one? Well, any one that supports WSGI, which is pretty much all of them. That’s the cost of freedom. You’re not finished making choices yet.

Well, in the Linux world, Apache is the go-to web server. And while I’ve tried to make willshake creative and original in some ways, this won’t be one of them.

But the choices just don’t stop. And for connecting WSGI to Apache, the choice is not 100% go-to.

You have

1. mod_wsgi, an Apache module
2. mod_wsgi-express, a simpler version of #1.

Both were created by Graham Dumpleton, and they’re both ultimately the same module. The “express” extension is supposed to be easier to use, and it is. Graham Dumpleton himself makes the plain-old mod_wsgi setup sound rather terrifying, and frankly sounds ready to deprecate it in favor of express.⁵

I use both methods. The first one is covered here. The second one is covered in “the development server.”

To install mod_wsgi for the first method, just

sh

sudo apt-get install libapache2-mod-wsgi

It’s very quick.

7.1 configuration

The absolute bare minimum way to use mod_wsgi is the WSGIScriptAlias directive. For willshake, that would theoretically mean mapping the whole site (/) to getflow.

conf

WSGIScriptAlias / ${SITE_ROOT}/server/getflow/getflow.wsgi

But that is no good! That would indiscriminately send everything to getflow. And this web site, as discussed earlier, is a mix of static and dynamic files.

But I have no idea—or rather, I want to have no idea—which locations are dynamic and which ones are static. So instead of mapping getflow to specific locations, I’ll just map it to a handler.

wsgi configuration
conf

WSGIHandlerScript getflow ${SITE_ROOT}/server/getflow/getflow.wsgi

That doesn’t actually give getflow any work to do. It just makes a handler named getflow available. And unlike the WSGIScriptAlias approach, Apache lets you set handlers conditionally.

wsgi configuration
conf

# Use getflow if the file doesn't exist
<If "! -f %{REQUEST_FILENAME}">
SetHandler getflow
</If>

So actual files will be handled exactly as before, without getflow ever knowing about it. Everything else is getflow’s problem.⁶

If you try that, you may get a warning about running in “embedded mode.” The alternative to embedded mode is “daemon mode.” This has to do with whether the process running your application is owned by Apache, or by a separate “daemon.” It’s an easy choice, since the creator of mod_wsgi recommends against embedded mode so strongly that it’s not clear it will remain a supported feature. I have no reason to object.⁷

wsgi configuration
conf

WSGIDaemonProcess ${DOMAIN} home='${DOCUMENT_ROOT}'
WSGIProcessGroup ${DOMAIN}
WSGIApplicationGroup %{GLOBAL}

There are a bunch of other options for WSGIDaemonProcess, but inasmuch as I don’t know anything about them, I leave them unset for now. These are settings from the configuration that’s generated by mod_wsgi-express.

conf

processes=1 \
threads=5 \
display-name=%{GROUP} \
user='gavin' \
group='gavin' \
maximum-requests=0 \

It would be worth comparing the generated config from live to the generated config from local. It includes a lot that I’m pretty sure doesn’t apply to me. But one setting can make a big difference.

That’ll do it. But I wrap the whole thing inside of a conditional. That way, if you want to turn off WSGI for some reason, you can just Define NO_WSGI in the site’s main configuration. You’d have to do that yourself, since it goes into the protected Apache configuration area.

server/httpd.conf.d/wsgi.conf
conf

<IfDefine !NO_WSGI>
  <<wsgi configuration>>
</IfDefine>

I had to add this (to the protected config) to get it working on local. I expect that the path would be different on remote, if it’s necessary at all.

conf

LoadModule wsgi_module '/usr/local/lib/python2.7/dist-packages/mod_wsgi-4.4.22-py2.7-linux-x86_64.egg/mod_wsgi/server/mod_wsgi-py27.so'

You only have to add this once per server (I think). That is (I think) adding this for one site will mean it’s loaded for another.

8 Apache

To use Apache, you have to of course… have Apache. I’m not going to talk about installing Apache because there are many good ways to do it, which change over time and vary from system to system. If you can do any of this, you can figure out how to install Apache.

The end result will be some configurations.

In order to work in different environments (development, staging, production), a few variables are assumed by the “common” configuration files. SITE_ROOT and DOMAIN are set in the protected configuration.

server-setup/willshake.net.conf
conf

Define DOMAIN willshake.net
Define SITE_ROOT /var/www/${DOMAIN}

These are the parts that may differ from one web site to the next. I try to keep that to a minimum because it’s the part that has to be manually maintained.

Once you have Apache, you can create a web site by adding configuration files. Usually, Apache configuration will be under /etc/apache2. To create a site, you put a conf file into the sites-available folder, then enable the site. I can’t do either of those things for you through the build process, since they require root access and are outside of the folder. But this is what it would look like.

server-setup/willshake.net.conf
conf

Include ${SITE_ROOT}/server/static.conf

That would go into /etc/apache2/sites-available. And you can enable the site using the Apache command a2ensite.

server/enable-willshake.net
sh

sudo a2ensite willshake.net

Of course, then you’ll have to either start or restart Apache.

sh

sudo service apache2 restart

Now, that “configuration” really doesn’t do anything but refer to another configuration file. That’s because I’m trying to minimize the amount of server-specific configuration and out-of-band steps in deployment. By pointing to a configuration file that’s controlled by a build (or deployment) process, most changes can still be automatic.

Thanks to the variables defined in the referring configuration, this can be written without any knowledge of the specific location of the site.

server/host.conf
conf

<VirtualHost *:80>
  <<essential configuration>>
  Include ${SITE_ROOT}/server/httpd.conf
</VirtualHost>

I know, I know, another layer of indirection. Why?! I promise, it’ll be clear one day. (The real reason is that I’ll have to create another VirtualHost to support https, and I don’t want to have to repeat all that configuration).

Further to the “environment variables” set above, DOCUMENT_ROOT is set here, based on SITE_ROOT.

Don’t confuse this with the server variable %{DOCUMENT_ROOT}, which Apache makes available to mod_rewrite and to expressions. That one uses the %{...} syntax and is not available through ${...} substitution in ordinary directives.

essential configuration
conf

Define DOCUMENT_ROOT ${SITE_ROOT}/public

# This is the only thing that's always needed.
DocumentRoot ${DOCUMENT_ROOT}

# If you're going to run multiple sites from the same server, you need to bind
# this virtual host with a name.
ServerName ${DOMAIN}

Any additional configuration is considered boilerplate. In fact, even ServerName may be considered boilerplate, for reasons I won’t say right now.
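To see what the Define variables expand to, the substitution can be imitated in Python, whose string.Template happens to use the same ${NAME} syntax. This only mimics the expansion for illustration; Apache does the real work with its own parser.

```python
from string import Template

# Values mirroring the Define lines in this document.
defines = {"DOMAIN": "willshake.net"}
defines["SITE_ROOT"] = Template("/var/www/${DOMAIN}").substitute(defines)
defines["DOCUMENT_ROOT"] = Template("${SITE_ROOT}/public").substitute(defines)

include_line = Template(
    "Include ${SITE_ROOT}/server/httpd.conf").substitute(defines)

print(defines["SITE_ROOT"])      # /var/www/willshake.net
print(defines["DOCUMENT_ROOT"])  # /var/www/willshake.net/public
print(include_line)              # Include /var/www/willshake.net/server/httpd.conf
```

Changing DOMAIN alone is enough to relocate every derived path, which is why only that part needs manual maintenance.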

Turning this off too. In practice, I have not actually used logs. Why would I? It’s just more work. And this doesn’t work with express. I could do it conditionally if the directory exists, I guess. Or set a specific flag for it. Or just get rid of it.

So I’ll even use these little snippets for basic stuff like logging.

server/httpd.conf.d/logging
conf

#ErrorLog ${SITE_ROOT}/log/error.log
#LogLevel info

This requires write permission on the directory.

9 support more configuration

You might have all sorts of reasons for modifying Apache configuration.

As for that Apache configuration, we’ll use the same “additive” approach that we use elsewhere. That is, we’ll set a location where you can write configuration rules, then we’ll bundle them together into the file that gets used.

tup

: $(ROOT)/server/httpd.conf.d/<all> \
|> ^o bundle Apache configuration ^ \
    cat `echo %<all>` > %o \
|> $(ROOT)/server/httpd.conf

The order of the rules should not matter.
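For readers who don't use tup, the bundling step amounts to concatenating a directory of fragments into one file. The sketch below approximates it in Python under two assumptions that differ from the rule above: it reads whatever files are in the directory rather than a tup group, and it adds a newline after any fragment that lacks one.

```python
from pathlib import Path


def bundle(fragment_dir: Path, output: Path) -> None:
    """Concatenate every fragment in fragment_dir into one file."""
    # Sorted only so that rebuilds are reproducible; the rules themselves
    # are assumed not to depend on order.
    fragments = sorted(p for p in fragment_dir.iterdir() if p.is_file())
    with output.open("wb") as out:
        for fragment in fragments:
            data = fragment.read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep directives on separate lines
```

A call such as bundle(Path("server/httpd.conf.d"), Path("server/httpd.conf")) would produce the file that the virtual host includes.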

Depending on what directives are used in the configuration, you may have to enable Apache modules. For example, I had to enable the “headers” module:

sh

sudo a2enmod headers

But since I do not use the rewrite module, I disabled it.

sh

sudo a2dismod rewrite

9.1 content types

In order to enable download of static files with the correct Content-Type header, the server has to know which files are which type. This is usually determined by the file’s extension. Apache already knows about the most common types, so not much custom configuration is needed.

For XSLT, though, Apache defaults to application/xslt+xml. I don’t know what the “official” value should be. Many people cite RFC 3023⁸ as the source of application/xslt+xml, but there was no such MIME type at that time. Meanwhile, see issue 231395 at Bugzilla⁹; it was necessary at one point to use text/xsl for IE, and even now (in 2016) Chrome complains, saying

Resource interpreted as Stylesheet but transferred with MIME type application/xslt+xml

As of 2014, a StackOverflow answer to the question is barely consistent with itself.

So I use application/xml as the most widely accepted value.

server/httpd.conf.d/MIME types
conf

AddType application/xml .xsl

UPDATE: Actually Chrome complains about both text/xml and application/xml. It seems the only thing it doesn’t complain about is text/xsl. But text/xsl doesn’t work at all in Firefox. I don’t have IE handy for testing. At any rate, I’m going to have to ignore the Chrome warnings.

Following are some of the file types used by willshake, along with what I last believed to be their correct MIME types.

extension	MIME type
woff	application/font-woff
ogg	audio/ogg
json	application/json
svg	image/svg+xml
map	application/json

map is for JavaScript source maps, which are only needed for development, but are benign in production.
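The extension-to-type lookup that the server performs can be imitated with Python's mimetypes module. This is only an illustration of the table above, not how Apache is configured; the content_type helper and the file names in the usage lines are made up.

```python
import mimetypes

# A private table, so the process-wide defaults are left alone.
types = mimetypes.MimeTypes()
for extension, mime_type in {
    ".woff": "application/font-woff",
    ".ogg": "audio/ogg",
    ".json": "application/json",
    ".svg": "image/svg+xml",
    ".map": "application/json",
}.items():
    types.add_type(mime_type, extension)


def content_type(filename: str) -> str:
    """Pick a Content-Type from the file extension alone."""
    guessed, _encoding = types.guess_type(filename)
    return guessed or "application/octet-stream"


print(content_type("fonts/body.woff"))  # application/font-woff
print(content_type("app.js.map"))       # application/json
```

Only the last extension counts, so a source map named app.js.map is typed by its .map suffix.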

9.1.1 TODO Apache mismatches

Apache mostly matches the above table (which was based on my earlier research).

woff: is being returned as application/x-font-woff
map: is being returned as application/javascript
vtt: not sure because I don’t have any right now (and won’t matter because I don’t plan to start using them again).

9.1.2 charset (encoding)

This tells Apache to send the necessary charset declaration in the Content-Type headers.

server/httpd.conf.d/encoding
conf

AddDefaultCharset utf-8

It “does the right thing,” which is to say, it adds the declaration only to responses whose content type is text/plain or text/html.

9.2 SVG objects crash IE

This doesn’t really belong here, but where would I put it? Diagrams?

Loading SVG, at least with the <object> tag, can cause IE to crash. I have a 100% reproducible case right now: go to the about page (which has the program graph), then go back to the home page. Crash. And crash again, when you try to reload the page. Every time.

This can’t be prevented with script, because when a “bad” page loads, it’s already too late. And IE stopped recognizing conditional comments with version 10, so there’s no declarative way on the client side to prevent the SVG from being loaded.

I can’t have willshake crashing people’s browsers. That’s unpleasant. So I’m just not going to serve SVG images to IE (except the logo, which is loaded as a background style and works fine).

server/httpd.conf.d/IE-SVG
conf

<If "%{HTTP_USER_AGENT} =~ /Trident|MSIE/">
RedirectMatch 415 static/images/[^/]+\.svg$
</If>

What’s the appropriate HTTP status code for this situation? I don’t know. I’m using 415 Unsupported Media Type, even though that suggests that the server doesn’t support the media type. Well. It depends who’s asking.
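The rule is easier to check outside of Apache. Here is the same pair of patterns in Python, as a sketch of the logic only: the status_for function is hypothetical, and returning 200 for everything else is a simplification.

```python
import re

# The same patterns as in the Apache rule above.
IE_PATTERN = re.compile(r"Trident|MSIE")
SVG_IMAGE_PATTERN = re.compile(r"static/images/[^/]+\.svg$")


def status_for(user_agent: str, url_path: str) -> int:
    """Return 415 when IE asks for an SVG image, else 200."""
    if IE_PATTERN.search(user_agent) and SVG_IMAGE_PATTERN.search(url_path):
        return 415
    return 200


ie11 = "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
firefox = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"

# The image name is made up for illustration.
print(status_for(ie11, "/static/images/graph.svg"))     # 415
print(status_for(firefox, "/static/images/graph.svg"))  # 200
```

Note that IE 11 no longer says “MSIE” in its user agent string, which is why the pattern also looks for “Trident.”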

¹

We’re talking about this oft-cited Paul Barham: https://scholar.google.com/citations?user=q34R9psAAAAJ&hl=en

²

From IIS, if you must know.

³

A static file server is just a program that turns a directory into a web site by answering HTTP requests with files.

⁴

P.J. Eby, “PEP 3333 – Python Web Server Gateway Interface v1.0.1”, a “Python Enhancement Proposal” created 2010. PEP 3333 is the Python 3 update of PEP 333, which was posted in 2003 and oddly enough has nothing to do with Python 3.

⁵

Graham Dumpleton, “Introducing mod_wsgi-express”, Thursday, April 2, 2015

⁶

See the note in WSGI integration about special measures taken to ensure support for this.

⁷

Graham Dumpleton, “Why are you using embedded mode of mod_wsgi?”, October 13, 2012. See also the March 2009 discussion “Should embedded mode be disabled by default in mod_wsgi 3.0?”, on the modwsgi Google Group.

⁸

http://www.ietf.org/rfc/rfc3023.txt

⁹

https://bugzilla.mozilla.org/show_bug.cgi?id=231395