web performance
This document covers some measures that are targeted specifically at making
willshake.net
faster.
1 speed
Speed (also known as performance) is all about not being annoying. In relationships (at least client-server relationships), there are several effective ways to not be annoying, and they all involve not asking for something over and over.
Persistence is about not asking for the same connection over and over.
Caching is about not asking for the same resources over and over.
Compression is about not asking for the same byte sequences over and over.
Fortunately, all of these techniques are built into the platform. All you have to do is ask for them (over and over).
1.1 persistent connections
Persistent connections are about “staying on the phone” with the web server when you know that you’ll be asking for a lot of things, instead of making a separate call for each one—which would be super annoying to everyone.
Consider an image gallery. It has lots of images on it. Those images are not actually contained by the gallery page, they are referenced. In order to get all of those little pictures, the browser has to request each one separately. Probably, all of those images come from the same server that was just contacted to get the gallery page itself.
Fortunately, HTTP 1.1 specifies a way to “stay on the line” with the server, using what’s called a “persistent connection.”1
Apache supports keep-alive connections, which you can enable through the KeepAlive and KeepAliveTimeout directives.2
KeepAlive On
KeepAliveTimeout 100
That handles (as I understand it) the ugly business of actually persisting the connections, but it doesn’t actually send the header indicating that persistent connections are available. For that, the header can be set explicitly:
Header set Connection keep-alive
I’ve sometimes seen two values get sent (as in keep-alive, Keep-Alive), so I suspect that for some requests this is getting added elsewhere.
1.2 compression
Compression helps speed up the overall use of a web site, on the assumption that network transmission is slower than computation. We use it.
In Apache, mod_deflate provides on-demand gzip compression for responses. It’s enabled on the basis of MIME type.
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE text/javascript
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE image/svg+xml
This assumes that the deflate module is already loaded.
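To get a feel for the savings, you can compare raw and gzipped sizes of some repetitive text locally. A rough sketch (the sample text is arbitrary; real HTML and CSS compress similarly well):

```shell
# Repetitive text (like HTML or CSS) compresses dramatically.
sample=$(printf 'AddOutputFilterByType DEFLATE text/html\n%.0s' 1 2 3 4 5 6 7 8 9 10)
raw=$(printf '%s' "$sample" | wc -c)
zipped=$(printf '%s' "$sample" | gzip -c | wc -c)
echo "raw: $raw bytes, gzipped: $zipped bytes"
```

The gzipped size should come out a small fraction of the raw size, which is the whole bet behind enabling compression: network transmission is slower than computation.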
On the difference between text/xml and application/xml,
If an XML document – that is, the unprocessed, source XML document – is readable by casual users, text/xml is preferable to application/xml… Application/xml is preferable when the XML MIME entity is unreadable by casual users.3
Based on this, I think that application/xml, which is the default I’m getting from Apache, is correct.
1.3 caching
There’s an old saying in the computer world:
There are only two hard things in computer science: cache invalidation and naming things.
—Phil Karlton4
The insight is that invalidation—knowing when the cache is out-of-date—is the hard part. Caching is easy. Sure, cache everything. But how long?
—Hey willshake.net
, it’s me again!
—Yeah, what do you want?
—That hysterical scene from Much Ado About Nothing.
— Again? I just gave you that.
—Yeah but the user loves it.
—Look, Firefox, I’m kinda busy. And most of this stuff hasn’t changed in 400 years. Could you just keep copies of things instead of asking me over and over?
—Okay. For how long?
—Um… forever?
—Forever? Okay. Got it.
Time passes. Things change for willshake.net—change dramatically. But Firefox never calls again, and never knows. Finis.
What happens? We update things. The thing called willshake.net is not the same as it was yesterday, thanks to a phenomenon usually called “progress.”
If the user’s browser keeps a copy forever, then the user will never see any progress, and will assume you and your web site are derelict. Or, the browser will combine incompatible versions of old and new resources, breaking the whole thing and leading the user to the same sad conclusion.
What should happen instead? A cache should be evicted when it’s out of date (or “stale”).5
The HTTP specification describes two main caching strategies: revalidation and expiration.6
With expiration caching, resources are fresh for a pre-determined time period. The initial response includes a description of the (relative or absolute) time period. Until the cached version “expires,” the browser won’t request it at all, no matter how many times the user wants it. This is of course the fastest possible result (short of not needing the thing in the first place). But it requires you to tell in advance when you expect the resource to change. This is usually not possible to do exactly, so tends to be used with approximations. For example, if something is updated about once a day, you can use a 24-hour cache expiration. This may be a good balance, but there are still two downsides. The user won’t see your update until that 24-hour period is over, even if you just posted a change. And if you don’t post any changes for a while, the daily user will still re-check every day.
With revalidation, you still ask for the resource every time you want it, but using a “conditional request” that includes some way of identifying the version you have a copy of. If the server determines that your version is current enough, it can respond with a 304 Not Modified, which is much cheaper than re-sending the whole thing.
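The shape of that exchange can be sketched in shell, with sha1sum standing in for whatever validator scheme the server actually uses (the file contents and the 8-character tag here are made up):

```shell
# Sketch of a conditional GET: the server compares the validator the client
# sent (If-None-Match) against the resource's current ETag.
resource=$(mktemp)
printf '<html>Much Ado</html>' > "$resource"
etag=$(sha1sum "$resource" | cut -c1-8)   # server's current validator
if_none_match="$etag"                     # client echoes back the ETag it has
if [ "$if_none_match" = "$etag" ]; then
  echo "304 Not Modified"                 # no body re-sent: cheap
else
  echo "200 OK"                           # full response, with a new ETag
fi
```

The 304 branch is the payoff: the request still costs a round trip, but not a transfer.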
We’re going to use both methods, sometimes on the same resource. There is no one-size-fits-all approach, even within willshake. This has to be done on a per-type basis.
The following sections cover the different caching strategies in increasing order of unstraightforwardness.
1.3.1 default
There are thousands of locations in willshake.net. Not every one will be covered by a specific caching strategy. I don’t want to leave those uncovered resources out in the cold, so I’ll start with a couple of general rules.
First of all, nothing in the site is un-cacheable. Everything can at least use revalidation.
Header set Cache-Control "must-revalidate, public, max-age=0"
On the meaning of must-revalidate without max-age, there is some dispute.7 In practice, you can’t rely on browsers revalidating without an expiration directive, and I don’t blame them. The case can reasonably be construed as undefined. When in doubt, cache.
Um, riiiight, but how exactly would it be revalidated? You don’t have a Last-Modified date, and you don’t have an ETag… how in the world would you honor a conditional GET?
That said, this is a good default for pre-rendered pages, which are static and can support conditional GETs, and it’s a good fallback generally. As for the objection above: this policy should have no effect on generated content anyway, which is fine. But I haven’t tested it.
This will cover all generated content, which is in various locations. In willshake, only the HTML is generated. But each HTML document can be affected by changes to any number of files, including getflow itself. So it’s impractical to determine a “real” modified date. Thus, as a general matter, generated content must be revalidated.
Files are simpler. They’re extremely well suited for revalidation caching—at least if their timestamps are to be trusted. But most files are not so important that they need to be revalidated every time. Check back in a week.
<LocationMatch "^/(favicon|static/).*">
Header set Cache-Control "max-age=604800, public"
</LocationMatch>
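When one of those files is revalidated, the timestamp is the validator. A sketch of what Apache would send as the Last-Modified header, using GNU date and touch on a made-up file:

```shell
# A static file's mtime becomes its Last-Modified header (HTTP date format).
f=$(mktemp)
touch -d '2016-01-02 03:04:05 UTC' "$f"
date -u -r "$f" '+%a, %d %b %Y %H:%M:%S GMT'
# prints: Sat, 02 Jan 2016 03:04:05 GMT
```

Hence the caveat about trusting timestamps: if a deployment rewrites mtimes wholesale, every “unchanged” file suddenly looks modified.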
You might be wondering why not use something like
<If "-f %{REQUEST_FILENAME}">
to distinguish static from generated content. That won’t work, because Apache inexplicably processes If directives after Location (and LocationMatch) directives.8
I guess this is better, since it means no file system access is needed to set the header. But still.
I’m not persuaded by the above… something like this should absolutely be usable to set a header.
<If "-f %{REQUEST_FILENAME}">
Header set X-Debug-Static yes
</If>
Riiight, but this is still 24 hours.
Actually, there’s one more static file not covered above: robots.txt. When I have a problem with robots.txt, I want to correct it right away. I don’t want to wait a week. robots.txt is not requested by ordinary visitors, anyway.
<Location "/robots.txt">
Header set Cache-Control "max-age=86400, public"
</Location>
I don’t know whether search engines pay any attention to the expiration header on robots.txt. This is in case they do.
Some test cases,
test "/plays/Ham" "should be revalidate (default)"
test "/robots.txt" "should be one-day expiration (special case for robots.txt)"
test "/favicon.ico" "should be one-week expiration (special static file)"
test "/static/something/or-other" "should be one-week expiration (general static file)"
These general rules have to come first. If a resource is covered by one of the more specific policies below, it will override this.
1.3.2 cache “forever”
I said that caching things forever is a bad idea. That’s what all this is about.
But, sometimes it’s a good idea. If you really know that something isn’t going to change, then you can safely cache it “forever.”
One problem, though: there is no “forever” in HTTP. Expirations are specified in “delta seconds,” meaning the number of seconds since getting the resource. And although there is no specific limit on the value of max-age, the spec does urge people not to set expiration periods longer than one year:
To mark a response as “never expires,” an origin server sends an Expires date approximately one year from the time the response is sent. HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future.9
Fair enough. In practice, a file is very unlikely to stay in someone’s cache longer than a year, anyway.
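For reference, that one-year ceiling in delta seconds (taking a 365-day year) is just arithmetic:

```shell
# max-age is expressed in "delta seconds"; a 365-day year is the
# spec's suggested ceiling for "never expires".
year=$((365 * 24 * 60 * 60))
echo "$year"   # prints 31536000
```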
Header set Cache-Control "max-age=31536000, public"
Okay, so what gets cached “forever”?
For willshake, this includes the font files. The font files won’t change. They might be reconstituted at some point (for better efficiency, or to correct problems). In such cases, the names can be changed manually.
<Location "/static/style/fonts/">
<<cache forever>>
</Location>
For example.
test "/static/style/fonts/some-font-file" "should be forever (font)"
I don’t think this applies to anything else.
1.3.3 cache for a long time
This goes for images. It’s a tricky case: although the images seldom change, they can change for various reasons (and already have), and tracking those changes, along with all references to the images, isn’t a good tradeoff.
1.3.4 versioning
The trickiest caching strategy will apply to those files that change the most often—which are also needed the most often. For these files, the tradeoffs of simpler methods are unacceptable.
Instead, these files use a strategy called versioning. Since this does not have strictly to do with caching, it is covered in the section coherence.
1.3.5 force revalidate
Again, everything around here is cacheable. The “worst case” is that things need to be revalidated. And that is the default, specified earlier. But “static” files get a one-week expiration, on the assumption that—with the exception of the versioned files—there will be no particular harm in having stale versions.
But I use the “about” documents as a program reference, and so I want them to always be up-to-date. So I make them revalidate, just for my own purposes.
<Location "/static/doc/about/">
<<revalidate>>
</Location>
And to confirm that,
test "/static/doc/about/some-document.xml" "should be revalidate (about document)"
1.3.6 testing cache policies
Okay, let’s try it.
host="${1:-https://willshake.net}"
test() {
path="$1"
message="$2"
echo "$path : $message"
curl --head --insecure "$host$path" 2>/dev/null | grep Cache-Control
echo
}
<<caching tests>>
Except that they do now… that I’m not using mod_wsgi-express. Does this have to do with the Alias rules?
Note that the test cases don’t need to point to files that actually exist. That’s because the configuration rules refer to “webspace” (the URL seen by the user) and not the file system (seen by the server).10
2 coherence
Make it work, make it right, make it fast.11
Web sites are like mail-order furniture. They have a bunch of parts (files), and they come from somewhere far away (a server). And some poor lackey (your browser) has to put the thing together.
Sometimes, just to be nice, they send you extra parts, in case you lose or break something. If you have room, you stash them somewhere (your cache).
Where things get interesting is, suppose you really like the thing and you order another one (revisit the site).
Of course, you want it as fast as possible. And because of limited inventory (bandwidth, server capacity), the fewer parts you have to order, the better. Your browser looks around at your cache and says, Hey, I already have part ‘Z’. I’ll just order the parts we need, and it’ll get here sooner.
Meanwhile, the desk designers (the site creators), they aren’t twiddling their thumbs, either. They keep improving their designs, and the latest model always has a slightly different set of parts.
And very often the newer designs obsolete the old ones. You can’t just put something together from an incompatible mixture of parts. You’ll end up with a broken, ugly mess. But your browser’s going to try anyway. The alternative is to start over and get all new parts (refresh). That’s going to take forever.
So you can see that there’s a conflict between speed and coherence when fetching web sites that change regularly. Interestingly, the word “cohere,” which is often defined as “to stick together,” is closely related to “hesitate” and could be interpreted as “tarry together."12 Being a coherent unit means being only as fast as your slowest part.
Now, some web sites might be designed defensively, so that this mixture of old and new parts is not a problem. Willshake is not one of those sites. It’s meant to be one coherent thing. All of the files that constitute willshake’s “blueprints” must be current.13
As to the difference between “make it work” and “make it right,” there is still some debate. But it’s clear that where correctness and speed are in conflict, correctness must take precedence.
Of course, if correctness were all you cared about, then you’d just say don’t cache anything. Since that’s not an option, this is about balancing correctness with efficiency.
2.1 the trick
The problem with the previous system is that the instructions use the same name to mean different things. Part ‘Z’ in one assembly might be slightly different in the next, but it’s still called part ‘Z’ because its role in the structure is the same.
The process, then, is biased towards first-time buyers (empty cache), who have to get all the parts anyway, so will automatically get the latest (and compatible) versions.
So if you really need to get the latest part ‘Z’, what’s the fastest way? If you already have a ‘Z’ in your junk drawer (cache), you could call the vendor to check whether or not the part has been updated (must-revalidate). You’d have to tell them some identifying mark, or the date that you got it (ETag, Last-Modified), and they’d tell you whether it was current (304 Not Modified). It still takes time, but it’s much faster than the mail. If you don’t, well, you’d have to order a new one anyway, and you only spent a little time asking (conditional GET).
The trick is that each time a resource changes, it is considered a new resource. This is done simply by giving it a new name.
Versioned resources are treated as immutable, so they can be cached forever. New versions will always be fetched as soon as they are needed, since, as far as the browser is concerned, it’s never seen them before.
So yeah. I’m solving a cache invalidation problem by naming things. Take that, Phil Karlton.
Of course, for this to work, you have to be willing and able to change the name in the place where requests are made. And you have to put the name and version into a table that itself can’t use an expiration date.
I won’t always be willing and able to do that. This will go for the stylesheets, the scripts, the transforms, and some of the documents.
Anyway, let’s get to it.
Specifically, the plan is to support versioned resources so that:
- they are served with an effectively infinite cache expiration
- they are referenced under a directory based on a hash of their content (e.g. /static/style/v/8fb3aa03/some.css)
- such directories are routed to remove the /v/hash part14
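A sketch of how such a URL could be assembled, assuming an 8-character prefix of a SHA1 content hash (the file and path here are made up):

```shell
# Build a hash-versioned URL for a hypothetical stylesheet.
file=$(mktemp)
printf 'body { color: red }' > "$file"
hash=$(sha1sum "$file" | cut -c1-8)   # first 8 hex chars of the content hash
echo "/static/style/v/$hash/some.css"
```

Any edit to the file changes the hash, which changes the URL, which makes the browser treat it as a brand-new resource.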
2.2 fake immutable resources
The server’s job in this con is to accept requests like /static/style/v/ABRACADABRA-jorge-luis-borges/rules.css—which is not a file on the server—and serve whatever’s at /static/style/rules.css as if it were never going to change.
The “version” number (/v/ABRACADABRA-jorge-luis-borges) means nothing at all, and can be anything. So anything like /v/{whatever} will just be stripped out of the path.
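The stripping amounts to a single substitution, sketched here with sed standing in for the server-side rule:

```shell
# Strip the /v/{whatever}/ segment from a versioned path.
echo "/static/style/v/ABRACADABRA-jorge-luis-borges/rules.css" \
  | sed -E 's|(/static/.+)/v/[^/]+/|\1/|'
# prints /static/style/rules.css
```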
Why not just use a query instead, and avoid the need for rewrites? The browser will respect it, and the server will ignore it, so you get what you want for free, right?
Because it’s nonsensical, and it’s liable to screw some things up. For one, some proxy servers reportedly mishandle caching for URLs with query strings, so you can get incorrect results if you’re using the query as a version code.
The article “Invalidating and updating cached responses” supposedly used to say something about how proxies ignore query strings.15 But it no longer says that. Check the Wayback Machine, I guess.
The simplest way to do this would be with a rewrite rule.
RewriteRule ^(.+)/v/.+?/(.+)$ $1/$2
Rewrite rules will not work from a configuration passed to mod_wsgi-express using --include-file, since the file is included after the handler is mapped. I asked about this on the modwsgi group, and Graham Dumpleton added a feature to mod_wsgi-express on the same day that I posted the question.16 So it may be that a future release will include the --rewrite-rules option, which would allow you to point to a configuration file where they would work.
But I don’t use the rewrite module anywhere else, and I’d prefer to avoid requiring it just for this. Instead, I’m using the simpler AliasMatch (from mod_alias).
There’s only one catch: mod_alias doesn’t know where the site is. It’s really designed to point to alternate locations on the file system (i.e. besides the document root you’re already mapped to), and so doesn’t resolve unrooted paths relative to the document root. To deal with that, I use the DOCUMENT_ROOT variable that is defined in all of willshake’s site configs. Note that %{DOCUMENT_ROOT} is just for mod_rewrite, despite what Apache’s documentation may lead you to believe.17
AliasMatch "^(/static/.+)/v/.+?/(.+)" "${DOCUMENT_ROOT}/$1/$2"
Now it just remains to cache versioned files “forever”:
<LocationMatch "^/static/.+/v/">
<<cache forever>>
</LocationMatch>
For example,
test "/static/style/v/abcde/some_style.css" "should be forever (versioned)"
test "/static/script/v/123456/some_module.js" "should be forever (versioned)"
test "/static/doc/v/abc123/some_document.xml" "should be forever (versioned)"
2.3 compute versions
But what are the actual version numbers? Again, the server will accept anything at all. And the number can be anything at all—so long as it changes when and only when the file’s content changes.
In other words, it’s a hash. How about an SHA1 checksum? It’s good enough for Joe Armstrong.18
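A quick check of the property we need from sha1sum: the value changes when, and only when, the content does.

```shell
# Same bytes give the same hash; different bytes give a different one.
a=$(printf 'hello' | sha1sum | cut -c1-8)
b=$(printf 'hello' | sha1sum | cut -c1-8)
c=$(printf 'hello!' | sha1sum | cut -c1-8)
echo "$a $b $c"   # a and b match; c differs
```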
The actual hashing is done in the various locations where the files are created. See versioning files in the site.
2.4 versioning the requests
Now comes the hard part—actually using the hashes.
Again, the technique is to change the references to versioned files, so that the browser is always asking for the latest version. That’s going to mean very different things for each type of file, because they’re referenced in different ways.
2.4.1 scripts
First up are scripts, because the following types will build on this. Once more, the goal is to load scripts from /static/script/v/some_version_number/module.js instead of /static/script/module.js.
Now, the “normal” way that scripts are requested on the web is through the <script> tag:
<script src="some_imaginary_script.js"></script>
In that case, all of those <script> tags would have to be rewritten to use the correct version numbers.
This would be like rewriting the instructions, which means starting over again. So yes, you’d be guaranteed to get the latest references to everything, but it’s a bit of “robbing Peter to pay Paul,” since the instruction sheet (HTML) now has the same problem that you just “solved” for the scripts. In HTTP terms, invalidating a requesting resource on the basis of a change in an external resource seems to undermine the whole point of per-resource caching.
Luckily, willshake uses a script loader. So script requests are not directly baked into the HTML but put together on-demand.
Of course, the script loader—RequireJS—is itself a script. But before I get into bootstrapping vertigo, I’ll note that RequireJS is a special case, since it’s a third-party component (i.e. written by someone else), and I don’t plan on making changes to it. So really, it can be cached “forever”:
<LocationMatch "^/static/script/require">
<<cache forever>>
</LocationMatch>
To confirm this,
test "/static/script/require.js" "should be forever (special case for require)"
test "/static/script/require-2.2.0.js" "should be forever (special case for require)"
test "/static/script/require-99.js" "should be forever (special case for require)"
For everything else, require will have to be told the “real” (i.e. fake) location of all the scripts. Fortunately, require lets you specify aliases for modules by using the paths configuration.19
This data needs to be used from the browser. For reasons that will be clear in a moment, I’m making it a plain old JavaScript variable assignment.
: $(PROGRAM)/write_file_versions | $(ROOT)/hashed/<all> \
|> ^o build hash index ^ %f %<all> > %o \
|> $(ROOT)/scripts/prologue.list/versioning_part_1.js
In the process, I’ll group the files by path, which makes it a lot smaller (although with gzip maybe it doesn’t matter).
BEGIN {
print "var ws_versions;"
# Don't bother with versioning unless you're actually running a server.
# Like in the "mobile app," for example.
print "if (window.location.protocol != 'file:')"
print " ws_versions = {"; _ = "\""
}
match($1, /^[./]*\/site\/((.*)\/)?(.*)$/, path) {
bag["/" path[1]][path[3]] = substr($2, 1, 8)
}
END {
started = 0
for (dir in bag) {
if (started) printf ","
started = 1
# Dummy first entry to avoid conditional comma in loop
printf _ dir _ ": {" _ _ ":" _ _
for (file in bag[dir])
print "," _ file _ ":" _ bag[dir][file] _
print "}"
}
print "};"
}
Note that this script can’t go into site/static/script like the other scripts, because that would create a cycle in the build graph. Why? Because it contains input from file_versions.js, which in turn contains input from hashed/*, which in turn contains input from site/static/script. The system doesn’t allow that.
Note also that as a result, the script cannot use ES2015, since it doesn’t go through the transpiler pipeline.
Scripts are the most straightforward because RequireJS lets you configure aliases up front.
This script follows the definition of ws_versions. It follows the RequireJS recommendation to use a global var when defining require before loading require.js. That is, it configures require before it’s defined, telling RequireJS where to request each module.
var require = (function() {
// Fall back to given names if versions are not defined.
if (!ws_versions) return;
var scripts = ws_versions['/static/script/'],
versioned_scripts = {};
if (scripts)
Object.keys(scripts).forEach(function(file) {
var version = scripts[file],
name = (file + '').replace(/\.js$/, '');
if (name && version)
versioned_scripts[name] = 'v/' + version + '/' + name;
});
return { paths: versioned_scripts };
}());
As a result of all this, the prologue itself must be revalidated, since it holds the keys to all of the other dynamically-loaded resources.
<Location "/static/prologue.js">
<<revalidate>>
</Location>
To confirm this,
test "/static/prologue.js" "should be revalidate (special case for prologue)"
2.4.2 transforms and data
If the HTML files are the “instructions sheet” in this analogy, then the next group of files—transforms and data—may be called the “meta-instructions,” in that they’re used to make the instructions in the first place.
Admittedly, the analogy breaks down here, because there’s no furniture manufacturer that tells you how to make the instructions themselves. Unless you think of a DIY kit that assumes you have a 3D printer. Well, willshake is that kind of site. Its framework (getflow) is all about being able to rebuild the site “from scratch,” even in the browser, just as is done in the factory (the server).
These “master plans” are never referenced by the HTML itself; rather, the HTML is their output. So there’s no issue about having to deal with their part numbers on the instruction sheet.
But they are requested by scripts when the browser starts fabricating. For this you can intercept and rewrite the names as the requests occur.
This is not currently used for any “data” documents. Doing so would just be a matter of adding them to the list in compute versions. I don’t intend to do this for the plays, since out-of-date versions won’t break anything.
require(['getflow'], getflow => {
// This version dictionary should have been set by the earlier script.
const versions = ws_versions;
// Fall back to given names if versions are not defined.
if (!versions) return;
// Now wrap the GET function to use the version hashes.
const __GET = getflow.GET;
getflow.set_GET(path => {
try {
let type, hash;
const [__, dir, file] = /^(.*\/)(.*)$/.exec(path) || [];
if (versions
&& (type = versions[dir])
&& (hash = type[file]))
path = dir + 'v/' + hash + '/' + file;
} catch (error) {
console.error("Getting file version", path, error);
}
return __GET(path);
});
});
<require module="GET_wrapper" />
TODO use versioning for getflow.xml
The site “blueprints” (a.k.a. getflow.xml) has to be current, for all the same reasons as the rest of these files. It can be versioned, and it is included in the hash list. However, right now, I don’t have a way to make getflow load it from a different path, since, by the time I hook its GET function, it’s already loaded! That can be done with some further modification to the getflow client script.
In the meantime, it must be revalidated to guarantee correctness.
<Location "/getflow.xml">
<<revalidate>>
</Location>
To confirm this,
test "/getflow.xml" "should be revalidate (TEMP special case for getflow.xml)"
Actually, I think this is the default, by virtue of its not being included in any other rule. At any rate, there’s still pending action on this.
2.4.3 stylesheets
This is the trickiest one.
In fact, I’ve decided not to version stylesheets. Instead, they use revalidation.
<LocationMatch "^/static/style/[^/]*rules\.css$">
<<revalidate>>
</LocationMatch>
It’s not that it’s technically impossible to version the stylesheets. They’re much more difficult than other types, for a number of reasons. But rather than discussing those reasons, I’m going to attack a more fundamental premise—that the stylesheets should be scoped at all.
By breaking up the style rules into “regions,” you (theoretically) reduce the initial load time of pages that require only some of the rules. But at what cost? It means that traveling into new regions will mean fetching new stylesheets just at the moment when you need them. That’s bad for the flow.
I’d rather pay for the stylesheets up-front (as many sites now pay their javascript cost), and know that all transitions from that point are going to be smooth.
test "/static/style/rules.css" "should be revalidate (stylesheet)"
test "/static/style/main-rules.css" "should be revalidate (stylesheet)"
test "/static/style/v/version/other-rules.css" "should be forever (versioned)"
Of course, if the stylesheets get too large, I’ll revisit this.
3 pre-rendering the site
While it’s useful to run application code on the server, it’s also a liability. It requires more memory and processing than serving static files, and it creates more opportunities for errors and security faults. And of course, it’s slower.
Meanwhile, the benefits of pre-rendering are many. It’s a great way to find errors and bad links, or indeed missing links. You learn just how big (or small) your site is, and how fast (or slow). It creates a portable, independent snapshot of the site, which is easily indexed. You can deploy the snapshot to take a load off of your server. Static file servers are easier to configure. Ye olde “cache invalidation” is also easier for static files, since you really have no idea whether—let alone when—a dynamic resource was “modified.” And (in a hybrid setup, anyway) you don’t need to pre-render the entire site—just as much of it as you want.
Currently, willshake.net is small enough that it’s practical to pre-generate the whole thing, and that’s what I will do here. Strictly speaking, this needn’t require a web server. Getflow itself could be modified to function as a standalone static site generator (i.e. on the command line). But right now it doesn’t do that.
Instead, you can use wget to download the site recursively. Usually this would be done from a local instance.
host="${1:-http://localhost:8080}"
out_dir="${2:-../pre-rendered}"
mkdir -p "$out_dir" # or tee will fail
wget \
--recursive --level inf \
--directory-prefix "$out_dir" \
--force-directories \
--ignore-tags 'img,audio,link,script' \
--reject-regex '\?' \
--exclude-directories 'static' \
--no-verbose \
"$host" 2>&1 \
| tee "$out_dir/log"
Note that logging may not work without the quotes around the hostname.
Without --force-directories, I found that some directories would end up lacking an index.html.
As of now, the server completely ignores any query parameters; they are only used by scripts in the browser. At the same time, wget doesn’t like them, for some reason—I think because it tries to create a filename with a question mark. So to keep everyone happy, any paths with a query are excluded using --reject-regex.
The files thus created aren’t terribly useful by themselves. But if they are mixed with the “actual” static content on the site, Apache will serve the pre-rendered pages instead of running the WSGI handler. This can be done on a per-folder basis with symlinks.
folders="${1:-plays poems about}"
port="${2:-8080}"
rendered="${3:-../pre-rendered}"
host="localhost:$port"
echo "Unlinking folders: $folders..."
for folder in $folders; do
if [ -L "site/$folder" ]; then
rm "site/$folder"
fi
done
echo "Starting an ad-hoc site on port $port..."
./start-site --port "$port" &
# Make sure that the site has had time to start up.
echo "Sleeping for 2 seconds..."
sleep 2s
echo "Removing old pre-rendered site at $rendered"
if [ -d "$rendered" ]; then
rm -r "$rendered"
fi
echo "Rendering site to $rendered"
program/render-site "http://$host" "$rendered" 2>&1 \
| awk '/URL:/ { print $3 }'
site=$(ps aux | grep "[m]od_wsgi-express.*$port" | head -n1 | awk '{print $2}')
echo "Shutting down ad-hoc site, process $site..."
kill "$site"
echo "Symlinking rendered folders to site: $folders..."
for folder in $folders; do
ln --symbolic --relative "$rendered/$host/$folder" "site/$folder"
done
The deployment script will follow those links, treating the pre-rendered pages just like any other files.
There’s one little problem with this. The rendered files don’t have an .html extension, so Apache doesn’t serve them with a Content-Type header. This doesn’t stop browsers from figuring out that they’re HTML, but it’s not good form, and will confuse some clients. So if you’re serving extensionless HTML, Apache should know about it.
<Location "/plays">
ForceType text/html
</Location>
<Location "/poems">
ForceType text/html
</Location>
<Location "/about">
ForceType text/html
</Location>
This would preferably be done without reference to the specific directories being pre-rendered. But putting that ForceType directive at the root would be more incorrect. As it is, this is only incorrect if any non-HTML files were in the pre-generated directories.20
3.1 BUG links to localhost break the above
The above will follow links to e.g. localhost:8000, which I’ve included in some places. wget will not follow other hosts by default, but that doesn’t stop it from following the same host on another port number. The result is that it tries to pre-generate both sites and havoc ensues. I’ve worked around this temporarily by marking such URLs as code, so that they don’t get rendered as links.
Footnotes:
“Persistence,” from “Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing,” RFC 7230 http://tools.ietf.org/html/rfc7230#section-6.3
“§ KeepAlive Directive” from “Apache Core Features”, Apache HTTP Server Documentation
§ 3, RFC 3023, “XML Media Types”, 2001
While this saying has been widely attributed to Phil Karlton (even by Tim Bray (see “literate programming”)), people are still skeptical. If you can’t find a satisfactory source for this either, you’re in good company: neither could Martin Fowler.
Caches can require eviction for other reasons. There is usually a limitation on how much space a cache is allowed to use, and existing items are often evicted to make room for new ones. Cached items may also be transient for security reasons. But here, we’re only concerned with cache “freshness.”
R. Fielding et al., § 13 “Caching in HTTP”, RFC 2616, “Hypertext Transfer Protocol – HTTP/1.1.” June 1999
“HTTP Cache Control max-age, must-revalidate” http://stackoverflow.com/a/8729854 See also the RFC section referenced, “Cache Revalidation and Reload Controls” http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.4
“How the sections are merged”, Apache HTTP Server Documentation
R. Fielding et al, § 14.21 “Expires”, RFC 2616, “Hypertext Transfer Protocol – HTTP/1.1.” June 1999
See also Eric Law’s 2010 post, “Use Sensible Long-Lived Cache headers,” which reports that some systems at the time erroneously treated max-age as a 32-bit integer, meaning that values in excess of 136 years (!) were being misinterpreted.
“Filesystem, Webspace, and Boolean Expressions”, Apache HTTP Server Documentation.
“Make it Work Make It Right Make It Fast”, C2 wiki. Saying attributed to Kent Beck.
“cohere”, from com- “together” + haerere “to stick” (see hesitation), Online Etymology Dictionary.
An example of this is getflow’s use of special attributes to identify inserted markup. The values rendered by the server must match the ones calculated on the client in order for page transitions to work. Yeah, it’s fragile and stupid. But it does work as long as the client and server are looking at the same blueprints. Which means that keeping files up-to-date is not just about speed, but about correctness.
It’s basically like what’s described in this article, except that this actually is the implementation, not just a cut-and-paste job. Also, this uses hashes instead of timestamps. Never use timestamps when you can use hashes: files will be re-requested if and only if they have actually changed.
See “§ Variables” in the Apache HTTP Server Documentation.
See “The Mess We’re In” from Strange Loop, September 2014 (on YouTube).
“Configuration Options”, RequireJS API
“ForceType Directive”, Apache HTTP Server Documentation.