web performance
This document covers some measures that are targeted specifically at making
willshake.net
faster.
1 speed
Speed (also known as performance) is all about not being annoying. In relationships (at least client-server relationships), there are several effective ways to not be annoying, and they all involve not asking for something over and over.
Persistence is about not asking for the same connection over and over.
Caching is about not asking for the same resources over and over.
Compression is about not asking for the same byte sequences over and over.
Fortunately, all of these techniques are built into the platform. All you have to do is ask for them (over and over).
1.1 persistent connections
Persistent connections are about “staying on the phone” with the web server when you know that you’ll be asking for a lot of things, instead of making a separate call for each one—which would be super annoying to everyone.
Consider an image gallery. It has lots of images on it. Those images are not actually contained by the gallery page, they are referenced. In order to get all of those little pictures, the browser has to request each one separately. Probably, all of those images come from the same server that was just contacted to get the gallery page itself.
Fortunately, HTTP 1.1 specifies a way to “stay on the line” with the server, using what’s called a “persistent connection.”1
Apache supports keep-alive connections, which you can enable through the KeepAlive and KeepAliveTimeout directives.2
KeepAlive On
KeepAliveTimeout 100
That handles (as I understand it) the ugly business of actually persisting the connections, but it doesn’t actually send the header indicating that persistent connections are available. For that, the header can be set explicitly:
Header set Connection keep-alive
I’ve sometimes seen two values get sent (as in keep-alive, Keep-Alive), so I suspect that for some requests this is getting added elsewhere.
1.2 compression
Compression helps speed up the overall use of a web site, on the assumption that network transmission is slower than computation. We use it.
In Apache, mod_deflate provides on-demand gzip compression for responses. It’s enabled on the basis of MIME type.
AddOutputFilterByType DEFLATE text/plain
AddOutputFilterByType DEFLATE text/html
AddOutputFilterByType DEFLATE text/css
AddOutputFilterByType DEFLATE text/xml
AddOutputFilterByType DEFLATE application/xml
AddOutputFilterByType DEFLATE text/javascript
AddOutputFilterByType DEFLATE application/javascript
AddOutputFilterByType DEFLATE application/xhtml+xml
AddOutputFilterByType DEFLATE image/svg+xml
This assumes that the deflate module is already loaded.
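To get a feel for the savings, you can compare raw and gzipped sizes of some repetitive text locally. A rough sketch (the sample text is arbitrary; real HTML and CSS compress similarly well):

```shell
# Repetitive text (like HTML or CSS) compresses dramatically.
sample=$(printf 'AddOutputFilterByType DEFLATE text/html\n%.0s' 1 2 3 4 5 6 7 8 9 10)
raw=$(printf '%s' "$sample" | wc -c)
zipped=$(printf '%s' "$sample" | gzip -c | wc -c)
echo "raw: $raw bytes, gzipped: $zipped bytes"
```

The gzipped size should come out a small fraction of the raw size, which is the whole bet behind enabling compression: network transmission is slower than computation.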
On the difference between text/xml and application/xml,
If an XML document – that is, the unprocessed, source XML document – is readable by casual users, text/xml is preferable to application/xml… Application/xml is preferable when the XML MIME entity is unreadable by casual users.3
Based on this, I think that application/xml, which is the default I’m getting from Apache, is correct.
1.3 caching
There’s an old saying in the computer world:
There are only two hard things in computer science: cache invalidation and naming things.
—Phil Karlton4
The insight is that invalidation—knowing when the cache is out-of-date—is the hard part. Caching is easy. Sure, cache everything. But how long?
—Hey willshake.net
, it’s me again!
—Yeah, what do you want?
—That hysterical scene from Much Ado About Nothing.
— Again? I just gave you that.
—Yeah but the user loves it.
—Look, Firefox, I’m kinda busy. And most of this stuff hasn’t changed in 400 years. Could you just keep copies of things instead of asking me over and over?
—Okay. For how long?
—Um… forever?
—Forever? Okay. Got it.
Time passes. Things change for willshake.net—change dramatically. But Firefox never calls again, and never knows. Finis.
What happens? We update things. The thing called willshake.net is not the same as it was yesterday, thanks to a phenomenon usually called “progress.”
If the user’s browser keeps a copy forever, then the user will never see any progress, and will assume you and your web site are derelict. Or, the browser will combine incompatible versions of old and new resources, breaking the whole thing and leading the user to the same sad conclusion.
What should happen instead? A cache should be evicted when it’s out of date (or “stale”).5
The HTTP specification describes two main caching strategies: revalidation and expiration.6
With expiration caching, resources are fresh for a pre-determined time period. The initial response includes a description of the (relative or absolute) time period. Until the cached version “expires,” the browser won’t request it at all, no matter how many times the user wants it. This is of course the fastest possible result (short of not needing the thing in the first place). But it requires you to tell in advance when you expect the resource to change. This is usually not possible to do exactly, so tends to be used with approximations. For example, if something is updated about once a day, you can use a 24-hour cache expiration. This may be a good balance, but there are still two downsides. The user won’t see your update until that 24-hour period is over, even if you just posted a change. And if you don’t post any changes for a while, the daily user will still re-check every day.
With revalidation, you still ask for the resource every time you want it, but using a “conditional request” that includes some way of identifying the version you have a copy of. If the server determines that your version is current enough, it can respond with a 304 Not Modified, which is much cheaper than re-sending the whole thing.
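The shape of that exchange can be sketched in shell, with sha1sum standing in for whatever validator scheme the server actually uses (the file contents and the 8-character tag here are made up):

```shell
# Sketch of a conditional GET: the server compares the validator the client
# sent (If-None-Match) against the resource's current ETag.
resource=$(mktemp)
printf '<html>Much Ado</html>' > "$resource"
etag=$(sha1sum "$resource" | cut -c1-8)   # server's current validator
if_none_match="$etag"                     # client echoes back the ETag it has
if [ "$if_none_match" = "$etag" ]; then
  echo "304 Not Modified"                 # no body re-sent: cheap
else
  echo "200 OK"                           # full response, with a new ETag
fi
```

The 304 branch is the payoff: the request still costs a round trip, but not a transfer.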
We’re going to use both methods, sometimes on the same resource. There is no one-size-fits-all approach, even within willshake. This has to be done on a per-type basis.
The following sections cover the different caching strategies in increasing order of unstraightforwardness.
1.3.1 default
There are thousands of locations in willshake.net. Not every one will be covered by a specific caching strategy. I don’t want to leave those uncovered resources out in the cold, so I’ll start with a couple of general rules.
First of all, nothing in the site is un-cacheable. Everything can at least use revalidation.
Header set Cache-Control "must-revalidate, public, max-age=0"
On the meaning of must-revalidate without max-age, there is some dispute.7 In practice, you can’t rely on browsers revalidating without an expiration directive, and I don’t blame them. The case can reasonably be construed as undefined. When in doubt, cache.
Um, riiiight, but how exactly would it be revalidated? You don’t have a Last-Modified date, and you don’t have an ETag… how in the world would you honor a conditional GET?
That said, this is a good default for pre-rendered pages, which are static and can support conditional GETs, and it’s a good fallback generally. As for the objection above: this policy should have no effect on generated content anyway, which is fine. But I haven’t tested it.
This will cover all generated content, which is in various locations. In willshake, only the HTML is generated. But each HTML document can be affected by changes to any number of files, including getflow itself. So it’s impractical to determine a “real” modified date. Thus, as a general matter, generated content must be revalidated.
Files are simpler. They’re extremely well suited for revalidation caching—at least if their timestamps are to be trusted. But most files are not so important that they need to be revalidated every time. Check back in a week.
<LocationMatch "^/(favicon|static/).*">
Header set Cache-Control "max-age=604800, public"
</LocationMatch>
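When one of those files is revalidated, the timestamp is the validator. A sketch of what Apache would send as the Last-Modified header, using GNU date and touch on a made-up file:

```shell
# A static file's mtime becomes its Last-Modified header (HTTP date format).
f=$(mktemp)
touch -d '2016-01-02 03:04:05 UTC' "$f"
date -u -r "$f" '+%a, %d %b %Y %H:%M:%S GMT'
# prints: Sat, 02 Jan 2016 03:04:05 GMT
```

Hence the caveat about trusting timestamps: if a deployment rewrites mtimes wholesale, every “unchanged” file suddenly looks modified.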
You might be wondering why not use something like
<If "-f %{REQUEST_FILENAME}">
to distinguish static from generated content. That won’t work, because Apache inexplicably processes If directives after Location (and LocationMatch) directives.8
I guess this is better, since it means no file system access is needed to set the header. But still.
I’m not persuaded by the above… something like this should absolutely be usable to set a header.
<If "-f %{REQUEST_FILENAME}">
Header set X-Debug-Static yes
</If>
Riiight, but this is still 24 hours.
Actually, there’s one more static file not covered above: robots.txt. When I have a problem with robots.txt, I want to correct it right away. I don’t want to wait a week. robots.txt is not requested by ordinary visitors, anyway.
<Location "/robots.txt">
Header set Cache-Control "max-age=86400, public"
</Location>
I don’t know whether search engines pay any attention to the expiration header on robots.txt. This is in case they do.
Some test cases,
test "/plays/Ham" "should be revalidate (default)"
test "/robots.txt" "should be one-day expiration (special case for robots.txt)"
test "/favicon.ico" "should be one-week expiration (special static file)"
test "/static/something/or-other" "should be one-week expiration (general static file)"
These general rules have to come first. If a resource is covered by one of the more specific policies below, it will override this.
1.3.2 cache “forever”
I said that caching things forever is a bad idea. That’s what all this is about.
But, sometimes it’s a good idea. If you really know that something isn’t going to change, then you can safely cache it “forever.”
One problem, though: there is no “forever” in HTTP. Expirations are specified in “delta seconds,” meaning the number of seconds since getting the resource. And although there is no specific limit on the value of max-age, the spec does urge people not to set expiration periods longer than one year:
To mark a response as “never expires,” an origin server sends an Expires date approximately one year from the time the response is sent. HTTP/1.1 servers SHOULD NOT send Expires dates more than one year in the future.9
Fair enough. In practice, a file is very unlikely to stay in someone’s cache longer than a year, anyway.
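For reference, that one-year ceiling in delta seconds (taking a 365-day year) is just arithmetic:

```shell
# max-age is expressed in "delta seconds"; a 365-day year is the
# spec's suggested ceiling for "never expires".
year=$((365 * 24 * 60 * 60))
echo "$year"   # prints 31536000
```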
Header set Cache-Control "max-age=31536000, public"
Okay, so what gets cached “forever”?
For willshake, this includes the font files. The font files won’t change. They might be reconstituted at some point (for better efficiency, or to correct problems). In such cases, the names can be changed manually.
<Location "/static/style/fonts/">
<<cache forever>>
</Location>
For example.
test "/static/style/fonts/some-font-file" "should be forever (font)"
I don’t think this applies to anything else.
1.3.3 cache for a long time
This goes for images. It’s a tricky case: although the images seldom change, they can change for various reasons (and already have), and tracking those changes, along with all references to the images, isn’t a good tradeoff.
1.3.4 versioning
The trickiest caching strategy will apply to those files that change the most often—which are also needed the most often. For these files, the tradeoffs of simpler methods are unacceptable.
Instead, these files use a strategy called versioning. Since this does not have strictly to do with caching, it is covered in the section coherence.
1.3.5 force revalidate
Again, everything around here is cacheable. The “worst case” is that things need to be revalidated. And that is the default, specified earlier. But “static” files get a one-week expiration, on the assumption that—with the exception of the versioned files—there will be no particular harm in having stale versions.
But I use the “about” documents as a program reference, and so I want them to always be up-to-date. So I make them revalidate, just for my own purposes.
<Location "/static/doc/about/">
<<revalidate>>
</Location>
And to confirm that,
test "/static/doc/about/some-document.xml" "should be revalidate (about document)"
1.3.6 testing cache policies
Okay, let’s try it.
host="${1:-https://willshake.net}"
test() {
path="$1"
message="$2"
echo "$path : $message"
curl --head --insecure "$host$path" 2>/dev/null | grep Cache-Control
echo
}
<<caching tests>>
Except that they do now… that I’m not using mod_wsgi-express. Does this have to do with the Alias rules?
Note that the test cases don’t need to point to files that actually exist. That’s because the configuration rules refer to “webspace” (the URL seen by the user) and not the file system (seen by the server).10
2 coherence
Make it work, make it right, make it fast.11
Web sites are like mail-order furniture. They have a bunch of parts (files), and they come from somewhere far away (a server). And some poor lackey (your browser) has to put the thing together.
Sometimes, just to be nice, they send you extra parts, in case you lose or break something. If you have room, you stash them somewhere (your cache).
Where things get interesting is, suppose you really like the thing and you order another one (revisit the site).
Of course, you want it as fast as possible. And because of limited inventory (bandwidth, server capacity), the fewer parts you have to order, the better. Your browser looks around at your cache and says, Hey, I already have part ‘Z’. I’ll just order the parts we need, and it’ll get here sooner.
Meanwhile, the desk designers (the site creators), they aren’t twiddling their thumbs, either. They keep improving their designs, and the latest model always has a slightly different set of parts.
And very often the newer designs obsolete the old ones. You can’t just put something together from an incompatible mixture of parts. You’ll end up with a broken, ugly mess. But your browser’s going to try anyway. The alternative is to start over and get all new parts (refresh). That’s going to take forever.
So you can see that there’s a conflict between speed and coherence when fetching web sites that change regularly. Interestingly, the word “cohere,” which is often defined as “to stick together,” is closely related to “hesitate” and could be interpreted as “tarry together."12 Being a coherent unit means being only as fast as your slowest part.
Now, some web sites might be designed defensively, so that this mixture of old and new parts is not a problem. Willshake is not one of those sites. It’s meant to be one coherent thing. All of the files that constitute willshake’s “blueprints” must be current.13
As to the difference between “make it work” and “make it right,” there is still some debate. But it’s clear that where correctness and speed are in conflict, correctness must take precedence.
Of course, if correctness were all you cared about, then you’d just say don’t cache anything. Since that’s not an option, this is about balancing correctness with efficiency.
2.1 the trick
The problem with the previous system is that the instructions use the same name to mean different things. Part ‘Z’ in one assembly might be slightly different in the next, but it’s still called part ‘Z’ because its role in the structure is the same.
The process, then, is biased towards first-time buyers (empty cache), who have to get all the parts anyway, so will automatically get the latest (and compatible) versions.
So if you really need to get the latest part ‘Z’, what’s the fastest way? If you already have a ‘Z’ in your junk drawer (cache), you could call the vendor to check whether or not the part has been updated (must-revalidate). You’d have to tell them some identifying mark, or the date that you got it (ETag, Last-Modified), and they’d tell you whether it was current (304 Not Modified). It still takes time, but it’s much faster than the mail. If you don’t, well, you’d have to order a new one anyway, and you only spent a little time asking (conditional GET).
The trick is that each time a resource changes, it is considered a new resource. This is done simply by giving it a new name.
Versioned resources are treated as immutable, so they can be cached forever. New versions will always be fetched as soon as they are needed, since, as far as the browser is concerned, it’s never seen them before.
So yeah. I’m solving a cache invalidation problem by naming things. Take that, Phil Karlton.
Of course, for this to work, you have to be willing and able to change the name in the place where requests are made. And you have to put the name and version into a table that itself can’t use an expiration date.
I won’t always be willing and able to do that. This will go for the stylesheets, the scripts, the transforms, and some of the documents.
Anyway, let’s get to it.
Specifically, the plan is to support versioned resources so that:
- they are served with an effectively infinite cache expiration
- they are referenced under a directory based on a hash of their content (e.g. /static/style/v/8fb3aa03/some.css)
- such directories are routed to remove the /v/hash part14
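A sketch of how such a URL could be assembled, assuming an 8-character prefix of a SHA1 content hash (the file and path here are made up):

```shell
# Build a hash-versioned URL for a hypothetical stylesheet.
file=$(mktemp)
printf 'body { color: red }' > "$file"
hash=$(sha1sum "$file" | cut -c1-8)   # first 8 hex chars of the content hash
echo "/static/style/v/$hash/some.css"
```

Any edit to the file changes the hash, which changes the URL, which makes the browser treat it as a brand-new resource.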
2.2 fake immutable resources
The server’s job in this con is to accept requests like /static/style/v/ABRACADABRA-jorge-luis-borges/rules.css—which is not a file on the server—and serve whatever’s at /static/style/rules.css as if it were never going to change.
The “version” number (/v/ABRACADABRA-jorge-luis-borges) means nothing at all, and can be anything. So anything like /v/{whatever} will just be stripped out of the path.
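The stripping amounts to a single substitution, sketched here with sed standing in for the server-side rule:

```shell
# Strip the /v/{whatever}/ segment from a versioned path.
echo "/static/style/v/ABRACADABRA-jorge-luis-borges/rules.css" \
  | sed -E 's|(/static/.+)/v/[^/]+/|\1/|'
# prints /static/style/rules.css
```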
Why not just use a query instead, and avoid the need for rewrites? The browser will respect it, and the server will ignore it, so you get what you want for free, right?
Because it’s nonsensical, and it’s liable to screw some things up. For one, some proxy servers reportedly mishandle caching for URLs with query strings, so you can get incorrect results if you’re using the query as a version code.
The article “Invalidating and updating cached responses” supposedly used to say something about how proxies ignore query strings.15 But it no longer says that. Check the Wayback Machine, I guess.
The simplest way to do this would be with a rewrite rule.
RewriteRule ^(.+)/v/.+?/(.+)$ $1/$2
Rewrite rules will not work from a configuration passed to mod_wsgi-express using --include-file, since the file is included after the handler is mapped. I asked about this on the modwsgi group, and Graham Dumpleton added a feature to mod_wsgi-express on the same day that I posted the question.16 So it may be that a future release will include the --rewrite-rules option, which would allow you to point to a configuration file where they would work.
But I don’t use the rewrite module anywhere else, and I’d prefer to avoid requiring it just for this. Instead, I’m using the simpler AliasMatch (from mod_alias).
There’s only one catch: mod_alias doesn’t know where the site is. It’s really designed to point to alternate locations on the file system (i.e. besides the document root you’re already mapped to), and so doesn’t resolve unrooted paths relative to the document root. To deal with that, I use the DOCUMENT_ROOT variable that is defined in all of willshake’s site configs. Note that %{DOCUMENT_ROOT} is just for mod_rewrite, despite what Apache’s documentation may lead you to believe.17
AliasMatch "^(/static/.+)/v/.+?/(.+)" "${DOCUMENT_ROOT}/$1/$2"
Now it just remains to cache versioned files “forever”:
<LocationMatch "^/static/.+/v/">
<<cache forever>>
</LocationMatch>
For example,
test "/static/style/v/abcde/some_style.css" "should be forever (versioned)"
test "/static/script/v/123456/some_module.js" "should be forever (versioned)"
test "/static/doc/v/abc123/some_document.xml" "should be forever (versioned)"
2.3 compute versions
But what are the actual version numbers? Again, the server will accept anything at all. And the number can be anything at all—so long as it changes when and only when the file’s content changes.
In other words, it’s a hash. How about an SHA1 checksum? It’s good enough for Joe Armstrong.18
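A quick check of the property we need from sha1sum: the value changes when, and only when, the content does.

```shell
# Same bytes give the same hash; different bytes give a different one.
a=$(printf 'hello' | sha1sum | cut -c1-8)
b=$(printf 'hello' | sha1sum | cut -c1-8)
c=$(printf 'hello!' | sha1sum | cut -c1-8)
echo "$a $b $c"   # a and b match; c differs
```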
The actual hashing is done in the various locations where the files are created. See versioning files in the site.
2.4 versioning the requests
Now comes the hard part—actually using the hashes.
Again, the technique is to change the references to versioned files, so that the browser is always asking for the latest version. That’s going to mean very different things for each type of file, because they’re referenced in different ways.
2.4.1 scripts
First up are scripts, because the following types will build on this. Once more, the goal is to load scripts from /static/script/v/some_version_number/module.js instead of /static/script/module.js.
Now, the “normal” way that scripts are requested on the web is through the <script> tag:
<script src="some_imaginary_script.js"></script>
In that case, all of those <script> tags would have to be rewritten to use the correct version numbers.
This would be like rewriting the instructions, which means starting over again. So yes, you’d be guaranteed to get the latest references to everything, but it’s a bit of “robbing Peter to pay Paul,” since the instruction sheet (HTML) now has the same problem that you just “solved” for the scripts. In HTTP terms, invalidating a requesting resource on the basis of a change in an external resource seems to undermine the whole point of per-resource caching.
Luckily, willshake uses a script loader. So script requests are not directly baked into the HTML but put together on-demand.
Of course, the script loader—RequireJS—is itself a script. But before I get into bootstrapping vertigo, I’ll note that RequireJS is a special case, since it’s a third-party component (i.e. written by someone else), and I don’t plan on making changes to it. So really, it can be cached “forever”:
<LocationMatch "^/static/script/require">
<<cache forever>>
</LocationMatch>
To confirm this,
test "/static/script/require.js" "should be forever (special case for require)"
test "/static/script/require-2.2.0.js" "should be forever (special case for require)"
test "/static/script/require-99.js" "should be forever (special case for require)"
For everything else, require will have to be told the “real” (i.e. fake) location of all the scripts. Fortunately, require lets you specify aliases for modules by using the paths configuration.19
This data needs to be used from the browser. For reasons that will be clear in a moment, I’m making it a plain old JavaScript variable assignment.
: $(PROGRAM)/write_file_versions | $(ROOT)/hashed/<all> \
|> ^o build hash index ^ %f %<all> > %o \
|> $(ROOT)/scripts/prologue.list/versioning_part_1.js
In the process, I’ll group the files by path, which makes it a lot smaller (although with gzip maybe it doesn’t matter).
BEGIN {
print "var ws_versions;"
# Don't bother with versioning unless you're actually running a server.
# Like in the "mobile app," for example.
print "if (window.location.protocol != 'file:')"
print " ws_versions = {"; _ = "\""
}
match($1, /^[./]*\/site\/((.*)\/)?(.*)$/, path) {
bag["/" path[1]][path[3]] = substr($2, 1, 8)
}
END {
started = 0
for (dir in bag) {
if (started) printf ","
started = 1
# Dummy first entry to avoid conditional comma in loop
printf _ dir _ ": {" _ _ ":" _ _
for (file in bag[dir])
print "," _ file _ ":" _ bag[dir][file] _
print "}"
}
print "};"
}
Note that this script can’t go into site/static/script like the other scripts, because that would create a cycle in the build graph. Why? Because it contains input from file_versions.js, which in turn contains input from hashed/*, which in turn contains input from site/static/script. The system doesn’t allow that.
Note also that as a result, the script cannot use ES2015, since it doesn’t go through the transpiler pipeline.
Scripts are the most straightforward because RequireJS lets you configure aliases up front.
This script follows the definition of ws_versions. It follows the RequireJS recommendation to use a global var when defining require before loading require.js. That is, it configures require before it’s defined, telling RequireJS where to request each module.
var require = (function() {
// Fall back to given names if versions are not defined.
if (!ws_versions) return;
var scripts = ws_versions['/static/script/'],
versioned_scripts = {};
if (scripts)
Object.keys(scripts).forEach(function(file) {
var version = scripts[file],
name = (file + '').replace(/\.js$/, '');
if (name && version)
versioned_scripts[name] = 'v/' + version + '/' + name;
});
return { paths: versioned_scripts };
}());
As a result of all this, the prologue itself must be revalidated, since it holds the keys to all of the other dynamically-loaded resources.
<Location "/static/prologue.js">
<<revalidate>>
</Location>
To confirm this,
test "/static/prologue.js" "should be revalidate (special case for prologue)"
2.4.2 transforms and data
If the HTML files are the “instructions sheet” in this analogy, then the next group of files—transforms and data—may be called the “meta-instructions,” in that they’re used to make the instructions in the first place.
Admittedly, the analogy breaks down here, because there’s no furniture manufacturer that tells you how to make the instructions themselves. Unless you think of a DIY kit that assumes you have a 3D printer. Well, willshake is that kind of site. Its framework (getflow) is all about being able to rebuild the site “from scratch,” even in the browser, just as is done in the factory (the server).
These “master plans” are never referenced by the HTML itself; rather, the HTML is their output. So there’s no issue about having to deal with their part numbers on the instruction sheet.
But they are requested by scripts when the browser starts fabricating. For this you can intercept and rewrite the names as the requests occur.
This is not currently used for any “data” documents. Doing so would just be a matter of adding them to the list in compute versions. I don’t intend to do this for the plays, since out-of-date versions won’t break anything.
require(['getflow'], getflow => {
// This version dictionary should have been set by the earlier script.
const versions = ws_versions;
// Fall back to given names if versions are not defined.
if (!versions) return;
// Now wrap the GET function to use the version hashes.
const __GET = getflow.GET;
getflow.set_GET(path => {
try {
let type, hash;
const [__, dir, file] = /^(.*\/)(.*)$/.exec(path) || [];
if (versions
&& (type = versions[dir])
&& (hash = type[file]))
path = dir + 'v/' + hash + '/' + file;
} catch (error) {
console.error("Getting file version", path, error);
}
return __GET(path);
});
});
<require module="GET_wrapper" />
TODO use versioning for getflow.xml
The site “blueprints” (a.k.a. getflow.xml) has to be current, for all the same reasons as the rest of these files. It can be versioned, and it is included in the hash list. However, right now, I don’t have a way to make getflow load it from a different path, since, by the time I hook its GET function, it’s already loaded! That can be done with some further modification to the getflow client script.
In the meantime, it must be revalidated to guarantee correctness.
<Location "/getflow.xml">
<<revalidate>>
</Location>
To confirm this,
test "/getflow.xml" "should be revalidate (TEMP special case for getflow.xml)"
Actually, I think this is the default, by virtue of its not being included in any other rule. At any rate, there’s still pending action on this.
2.4.3 stylesheets
This is the trickiest one.
In fact, I’ve decided not to version stylesheets. Instead, they use revalidation.
<LocationMatch "^/static/style/[^/]*rules\.css$">
<<revalidate>>
</LocationMatch>
It’s not that it’s technically impossible to version the stylesheets. They’re much more difficult than other types, for a number of reasons. But rather than discussing those reasons, I’m going to attack a more fundamental premise—that the stylesheets should be scoped at all.
By breaking up the style rules into “regions,” you (theoretically) reduce the initial load time of pages that require only some of the rules. But at what cost? It means that traveling into new regions will mean fetching new stylesheets just at the moment when you need them. That’s bad for the flow.
I’d rather pay for the stylesheets up-front (as many sites now pay their javascript cost), and know that all transitions from that point are going to be smooth.
test "/static/style/rules.css" "should be revalidate (stylesheet)"
test "/static/style/main-rules.css" "should be revalidate (stylesheet)"
test "/static/style/v/version/other-rules.css" "should be forever (versioned)"
Of course, if the stylesheets get too large, I’ll revisit this.
3 pre-rendering the site
While it’s useful to run application code on the server, it’s also a liability. It requires more memory and processing than serving static files, and it creates more opportunities for errors and security faults. And of course, it’s slower.
Meanwhile, the benefits of pre-rendering are many. It’s a great way to find errors and bad links, or indeed missing links. You learn just how big (or small) your site is, and how fast (or slow). It creates a portable, independent snapshot of the site, which is easily indexed. You can deploy the snapshot to take a load off of your server. Static file servers are easier to configure. Ye olde “cache invalidation” is also easier for static files, since you really have no idea whether—let alone when—a dynamic resource was “modified.” And (in a hybrid setup, anyway) you don’t need to pre-render the entire site—just as much of it as you want.
Currently, willshake.net is small enough that it’s practical to pre-generate the whole thing, and that’s what I will do here. Strictly speaking, this needn’t require a web server. Getflow itself could be modified to function as a standalone static site generator (i.e. on the command line). But right now it doesn’t do that.
Instead, you can use wget to download the site recursively. Usually this would be done from a local instance.
host="${1:-http://localhost:8080}"
out_dir="${2:-../pre-rendered}"
mkdir -p "$out_dir" # or tee will fail
wget \
--recursive --level inf \
--directory-prefix "$out_dir" \
--force-directories \
--ignore-tags 'img,audio,link,script' \
--reject-regex '\?' \
--exclude-directories 'static' \
--no-verbose \
"$host" 2>&1 \
| tee "$out_dir/log"
Note that logging may not work without the quotes around the hostname.
Without --force-directories, I found that some directories would end up lacking an index.html.
As of now, the server completely ignores any query parameters; they are only used by scripts in the browser. At the same time, wget doesn’t like them, for some reason—I think because it tries to create a filename with a question mark. So to keep everyone happy, any paths with a query are excluded using --reject-regex.
The files thus created aren’t terribly useful by themselves. But if they are mixed with the “actual” static content on the site, Apache will serve the pre-rendered pages instead of running the WSGI handler. This can be done on a per-folder basis with symlinks.
folders="${1:-plays poems about}"
port="${2:-8080}"
rendered="${3:-../pre-rendered}"
host="localhost:$port"
echo "Unlinking folders: $folders..."
for folder in $folders; do
if [ -L "site/$folder" ]; then
rm "site/$folder"
fi
done
echo "Starting an ad-hoc site on port $port..."
./start-site --port "$port" &
# Make sure that the site has had time to start up.
echo "Sleeping for 2 seconds..."
sleep 2s
echo "Removing old pre-rendered site at $rendered"
if [ -d "$rendered" ]; then
rm -r "$rendered"
fi
echo "Rendering site to $rendered"
program/render-site "http://$host" "$rendered" 2>&1 \
| awk '/URL:/ { print $3 }'
site=$(ps aux | grep "[m]od_wsgi-express.*$port" | head -n1 | awk '{print $2}')
echo "Shutting down ad-hoc site, process $site..."
kill "$site"
echo "Symlinking rendered folders to site: $folders..."
for folder in $folders; do
ln --symbolic --relative "$rendered/$host/$folder" "site/$folder"
done
The deployment script will follow those links, treating the pre-rendered pages just like any other files.
There’s one little problem with this. The rendered files don’t have an .html extension, so Apache doesn’t serve them with a Content-Type header. This doesn’t stop browsers from figuring out that they’re HTML, but it’s not good form, and will confuse some clients. So if you’re serving extensionless HTML, Apache should know about it.
<Location "/plays">
ForceType text/html
</Location>
<Location "/poems">
ForceType text/html
</Location>
<Location "/about">
ForceType text/html
</Location>
This would preferably be done without reference to the specific directories being pre-rendered. But putting that ForceType directive at the root would be more incorrect. As it is, this is only incorrect if any non-HTML files were in the pre-generated directories.20
3.1 BUG links to localhost break the above
The above will follow links to e.g. localhost:8000, which I’ve included in some places. wget will not follow other hosts by default, but that doesn’t stop it from following the same host on another port number. The result is that it tries to pre-generate both sites and havoc ensues. I’ve worked around this temporarily by marking such URLs as code, so that they don’t get rendered as links.
Footnotes:
“Persistence,” from “Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing,” RFC 7230 http://tools.ietf.org/html/rfc7230#section-6.3
“§ KeepAlive Directive” from “Apache Core Features”, Apache HTTP Server Documentation
§ 3, RFC 3023, “XML Media Types”, 2001
While this saying has been widely attributed to Phil Karlton (even by Tim Bray (see “literate programming”)), people are still skeptical. If you can’t find a satisfactory source for this either, you’re in good company: neither could Martin Fowler.
Caches can require eviction for other reasons. There is usually a limitation on how much space a cache is allowed to use, and existing items are often evicted to make room for new ones. Cached items may also be transient for security reasons. But here, we’re only concerned with cache “freshness.”
R. Fielding et al., § 13 “Caching in HTTP”, RFC 2616, “Hypertext Transfer Protocol – HTTP/1.1.” June 1999
“HTTP Cache Control max-age, must-revalidate” http://stackoverflow.com/a/8729854 See also the RFC section referenced, “Cache Revalidation and Reload Controls” http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.4
“How the sections are merged”, Apache HTTP Server Documentation
R. Fielding et al, § 14.21 “Expires”, RFC 2616, “Hypertext Transfer Protocol – HTTP/1.1.” June 1999
See also Eric Law’s 2010 post, “Use Sensible Long-Lived Cache headers,” which reports that some systems at the time erroneously treated max-age as a 32-bit integer, meaning that values in excess of 136 years (!) were being misinterpreted.
“Filesystem, Webspace, and Boolean Expressions”, Apache HTTP Server Documentation.
“Make it Work Make It Right Make It Fast”, C2 wiki. Saying attributed to Kent Beck.
“cohere”, from com- “together” + haerere “to stick” (see hesitation), Online Etymology Dictionary.
An example of this is getflow’s use of special attributes to identify inserted markup. The values rendered by the server must match the ones calculated on the client in order for page transitions to work. Yeah, it’s fragile and stupid. But it does work as long as the client and server are looking at the same blueprints. Which means that keeping files up-to-date is not just about speed, but about correctness.
It’s basically like what’s described in this article, except that this actually is the implementation, not just a cut-and-paste job. Also, this uses hashes instead of timestamps. Never use timestamps when you can use hashes: files will be re-requested if and only if they have actually changed.
See “§ Variables” in the Apache HTTP Server Documentation.
See “The Mess We’re In” from Strange Loop, September 2014 (on YouTube).
“Configuration Options”, RequireJS API
“ForceType Directive”, Apache HTTP Server Documentation.