Reducing 404 error pages on your website
Do you know how many resources are consumed by your origin web servers to determine that a requested URL is non-existent and render the Not Found response page?
Many site owners cannot answer this question, and most of the time this is fine because a well-managed site will rarely need to serve a 404 under normal conditions.
With the effort that goes into marketing and SEO to bring traffic to your site, good site managers will ensure that old pages that are no longer appropriate will not simply be deleted but will instead redirect to a new URL with relevant content. Serving a 404 would be a lost opportunity.
Unfortunately, pages that once existed and no longer do are not the only source of Not Found responses. Every now and then a site will get scanned. That is, it will receive a barrage of requests for many different URLs to see which serve an interesting response. This is much like a search engine crawler but potentially malicious or, at the very least, suspicious.
Sometimes the URLs requested will take the form of possible administrator login URLs for different combinations of web platforms and configurations. Other times the requests will be for URLs that look like common names of data export files or website backups.
Often the number of requests will easily be in the thousands.
That’s a lot of HTTP 404s
From our experience improving the websites of our customers, we see many 404 pages that have a significant impact on origin web server resources.
Sites today are typically more dynamic than they are static. Determining whether a URL exists is no longer a matter of testing for the presence of a file on disk; the server must instead execute web application code.
This is somewhat worsened by sites preferring “clean” URLs, i.e. omitting the traditional .aspx suffixes. I’m not suggesting that friendly URLs should be avoided, merely highlighting that this has become a contributing factor to 404 processing costs.
Once the request is being handled by the application code we then see additional costs. The most common is the establishing of server-side session state. This server state is often never used again as the scanning clients rarely honour session cookies. Eventually this can lead to all the available session state storage being consumed and along the way potentially slowing access to the state for legitimate users.
Another pattern that regularly appears is a Not Found response that queries the database for a list of categories or products that may be relevant, in order to provide helpful links on the resulting 404 page. These database queries can be expensive and compete for database resources needed to serve requests for other users.
For occasional requests that result in a 404, performing this work is a minor concern. But when many of these requests arrive in a short time frame, this workload can quickly overwhelm the web servers and leave them unable to efficiently handle legitimate requests from users trying to purchase your goods or services.
Sadly, the solution is not trivial.
Caching at the CDN is often ineffective in this scenario for two reasons:
- Every requested URL is different. The scanner does not try the same URL twice, so every request is a cache miss and is proxied to the origin.
- Understandably people are often reluctant to cache 404 responses in general in case it impacts the deployment of a new page, or the cache space consumed by 404 responses competes with other resources needed for real responses.
It is possible to get protection from a Web Application Firewall (WAF), depending on the nature of the scan. If there are reliable patterns in the requests, a WAF can be configured to reject them before they reach the origin.
Such patterns might be:
- A common User-Agent header present in all requests that is different from any browsers your real users use.
- A common client IP address or range (if the scan is not distributed).
- Request headers that are sent by modern browsers but are absent in requests from the scanner.
There are two caveats though:
- You need to be wary of false positives, i.e. using a pattern that blocks requests from your real users too.
- Access to your website can be disrupted before you are able to identify the patterns and configure the WAF to block them.
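To make the three kinds of pattern concrete, here is a minimal Python sketch of the checks a WAF rule might encode. It is not the configuration language of any particular WAF. The scanner name, the IP range (a reserved documentation range) and the list of expected browser headers are all hypothetical values chosen for illustration; real values would come from your own logs.

```python
from ipaddress import ip_address, ip_network

# Hypothetical values for illustration only; derive real ones from your logs.
BLOCKED_USER_AGENTS = ("examplescanner/1.0",)
BLOCKED_NETWORKS = (ip_network("203.0.113.0/24"),)  # reserved documentation range
EXPECTED_BROWSER_HEADERS = ("accept-language", "accept-encoding")


def looks_like_scanner(client_ip: str, headers: dict[str, str]) -> bool:
    """Return True when a request matches any of the scan patterns."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}

    # Pattern 1: a User-Agent that none of your real users send.
    user_agent = lowered.get("user-agent", "").lower()
    if any(marker in user_agent for marker in BLOCKED_USER_AGENTS):
        return True

    # Pattern 2: a client address inside a known scanning range.
    address = ip_address(client_ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True

    # Pattern 3: headers that modern browsers send but the scanner omits.
    return any(name not in lowered for name in EXPECTED_BROWSER_HEADERS)
```

Of the three checks, the missing-header rule is probably the most likely to produce false positives, since API clients and older browsers may legitimately omit those headers.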
Often the best solution is to implement 404 handling in your web application efficiently.
If possible, serve static content for 404 responses. If that’s not feasible, consider caching the results of the database queries in the application to avoid the cost of querying the category list every time, for example.
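As a rough illustration of caching the query results in the application, the following Python sketch keeps the result of an expensive lookup in memory for a fixed period. The function `suggested_categories`, its return values and the five minute lifetime are assumptions made for the example; the `query_log` list exists only to show how often the "database" is actually hit.

```python
import functools
import time


def cached_for(ttl_seconds, clock=time.monotonic):
    """Cache a zero-argument function's result for ttl_seconds."""
    def decorator(func):
        state = {"expires": None, "value": None}

        @functools.wraps(func)
        def wrapper():
            now = clock()
            # Recompute only on first use or once the cached value has expired.
            if state["expires"] is None or now >= state["expires"]:
                state["value"] = func()
                state["expires"] = now + ttl_seconds
            return state["value"]

        return wrapper
    return decorator


query_log = []  # records each simulated database query


@cached_for(ttl_seconds=300)
def suggested_categories():
    """Hypothetical stand-in for the expensive query behind the 404 page."""
    query_log.append("SELECT name FROM categories")
    return ["Shoes", "Hats", "Sale"]
```

With this in place, a burst of thousands of 404s within five minutes would trigger one query rather than thousands. The sketch is not thread-safe, so a multi-threaded application would need a lock around the refresh or a cache provided by its framework.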
Also, consider changing the server session state logic so that it doesn’t create or update a session in response to a request for a resource that is not found.
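How this is done depends entirely on your web framework, but the idea can be sketched in a few lines of Python. The `LazySession` class, the `SESSION_STORE` dictionary and the `finish_request` function below are hypothetical names, not part of any framework: the session only touches storage if application code wrote to it, and never for a 404 response.

```python
class LazySession:
    """Session data that remembers whether anything was written to it."""

    def __init__(self):
        self.data = {}
        self.modified = False

    def __setitem__(self, key, value):
        self.data[key] = value
        self.modified = True

    def get(self, key, default=None):
        return self.data.get(key, default)


# Hypothetical stand-in for the real session storage (database, Redis, etc.).
SESSION_STORE = {}


def finish_request(session_id, status_code, session):
    """Persist the session only for found resources that actually used it."""
    if status_code == 404 or not session.modified:
        return False  # nothing is stored for scanner traffic
    SESSION_STORE[session_id] = dict(session.data)
    return True
```

A request that ends in a 404 then leaves no trace in session storage, even if the application code wrote to the session along the way.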
If you’re really feeling adventurous you could try pre-generating a white list of all possible URLs that are valid on your site. This list would be similar in concept to a sitemap.xml which you may already have. This list could then be used to create some Varnish Cache VCL that would reject anything not on that list before it even reaches your origin web servers.
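One way to experiment with this idea is a small script that reads a sitemap and prints a VCL fragment. The Python sketch below is only a starting point under several assumptions: the sitemap uses the standard sitemaps.org namespace, the VCL targets Varnish 4 or later syntax, and the function names are invented for this example. It also assumes that no path contains a double quote, which would break the generated VCL string.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def paths_from_sitemap(sitemap_xml):
    """Return the sorted, de-duplicated URL paths listed in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    paths = {
        urlsplit(loc.text.strip()).path or "/"
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }
    return sorted(paths)


def allowlist_regex(paths):
    """Build a regex matching any listed path, with an optional query string."""
    alternatives = "|".join(re.escape(path) for path in paths)
    return "^(" + alternatives + ")(\\?.*)?$"


def render_vcl(paths):
    """Render a vcl_recv fragment that answers 404 for unlisted URLs."""
    return (
        "sub vcl_recv {\n"
        '    if (req.url !~ "' + allowlist_regex(paths) + '") {\n'
        '        return (synth(404, "Not Found"));\n'
        "    }\n"
        "}\n"
    )
```

Bear in mind that a sitemap usually lists pages only. Static assets such as stylesheets, scripts and images would need to be added to the list, or excluded from the check, before a rule like this could be used safely.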