I wonder - the big surge happened over a bank holiday weekend, and there's another coming up soon. The thought occurred to me after hearing a radio programme about the surge in the number of companies being registered at Companies House using random addresses of unsuspecting British folk. The chap investigating that phenomenon found it was likely down to a recent ban, by their own government, on Chinese companies dealing in cryptocurrency. The traffic suddenly dropped for a few days, which he found coincided with a public holiday in China. Could it be that the opposite works here, and attempts at getting into networks rise when IT support staff are likely to be thinner on the ground? Except at the Coffee Shop, of course.
There used to be an hour-by-hour pattern, with the server consistently three times as busy during the day as at night, and with weekend peaks about half of those during the week. That was accounted for by it being primarily a server of specialist information for IT professionals. It's a different server these days, of course - we're in the cloud, with a lot more compute power to handle requests - but the nature of those requests, and indeed of what we serve, has changed.
If you see an address ending in ".html", it is highly unlikely to be an actual file on a disc or in memory somewhere containing hypertext markup language - rather, it will be a program that fills in a template with data from a database, sometimes with significant compute involved, and with those databases holding a long archive which influences the output. The same applies to most of the ".jpg" images we serve.
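As a very rough sketch of what that means in practice, here is a minimal handler in the style of a Python / Flask application - not our actual code; the database name, table and URL are illustrative assumptions - where a ".html" address is answered by filling a template from a database on every request:

```python
# Minimal sketch: a ".html" URL served by a program, not a static file.
# The handler pulls rows from a database and fills in a template each time.
import sqlite3
from flask import Flask, render_template_string

app = Flask(__name__)

PAGE = """<html><body>
<h1>Recent messages</h1>
<ul>{% for m in messages %}<li>{{ m[0] }}: {{ m[1] }}</li>{% endfor %}</ul>
</body></html>"""

@app.route("/forum/recent.html")
def recent():
    # The address ends in ".html", but the content is generated on request,
    # and the cost of generating it grows with the size of the archive.
    with sqlite3.connect("forum.db") as db:
        rows = db.execute(
            "SELECT author, subject FROM messages ORDER BY posted DESC LIMIT 20"
        ).fetchall()
    return render_template_string(PAGE, messages=rows)

if __name__ == "__main__":
    app.run()
```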
Crawlers, bots and spiders - automata which for the most part are the same thing as each other - typically grab a ".html" page, look through it for hypertext links to other .html pages, go off and grab those, and so on until they have grabbed the whole public site. We want them to do so, for the most part - at least for the more useful content. Yes, I want people to be offered Coffee Shop content when they search on Google, or ask Alexa ... I probably want the plagiarism searches that universities use to reveal that a student has copied from what a member wrote in 2016. And I would like chatbots to be informed by our content too.
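For illustration only, a toy crawler along those lines might look like this in Python; the start address is an assumption, and real crawlers add politeness delays, robots.txt checks and so on:

```python
# Toy crawler sketch: fetch a page, pull out the links, and queue any
# same-site ".html" pages it has not seen yet, until it runs out or hits a limit.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, limit=50):
    seen, queue = set(), deque([start])
    host = urlparse(start).netloc
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkParser()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same site and only follow ".html" pages.
            if urlparse(absolute).netloc == host and absolute.endswith(".html"):
                queue.append(absolute)
    return seen

# Example (hypothetical start page):
# crawl("https://www.example.com/index.html")
```

Each of those fetches looks cheap from the crawler's end, but every one of them triggers the database work described above.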
But
* Compute power and storage are much cheaper these days, so there are lots of spiders crawling around - almost an infestation
* The number of different pages (URLs) that we have grows ever larger with our long historic record
* The cost of servicing each individual request grows, because each request has to work through that ever-growing record in the database
* Extra facilities have been added over time
Combined, those four factors multiply together, something like a quadratic or even cube rule, and the increases in data volume are phenomenal. The image database, which was set up because a folder of perhaps 500 pictures was getting a bit big, now serves some 20,000 images using several gigabytes of storage. Our little forum has over a quarter of a million messages on it to be searched through. The new passenger flow system that I put in the other week has some 4 million records to analyse and sort, and there are some 10,000 different pages for the spiders to find, each of which "has to" go through those records and sort the results.
Server load (just one of a number of monitors) from last night - ideally it should be at or below 1 job queuing at any time. The hard black line is yesterday; the coloured lines are previous days.
How do we control the load? Methods include the following (a rough sketch of the first three follows the list):
1. We tell benign crawlers to avoid some areas on the server
2. We identify some crawlers and send them cached (slightly old) data rather than regenerating every time
3. We identify some crawlers and return a "go away"
4. Extra code has been added to eliminate irrelevant data from searches early, so they run more efficiently
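Here is a rough sketch, in the style of a Python / Flask front end, of how methods 1 to 3 can fit together. It is not our actual server code; the user-agent names, disallowed paths and cache lifetime are illustrative assumptions:

```python
# Sketch of methods 1 to 3: robots.txt rules for benign crawlers, cached
# (slightly old) pages for identified crawlers, and a flat "go away" for
# the unwelcome ones.
import time
from flask import Flask, abort, request

app = Flask(__name__)

ROBOTS_TXT = """User-agent: *
Disallow: /members/
Disallow: /search/
"""

CACHED_OK = ("googlebot", "bingbot")     # crawlers given cached pages
BLOCKED = ("badbot", "greedy-scraper")   # crawlers sent a "go away"
CACHE_LIFE = 3600                        # seconds a cached page stays fresh
_cache = {}                              # path -> (timestamp, page)

@app.route("/robots.txt")
def robots():
    # Method 1: ask benign crawlers to avoid parts of the server.
    return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

def build_page(path):
    # Stand-in for the real page generation, which hits the database.
    return f"<html><body>Generated content for {path}</body></html>"

@app.route("/<path:path>")
def page(path):
    agent = request.headers.get("User-Agent", "").lower()
    if any(bot in agent for bot in BLOCKED):
        abort(403)                       # Method 3: return a "go away"
    if any(bot in agent for bot in CACHED_OK):
        # Method 2: serve identified crawlers slightly old cached data
        # rather than regenerating the page on every visit.
        stamp, body = _cache.get(path, (0, None))
        if body is None or time.time() - stamp > CACHE_LIFE:
            body = build_page(path)
            _cache[path] = (time.time(), body)
        return body
    return build_page(path)              # ordinary visitors get fresh pages
```

Method 4 is less visible from the outside: it is about ordering the database work so that most records are discarded cheaply before any expensive sorting happens.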