How do we know when one of our dear rotating-rust servants has given up the ghost?
It's very different from how it used to be, now that we're running what is known as Enterprise Storage. It keeps everything from the notes on how best to preserve pancake batter mix to stuff we're not allowed to disclose - plans for world domination, formulae for the gravitational fields generated by the mass of visual data in a fly's eye, a map of the Queen's antimacassar; none of those things. Things have changed from when data lived on disks directly attached to servers: now we have vast arrays of disks, packing a huge amount of storage into a small floor space in our data centres. Servers that used to fill a room the size of a couple of basketball courts now fit in the space of a slightly cramped tennis court.
Now, if we have a component failure, the system alerts us and our maintenance people by mystic powers (email mainly), and, assuming no further analysis or diagnostics are required, they ship a replacement part immediately. This is a mostly automatic process. Once the part has arrived we make our way to the super-heavy device and swap the part, all completely silently and transparently, without any of the humans using or depending on the service knowing, and with no loss of data.
How it used to be, though, was slightly different.
A disk would fail, as shown in the image below where there's a cross through the disk icon. We would hopefully get an email about the failure. There was an entire system configured to monitor all our kit, using Simple Network Management Protocol (SNMP) to probe servers and then check the response. The response from the server would be checked against the Management Information Base (MIB) for that type of device. The MIB would have been configured in advance to determine what to do depending on the response from the device. If the response matched a particular set of criteria, such as that triggered by a disk failure, as below, then an email would be sent. The monitoring software would also show a visual representation of the device in a degraded state. With luck we would spot the problem and, during working hours only, manually raise a call with a maintenance company for a spare to be sent; we'd then swap the component, hopefully without any impact to users or the service.
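For a flavour of what that poll-and-alert loop amounts to, here is a minimal sketch. It is not the configuration of our actual monitoring product: the host name, OID, community string and email addresses are invented for illustration, and it shells out to the Net-SNMP snmpget command-line tool rather than whatever the real software uses internally.

```python
#!/usr/bin/env python3
"""Illustrative sketch of an SNMP poll-and-alert check.

The host, OID, community string and addresses are made up; a real system
would take them from the device's MIB and its own configuration.
"""
import subprocess
import smtplib
from email.message import EmailMessage

HOST = "disk-array-01.example.org"      # hypothetical device
COMMUNITY = "public"                    # SNMP v2c community string
STATUS_OID = "1.3.6.1.4.1.99999.1.2.0"  # made-up OID for "disk status"
OK_VALUE = "1"                          # value the MIB defines as healthy


def poll_status(host: str, oid: str) -> str:
    """Query one OID with Net-SNMP's snmpget and return the bare value."""
    out = subprocess.run(
        ["snmpget", "-v", "2c", "-c", COMMUNITY, "-Oqv", host, oid],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def send_alert(host: str, value: str) -> None:
    """Email the on-call address when the response doesn't match 'healthy'."""
    msg = EmailMessage()
    msg["Subject"] = f"Disk status degraded on {host} (value={value})"
    msg["From"] = "monitor@example.org"
    msg["To"] = "oncall@example.org"
    msg.set_content("Raise a call with the maintenance company for a spare.")
    with smtplib.SMTP("mail.example.org") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    value = poll_status(HOST, STATUS_OID)
    if value != OK_VALUE:
        send_alert(HOST, value)
```

The point of the sketch is simply the shape of the thing: probe, compare against what the MIB says is healthy, and fire an email if it isn't.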
If a lot of things are being monitored, e.g. service types, service status, CPU utilisation, RAM consumption, disk space, network throughput, etc, then it is likely that a lot of things may enter a degraded state at some point even if only temporarily.
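To see why, here is a toy threshold check of the sort such systems apply across many metrics; the names and limits are invented. Any metric that spikes over its limit, even for a moment, flags the server as degraded.

```python
# Toy per-metric threshold check; metric names and limits are invented.
thresholds = {
    "cpu_percent": 90,
    "ram_percent": 95,
    "disk_used_percent": 85,
    "network_mbps": 900,
}

def degraded(sample: dict) -> list:
    """Return the metrics in this sample that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if sample.get(name, 0) > limit]

# A momentary CPU spike is enough to mark the server as degraded.
print(degraded({"cpu_percent": 97, "ram_percent": 60,
                "disk_used_percent": 40, "network_mbps": 120}))
# -> ['cpu_percent']
```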
If this is scaled up to the point where thousands of servers are being monitored, it becomes unmanageable. There may be many dozens of temporary status changes per hour. Nobody responds to this type of alert any more, as there are many hundreds a day, every day, and they are rarely due to the failure of a component. This is the position we are partially in now.
Eventually you will miss a component failure. Oops. So things gradually get monitored and managed in other, more targeted ways. All of the really important kit is now monitored in a more modern manner, although many things are still monitored in the older way. And there is still a slight reliance on humans to recognise a larger problem from within all the alerts of printers being turned off, switches being restarted, ducks quacking, etc. Fortunately those items are not problems for us, and the owners of those services have their own, modern ways of monitoring.
An old server, the like of which we no longer use...
So why do we keep this old monitoring kit running? It provides some data for statistical analysis and benchmarking, and it has charts and graphs of data throughput that we don't yet have in other, equally accessible ways. However, new methods are coming, and if these new methods can be implemented easily then the days of the old monitoring could be drawing to an end.