Maintenance refers to the ‘technical support’ of your systems, covering both hardware and software. The areas you should cover are:
- Infrastructure
- Performance
- Monitoring
- Alerts
Let’s look at these in detail.
Infrastructure refers to the platform on which your product and service are built, plus any additional equipment they need. For a B2B mobile app, the infrastructure could include:
- A code repository (e.g. GitHub)
- An application platform (e.g. Heroku)
- A database server (e.g. Amazon Web Services)
- Equipment (e.g. a smartphone)
Designing the application infrastructure is usually the responsibility of the Software Architect. Maintaining the infrastructure is usually the responsibility of a systems or ‘devops’ engineer.
Review your infrastructure periodically. Compare your customers’ usage and your system’s performance against the limits of your current platform, so that you can anticipate when an upgrade will be necessary and act before the system crashes on your customers.
When it comes to performance, you want to assess the stability and efficiency of your product’s operations. Two common metrics to look at are your application’s uptime and run-time. Uptime is the percentage of time your application is available to your customers. Run-time is the time required to perform an operation.
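To make the two metrics concrete, here is a minimal sketch in Python. It assumes you already collect periodic health checks as a list of pass/fail results; the function names `uptime_percent` and `timed` are made up for illustration and do not come from any particular monitoring tool.

```python
import time


def uptime_percent(check_results):
    """Percentage of health checks that succeeded (True = available)."""
    if not check_results:
        raise ValueError("need at least one health check result")
    successes = sum(1 for ok in check_results if ok)
    return 100.0 * successes / len(check_results)


def timed(operation, *args, **kwargs):
    """Run an operation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical data: 9 successful health checks and 1 failed one.
checks = [True] * 9 + [False]
print(uptime_percent(checks))  # 90.0

# Measure the run-time of an arbitrary operation (here: summing numbers).
total, seconds = timed(sum, range(1000))
print(total, seconds)
```

In practice a monitoring tool would record these numbers for you; the point is only that uptime is a ratio over time and run-time is a duration per operation.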
Uptime and run-time standards depend on the product and service you are delivering to your customers.
If you provide financial software to a bank, your uptime standard during operating hours must be extremely high, on the order of 99.99%. You can’t have your system ‘temporarily down for maintenance’ while the stock market is open. Any unforeseen downtime would be a critical loss for customers.
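It helps to translate an uptime target into the amount of downtime it actually allows. The sketch below does that arithmetic; the function name `allowed_downtime_minutes` and the periods chosen (a full year and a 30-day month of round-the-clock service) are assumptions for illustration.

```python
def allowed_downtime_minutes(uptime_target_percent, period_hours):
    """Minutes of downtime permitted in a period for a given uptime target."""
    if not 0 <= uptime_target_percent <= 100:
        raise ValueError("uptime target must be between 0 and 100")
    return period_hours * 60 * (1 - uptime_target_percent / 100)


# 99.99% over a 365-day year of 24/7 service.
print(round(allowed_downtime_minutes(99.99, 365 * 24), 2))  # 52.56

# 99.99% over a 30-day month of 24/7 service.
print(round(allowed_downtime_minutes(99.99, 30 * 24), 2))   # 4.32
```

At 99.99%, a full year leaves room for less than an hour of downtime, which is why planned maintenance has to happen outside the customer’s operating hours.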
Let’s say your customer’s desired outcome depends on timely access to information. Most likely, your product will be retrieving data in real time from a data source. In this case, you should keep a close eye on the run-time of your product’s operations. Delays in retrieving or processing data can strain your customer’s experience, rendering the product or service useless at the moment your customer needs it the most.
Performance measures are most effectively implemented using an automated monitoring tool that can log and visually display the performance of your system.
The goal of ‘monitoring’ is control. What you’re looking to ensure is that your system is performing in line with its expected behaviour.
Basically, you want to implement some sort of automated monitoring that keeps operations within an accepted realm of control. Useful monitoring tools include dashboards, control charts, and trend graphs.
The desired outcome of monitoring is to ‘keep an eye’ on the system and confirm that everything is functioning as planned. To do this, define levels of ‘tolerance’, an accepted deviation from the norm, and proactively take corrective measures when the system drifts outside them.
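One common way to express a tolerance is the control-chart approach: take a baseline of normal measurements and flag anything that falls more than a few standard deviations from the mean. The sketch below assumes response times in milliseconds and a three-sigma band; the numbers and function names are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper tolerance limits: mean +/- sigmas * standard deviation."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - sigmas * spread, centre + sigmas * spread


def out_of_tolerance(samples, lower, upper):
    """Samples that fall outside the accepted band."""
    return [s for s in samples if s < lower or s > upper]


# Hypothetical baseline of response times (ms) under normal operation.
baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
lower, upper = control_limits(baseline_ms)

# New measurements: only the 140 ms sample is outside the band.
print(out_of_tolerance([101, 99, 140, 100], lower, upper))  # [140]
```

The width of the band is a judgment call: a tighter band catches problems sooner but flags more normal variation.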
Alerts are the last resort in a system’s maintenance toolkit. Assuming you’ve developed your system’s architecture, defined the relevant performance metrics for your customer, and implemented a monitoring tool, the last thing to do is configure alerting protocols that inform you when a system fails.
Avoid alerts that are purely informational. These types of inputs should be kept for monitoring dashboards. Use alerts to trigger a predefined action. The worst thing you can do is create ‘noise’ in your alerting system, making your team ignore these alerts.
Alerts should follow this progression:
IF (condition) → DO (Action) → SO THAT (Customer Outcome to be achieved)
An example of a useful alert:
- If: Server is down
- Do: Inform system engineer that server is down and must be restarted
- So that: The engineer can restart the server and your customer can continue using your product.
An example of a useless alert:
- If: System is up and running
- Do: Inform engineer that systems continue to function properly
- So that: _____
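The IF → DO → SO THAT progression can be sketched as a small data structure. This is an illustration only: the `AlertRule` class, the `evaluate` function, and the `server_up` status key are hypothetical names, not part of any alerting product. The sketch drops any rule that has no action or no customer outcome, which is exactly what makes the second example above useless.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    condition: Callable[[Dict], bool]  # IF
    action: str                        # DO
    outcome: str                       # SO THAT

    def is_actionable(self) -> bool:
        """A rule is worth alerting on only if it has an action and an outcome."""
        return bool(self.action.strip()) and bool(self.outcome.strip())


def evaluate(rules: List[AlertRule], status: Dict) -> List[str]:
    """Return the actions triggered by the current system status."""
    return [
        f"{rule.name}: {rule.action}"
        for rule in rules
        if rule.is_actionable() and rule.condition(status)
    ]


rules = [
    AlertRule(
        name="server_down",
        condition=lambda s: not s["server_up"],
        action="inform the systems engineer that the server must be restarted",
        outcome="the customer can continue using the product",
    ),
    AlertRule(
        name="server_up",
        condition=lambda s: s["server_up"],
        action="inform the engineer that systems continue to function",
        outcome="",  # no customer outcome, so this rule is never sent
    ),
]

print(evaluate(rules, {"server_up": False}))
print(evaluate(rules, {"server_up": True}))  # []
```

Purely informational signals like the second rule belong on a dashboard rather than in the alert stream.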
The goal of this post was to lay out some of the building blocks of maintenance and reliability engineering for your software products. Designing a system architecture that supports the growth of your service, your user base, and your company is key to unlocking scalability!
In the next post, I’ll finish off this segment of supporting your customers by introducing the important function of Account Management. Stay tuned!
Interested in more?
- For more on the topic of Alerting: Conflating the roles of Alerts and Dashboards
- An incredible resource on designing scalable and reliable systems for millions of users: Google’s Site Reliability Engineering Book