Software bugs cause unexpected problems at every company.
Some problems are small. A website goes down in the middle of the night, and the outage triggers a phone call to an engineer who has to wake up and fix the problem. Other problems can be significantly larger. A major problem can cause millions of dollars in losses and require hours of work to fix.
When software unexpectedly breaks, it is called an incident. To triage these incidents, an engineer uses a combination of tools, including Slack, GitHub, cloud providers, and continuous deployment systems. These tools emit updates that can be received by an incident response platform, which centralizes the information the on-call engineer needs to work through the incident more easily.
On-call rotation means that different people are responsible for different incidents. When an incident happens, the engineer who is currently on call may not be aware that a similar incident happened last week. It might be easier for that engineer to triage the issue if they had insight into how the earlier incident was handled.
Chris Riley is a DevOps advocate with Splunk. He joins the show to discuss the application of machine learning to incident response. We discuss the different data points that are created during an incident, and how that data can be used to build models for different types of incidents. Those models can generate information that helps the engineer respond appropriately. Full disclosure: Splunk is a sponsor of Software Engineering Daily.
Sponsorship inquiries: firstname.lastname@example.org
The post Incident Response Machine Learning with Chris Riley appeared first on Software Engineering Daily.