Monitoring your servers is a natural part of a DBA's responsibilities and essential for delivering reliable, stable services. But are all SQL monitoring tools created equal?
This article covers the why, what and how of monitoring SQL Server. With this knowledge in hand you are better equipped to decide what you need from a monitoring tool and how to benefit from it.
Just about the first thing you want to do as a DBA is to see the status of the database instances you manage – and that’s where we start with SQL monitoring. With only a handful of instances you may be able to do it by hand, but once the numbers start increasing you will need some kind of tool or solution to help you. And that’s where you face the first question – what kind of monitoring tool should I go for?
There are many SQL monitoring solutions to choose from, or you can roll your own – so let’s talk about what is important, and why, when it comes to monitoring SQL Server.
Monitoring in 5 steps
First of all – why do you want to monitor all those SQL instances? It may sound like a simple question, but let’s look closer at some reasons for monitoring instances:
- Status. The first and most fundamental need: an easy way to see whether every instance is up and running.
- Alerts. There is a large range of things that can go wrong: server down, out of disk space/memory/CPU, lock and wait situations, jobs not running – the list is pretty much endless. You want to be alerted when any of these events occurs. But this is still reactive – you get alerted after the event has occurred.
- Preventive action. You want to be notified before something goes wrong, so you can take preventive action. You will need to set thresholds for resources and performance, and generate a warning when a critical parameter moves towards or beyond its threshold. This gives you a chance to take corrective action before the system degrades or grinds to a halt.
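As a minimal sketch of this idea (the metric names and threshold values below are purely illustrative, not taken from any particular tool), a threshold check can distinguish a warning band, where preventive action is still possible, from a critical band:

```python
# Minimal threshold-check sketch. Metric names and limits are illustrative.
# Each metric gets a warning threshold (take preventive action now) and a
# critical threshold (the system is already degrading).
def evaluate(metric: str, value: float, warn: float, crit: float) -> str:
    """Classify a metric reading as OK, WARNING or CRITICAL."""
    if value >= crit:
        return f"CRITICAL: {metric} at {value} (limit {crit})"
    if value >= warn:
        return f"WARNING: {metric} at {value}, approaching limit {crit}"
    return "OK"

# Example: disk usage at 85% triggers a warning well before space runs out.
print(evaluate("disk_used_pct", 85.0, warn=80.0, crit=95.0))
```

A real tool does the same comparison on a schedule for every monitored parameter; the point is that the warning fires while there is still time to act.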
- Actionable in-depth information. You want detailed, useful information on each instance so you can understand its status and health and take corrective action. This is where we move from data (“something is wrong”) to information (“this is why there is a problem, and what you need to fix”).
- Helicopter overview and workflow integration. You want to see the overall status and health of your entire estate, so you can understand the overall situation and prioritize your work. On larger sites in particular it can be hard to see where your time and attention are most needed. You need information on all instances and the ability to compare them to find where your time is best spent.
What to monitor?
It may be a simple question, but there is a lot you can monitor:
- Scan networks to discover new database servers
- Basic parameters such as availability, CPU, memory, log files and disk usage
- More detailed parameters such as active sessions and transactions, queries, cache hit ratio, logical reads/sec, waits, locks, page reads/sec
- Dynamic views – active jobs, longest-running queries, services, processes.
- Your own custom site- or application-specific parameters. You may have certain parameters you want to watch related to a specific application or local situation.
- Global views of entire estate
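To make the middle items concrete, here is a sketch of how some of those detailed parameters map to the SQL Server dynamic management views (DMVs) that expose them. The queries are deliberately simplified; a real collector would filter and aggregate further, and run them against the server through a driver such as pyodbc:

```python
# Sketch: monitored parameters paired with the SQL Server DMVs that expose
# them. The queries are simplified for illustration.
DMV_QUERIES = {
    "active_sessions": "SELECT COUNT(*) FROM sys.dm_exec_requests",
    "waits": ("SELECT TOP 10 wait_type, wait_time_ms "
              "FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC"),
    "locks": "SELECT COUNT(*) FROM sys.dm_tran_locks",
    "longest_running_queries": (
        "SELECT TOP 5 qs.total_elapsed_time, st.text "
        "FROM sys.dm_exec_query_stats qs "
        "CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st"
    ),
}

for name, query in DMV_QUERIES.items():
    print(f"{name}: {query[:45]}...")
```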
Monitoring collects data. This needs to be stored if it is to be used later for analyzing trends or reporting.
The more data you collect, and the more frequently you collect it, the more space you will need to store it. So you should consider:
- How detailed does the information need to be?
- How frequently do you need to collect it?
- What performance impact can you permit on your server?
- Will the data collection affect your network traffic?
- Where will the data be stored? Locally on each instance or in a central repository on a large SQL server?
- How much server capacity do you need to store all the data?
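A back-of-the-envelope calculation shows how quickly those questions add up. All figures below are assumptions chosen to illustrate the trade-off, not measurements from any real tool:

```python
# Rough repository sizing. Every number here is an illustrative assumption.
instance_count = 100
metrics_per_instance = 50
bytes_per_sample = 64        # timestamp + metric id + value, roughly
interval_seconds = 15        # collection frequency
retention_days = 90

samples_per_day = 86_400 // interval_seconds
daily_bytes = (instance_count * metrics_per_instance
               * bytes_per_sample * samples_per_day)
total_gb = daily_bytes * retention_days / 1024**3
print(f"~{total_gb:.1f} GB for {retention_days} days of history")
```

Doubling the collection frequency or the retention period doubles the repository, which is why the detail/frequency questions above deserve a deliberate answer per parameter.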
When and how frequently to monitor
Monitoring should be done continuously as long as the server is in operation.
Some parameters, like CPU load, should be monitored very frequently – every second, say – to provide a good, detailed profile of the load on the server.
Other parameters like disk space can be monitored much less frequently.
For each parameter we need to find the optimum frequency. Collecting too frequently puts a load on the system being monitored and generates too much data to store. Collecting too rarely leaves the data too coarse to let you analyze and find the cause of a given problem.
It is a question of balancing resource usage against usefulness.
Monitoring solutions can collect a lot of data. Too much data leaves you overwhelmed, and looking for that needle of useful information in the haystack of raw data can be frustrating. Too little can leave you unable to locate issues, or lead you to incorrect conclusions – which can be very costly.
If you collect too much data, you will run into two other problems as well:
- You need to store the data. Too much data means you will build up a huge repository, which may become an administrative and cost challenge in the long run.
- All that data will load up your network, and will slow down your analytical tools when you try to use the data to extract information.
Automation can scan through all that data and notify you only when something is out of the normal and needs your attention.
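One simple way such automation can decide what is "out of the normal" is to compare the latest reading against a baseline built from recent history. This is a sketch of the idea; the 3-sigma rule is an illustrative choice, not a recommendation from any particular product:

```python
from statistics import mean, stdev

# Sketch of "notify only when something is out of the normal": flag a reading
# that deviates from the recent baseline by more than a few standard
# deviations. The 3-sigma default is an illustrative assumption.
def is_anomalous(history: list, latest: float, sigmas: float = 3.0) -> bool:
    baseline = mean(history)
    spread = stdev(history)
    return abs(latest - baseline) > sigmas * spread

cpu_history = [22.0, 25.0, 21.0, 24.0, 23.0, 26.0, 22.0, 24.0]
print(is_anomalous(cpu_history, 95.0))  # a spike to 95% stands out
print(is_anomalous(cpu_history, 24.5))  # within the normal band
```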
You will then need to turn the raw data into useful information that gives you insight into what’s going on and what you can do about it.
What’s important is how you present the data and how you turn it into useful, actionable information:
- Instance analysis. Produce relevant information and insight on a single instance to understand what’s going on and what needs to be done
- Group analysis. If you run multiple instances on a virtual cluster (who doesn’t?) you sometimes need to see all instances on the cluster together to understand what’s affecting the performance of the cluster as a whole.
- Estate analysis. From time to time you should take a bird’s-eye look at your entire estate and see if your resources are optimally allocated. Maybe there are servers and instances that are unused or underutilized, and can be freed up and reallocated to servers and instances that would benefit more from the same CPU/memory/storage resources.
- You should read Optimizing SQL Server operations and Optimizing SQL server licenses for more insight into optimizing the server farm.
- The data also forms the basis for performance tuning your servers. If you run in the Azure cloud, Azure database performance measurements looks into topics unique to cloud databases.
Good monitoring tools should be able to automate most, if not all of this for you.
This may not be obvious, but monitoring itself can be a rather time- and resource-consuming exercise. You should consider:
- What is required to set up the initial monitoring server? Hardware, underlying database, time to configure?
- What is required for installing monitoring on a new server? Agents? Scripts?
- If you later want to apply a change or update to the monitoring on a server (add a new parameter, update software, change a trigger value) – what is required? For one instance? For 50 instances?
- Can you customize monitoring parameters, tasks, reports, frequency, threshold values to fit your requirements?
- How do you handle DBA access? Is it role-based? Can you log and audit what the users are doing?
- Monitoring is only half the equation. How easy is it to go from monitoring an instance to managing or fixing that instance? Can you switch seamlessly from monitoring to managing without changing console or logging in and out? Monitoring will be an essential part of your workflow, so it should integrate with your workflow tools.
- How easy and flexible is it to configure the distribution of alerts and reports? What about distributing to different people or locations depending on the time of day or week?
For large estates you have a few more things to consider:
- How will the monitoring solution perform when you get into hundreds or thousands of instances? This topic is explored in database monitoring in complex networks and performance issues to watch out for when monitoring in complex networks
- With hundreds or thousands of instances – how do the dashboards and views look? Are they still meaningful? Can you sort or prioritize them in meaningful ways, so they still give you useful information?
- Large estates are often split across multiple sites or networks behind various firewalls. How do you monitor across firewalls? Will you need to open a lot of ports, or can you place a (sub) server behind the firewall, monitor locally for each site or network, and then consolidate at a global level?
- With a large estate you will usually need more secure means of access and privileges. Can you configure this easily?
- With a large number of instances, you will need a way to name, group and tag instances in a meaningful manner, so you can search/view on function, location, department, version, platform or any other user-defined property.
Take monitoring to a new level
You can perform monitoring on many levels, from the basic manual methods to fully automated enterprise-class monitoring and integrated management solutions.
Level 1: Basic monitoring
For this level almost anything will work: simple scripts, any infrastructure monitoring platform, even manual checks.
Level 2: Scripted monitoring with basic alerts
For this level, simple monitoring solutions with check-and-alert functionality can work. We also see a lot of roll-your-own scripts used here.
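A typical roll-your-own script at this level boils down to a check-and-alert loop like the skeleton below. The check functions here return canned values for illustration; a real script would query the server (for example via pyodbc) and send alerts by mail or chat instead of printing:

```python
# Skeleton of a roll-your-own check-and-alert loop (Level 2). The checks
# return canned values here; a real script would query the server and route
# alerts to mail/chat. All names and thresholds are illustrative.
CHECKS = {
    "instance_up":   lambda: True,
    "disk_free_pct": lambda: 12.0,
    "log_backup_ok": lambda: True,
}

THRESHOLDS = {"disk_free_pct": 15.0}  # alert when free space drops below 15%

def run_checks():
    alerts = []
    for name, check in CHECKS.items():
        value = check()
        if value is False:
            alerts.append(f"ALERT: {name} failed")
        elif name in THRESHOLDS and value < THRESHOLDS[name]:
            alerts.append(f"ALERT: {name} = {value} below {THRESHOLDS[name]}")
    return alerts

for alert in run_checks():
    print(alert)
```

This works fine for a handful of instances, which is exactly why the next level is where such scripts start to break down.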
Level 3 and up: Advanced monitoring schemes
Here things get more complicated. A lot more parameters are checked, and checked more frequently. Values are compared to thresholds and alerts or warnings generated when required. This can quickly become too complicated for roll-your-own solutions and manual monitoring is no longer realistic.
High-Maintenance DIY Solutions
No article on monitoring is complete without mentioning DIY monitoring scripts and tasks. A lot of DBAs enjoy writing their own utilities, and some are quite good at it. But it comes at a price.
Monitoring can be a simple task, or a complex and exhaustive process for full control over larger complex estates depending on your ambition and needs.
What you decide to implement will depend on the organization’s requirements, your budget and your skills.
If you do decide to invest in a third-party solution, there are many things to consider, as highlighted here. Since you will usually live with an investment in a monitoring solution for a long time, spend some time deciding what you need now (and in the future, as things grow and expand) before you jump in.
The cost of a monitoring solution can be considerable – but so can the benefits.
Look for capabilities for preventive operations, performance management, resource utilization and most important of all – much improved productivity and DBA satisfaction.