September, 2021 - Monitoring Stuff

SCVMM + S2D + SCOM, or: how to traumatize SQL Server…

Sometimes, even proper change hygiene can’t prevent issues. If that surprises you, you haven’t been in IT long enough… and recently, we again saw proof of this.

We use Hyper-V as our virtualization platform, managed by SCVMM and monitored by SCOM, using the SCVMM Management Pack. So far, nothing really weird (although I know several VMware fanatics who would disagree, but let’s skip that part…).

As part of the usual lifecycle management principles, the team responsible for virtualization and storage decided to replace our classic storage environment with a more software-defined approach involving the new Storage Spaces Direct (S2D) functionality in Windows Server 2016. And that is managed through SCVMM.

Now, SCOM is not a very fast application – and in part, that’s because the SQL side of it is not the most efficient application out there. Basically, if we do not see a deadlock a few times a day, we start worrying that something is wrong… and if the slowness gets worse, we usually look at the SQL side first.

And that is also what we did when we started getting complaints that SCOM wasn’t just slow, it was completely freezing. Checking the SQL side we saw plenty of indications of resource hogs. Now, I don’t pretend to be a SQL specialist… I just know some very good resources, like Paul Randal’s fantastic blog on Wait Stats and of course, the massive amount of information from Brent Ozar.

We also filed a support case with Microsoft, and after quite some – very well executed – investigations we concluded that both the resources and the configuration of the SCOM database servers should be OK for the workloads. I did get some good suggestions for fine-tuning, but we were sure there were no dramatic underlying issues with AlwaysOn, disk I/O and the like.

Looking at all the information I had gathered, I couldn’t stop thinking that it must be a previously undiscovered entity – and a big one. And by big, I mean that you’d have a diagram view with thousands of objects. At first, I thought it was our new VDI implementation – if you think of a cluster with several nodes and a few thousand VDIs, that would classify as a big entity. Fortunately, there’s a good way to disable discovery of the resource groups – check out this blog post. We had that override in place for Server 2012 R2, but not yet for Server 2016.

We applied the override, and thought we had fixed the problem. But when we checked the next morning, we apparently hadn’t even come close – SCOM was still pretty much frozen. So our journey went on… I loaded the output of sp_who2 into a temp table to look for the head blocker. I then set a SQL Profiler trace on that process, and there was a query that didn’t fire very often, but when it did, it went to the absolute top in CPU usage. The query mentioned “DiscoverySourceID”, along with a GUID.
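For illustration only, the head-blocker logic can be sketched in Python. This is not the query that was actually run here; it assumes the sp_who2 output has already been exported as rows with the SPID and BlkBy columns. The function name and the sample rows are made up for the example.

```python
def find_head_blockers(rows):
    """Return SPIDs that block other sessions but are not blocked themselves."""
    blocked_by = {}
    for row in rows:
        spid = int(row["SPID"])
        blk = str(row["BlkBy"]).strip()
        # sp_who2 prints a padded dot in BlkBy when a session is not blocked.
        blocker = int(blk) if blk.isdigit() else None
        # A session can report itself as its blocker; treat that as not blocked.
        blocked_by[spid] = None if blocker == spid else blocker

    blockers = {b for b in blocked_by.values() if b is not None}
    return sorted(b for b in blockers if blocked_by.get(b) is None)


# Hypothetical sample rows, not taken from the real environment.
sample = [
    {"SPID": 51, "BlkBy": "  ."},
    {"SPID": 52, "BlkBy": "51"},
    {"SPID": 53, "BlkBy": "52"},
    {"SPID": 60, "BlkBy": "  ."},
]
print(find_head_blockers(sample))
```

In these made-up rows, session 53 waits on 52, which in turn waits on 51, so 51 is the only head blocker and the one worth tracing.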

Ever heard of “letting sleeping dogs lie”? Well, there’s a dog in the SCVMM Management Pack. It’s a discovery, called Microsoft.SystemCenter.VirtualMachineManager.Storage.2016.Discovery.Discovery. (Yes, it actually has the word “Discovery” in it twice). And it’s not vicious, it’s downright rabid… think Cujo.

The GUID that I found in the query led me to the discovery, and the trace showed me the Management Server that the discovery was running on. It fires a PowerShell script that logs event 108. In the OperationsManager event log on the Management Server, I then found this:

Whoa.

Hang on.

20,000 objects? 24,000 relationships?? How many group memberships would that involve? With the recursive memberships??? Like I said, think Cujo… this was a downright horror movie for SQL Server. When that much info is dumped into the OperationsManager database, SQL Server probably simply got traumatized.

Looking through the script, it apparently discovers all storage objects in VMM. And all of their relationships. For more info, check the systemcenter.wiki page…

We disabled the discovery, and then ran Remove-SCOMDisabledClassInstance several times over the course of a few days. Pretty soon, SQL was getting happy again…

Through the Microsoft Support case, this was identified as a bug and registered as such. I hope to be able to update this post soon with a fix…

Keep monitoring!

SCOM Resource Pools, part 1: how & why to create one

Creating a SCOM Management Pack is not the hardest thing in the world, as long as you want to monitor something on a single box. There’s an excellent course by Brian Wren on Channel9, and Kevin Holman has made life a whole lot easier by publishing his famous VSAE fragments.

But what if you want to monitor larger entities, like firewall clusters or storage arrays? Preferably without a single point of failure? Well, SCOM has a solution for this: Resource Pools. Kevin has written a great blog post on these, detailing how to populate them and what the requirements are if you want high availability.

Some time ago, we started getting complaints from one of our systems management teams. We have quite a large number of Web Application Availability Monitoring configurations active, and these were using the agents on two dedicated servers. One of these was experiencing problems, causing false alerts. We ended up reinstalling it, but that left us with something I despise: reconfiguring all the web monitors, manually.

That started me thinking about the Resource Pools. What if we could move the web monitors to a Resource Pool? Better yet, what if we could set up our own Resource Pool – and use that to run workflows on as we saw fit? Would that be feasible?

It took some time, but yes – that will work. It turns out that creating your own Resource Pool isn’t even that hard, and using the PowerShell scripts from Kevin’s blog you can populate it with any Management Servers you want to use for this purpose. It’s even possible to include agents – but please keep in mind, Microsoft does not support using agents in a Resource Pool. That being said, we have this exact setup running – without any problems so far.

The XML is quite simple, if you know how to read SCOM MPs. For obvious reasons, it contains a class for the Resource Pool:

<ClassType ID="MoSt.Custom.ResourcePool.Class" Base="SC!Microsoft.SystemCenter.ManagementServicePool" Accessibility="Public" Abstract="false" Hosted="false" Singleton="true"></ClassType>

There is, however, no way to discover a Resource Pool. The discovery is done by discovering the Resource Pool Watcher:

<ConditionDetection ID="Mapper" TypeID="System!System.Discovery.ClassSnapshotDataMapper">
  <ClassId>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServicePoolWatcher"]$</ClassId>
  <InstanceSettings>
    <Settings>
      <Setting>
        <Name>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServicePoolWatcher"]/PoolId$</Name>
        <Value>$Target/Id$</Value>
      </Setting>
      <Setting>
        <Name>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServiceRuntimePool"]/Name$</Name>
        <Value>Custom Resource Pool by MonitoringStuff</Value>
      </Setting>
      <Setting>
        <Name>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServicePoolWatcher"]/PoolName$</Name>
        <Value>$Target/Property[Type="System!System.Entity"]/DisplayName$</Value>
      </Setting>
      <Setting>
        <Name>$MPElement[Name="System!System.Entity"]/DisplayName$</Name>
        <Value>$Target/Property[Type="System!System.Entity"]/DisplayName$ Watcher</Value>
      </Setting>
    </Settings>
  </InstanceSettings>
</ConditionDetection>
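If reading MP XML is not second nature, it can help to flatten the mapper into a plain list of "which property gets which value". The Python sketch below does that for a shortened copy of the fragment above. The function name list_settings and the regular expression are purely illustrative; they are not part of SCOM or of any MP tooling.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the mapper fragment shown above (two of the four settings).
MAPPER_XML = """
<ConditionDetection ID="Mapper" TypeID="System!System.Discovery.ClassSnapshotDataMapper">
  <ClassId>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServicePoolWatcher"]$</ClassId>
  <InstanceSettings>
    <Settings>
      <Setting>
        <Name>$MPElement[Name="SC!Microsoft.SystemCenter.ManagementServicePoolWatcher"]/PoolId$</Name>
        <Value>$Target/Id$</Value>
      </Setting>
      <Setting>
        <Name>$MPElement[Name="System!System.Entity"]/DisplayName$</Name>
        <Value>$Target/Property[Type="System!System.Entity"]/DisplayName$ Watcher</Value>
      </Setting>
    </Settings>
  </InstanceSettings>
</ConditionDetection>
"""

# Matches $MPElement[Name="Alias!Class"]/Property$ and captures class and property.
MP_ELEMENT = re.compile(r'^\$MPElement\[Name="([^"]+)"\]/([^$]+)\$$')


def list_settings(xml_text):
    """Return (class reference, property name, value) for every Setting."""
    root = ET.fromstring(xml_text)
    result = []
    for setting in root.iter("Setting"):
        name = setting.findtext("Name", default="").strip()
        value = setting.findtext("Value", default="").strip()
        match = MP_ELEMENT.match(name)
        if match:
            class_ref, prop = match.groups()
        else:
            # Leave unrecognized names untouched rather than guessing.
            class_ref, prop = None, name
        result.append((class_ref, prop, value))
    return result


for class_ref, prop, value in list_settings(MAPPER_XML):
    print(f"{class_ref} :: {prop} = {value}")
```

Read this way, the mapper is just a handful of assignments: the watcher's PoolId comes from the target's Id, and its display name is the target's display name with " Watcher" appended.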

With this XML, you should be able to create a SCOM Resource Pool to provide highly available monitoring; the complete file is available on GitHub. However, you might want to wait – I’m working on a complete, sealed MP with some classes that are targeted against this Resource Pool, ready to use.

Keep monitoring!

Kicking things off…

Well, here we are. I’ve been working on IT monitoring for about 20 years now, and I’ve been thinking about starting a blog for quite a few of those. Over the years I built a decent amount of experience – and in my opinion, Oscar Wilde was wrong. Experience is much more than just the name we give to our mistakes. At least for this blog, I figured I’d include not just the challenges, problems and mistakes – you’ll probably be more interested in what I did to get out of them. Hopefully it will help you get things done as well.

I’m not sure how regularly I’ll be posting; you’ll just have to monitor this blog for updates… (yes, pun fully intended).

And in case you’re wondering: yes, it’s still a work in progress. As you may have noticed, I went for a pretty old theme – one that has a sidebar, but does not have a zillion options to configure. The choice was basically made for lack of time to find a better option. With the number of themes available, it has become impossible to see the forest for the trees (and by the way, that English saying actually translates to Dutch word for word with the same meaning: “door de bomen het bos niet meer zien”).

So if you have any suggestions, please let me know! And that goes for the layout just as well as for the content. The purpose of this blog is to share experiences – and that goes to pieces if my message doesn’t come across.

And one final note: I try my best to provide accurate and useful information, but I’m by no means perfect, nor do I have the resources to thoroughly test and research every eventuality. Therefore, feel free to use any information – but keep in mind that you do so at your own risk. I cannot be held responsible or liable for any issues that may arise out of your use of anything you find in this blog.

Enjoy!