Preventing Server Corruption – Video - Stedman Solutions, LLC.

Anyone who’s experienced database corruption knows they do not want it to happen again. Corruption repair is stressful and could require money and downtime you can’t afford. In some cases, there is no “repair” just rebuild. Unfortunately, no database is immune from server corruption, but there are certain alerts and specific practices that when implemented will greatly lessen the impact of corruption. In this short 15-minute video, Steve Stedman and Derrick Bovenkamp will explain what to implement and to have in place incase corruption hits.

Transcription:

Derrick Bovenkamp 0:05
What can we do to prevent corruption?

Steve Stedman 0:07
Yeah, everyone that we’ve helped fix corruption for, I wish they had a time machine where he go back and cover this section two years or five years before they, they got hit with corruption. So really, there’s really no way to completely prevent corruption, there’s always some risk of it happening happening there. But the goals are really to reduce the likelihood of corruption being hit, and increase the chance of survival. So for instance, the example of the computer running under someone’s desk that was impacted by a lightning storm. Well, that could reduce the likelihood simply by having a good battery backup, or even a good off site, backup of the data file on a regular basis. But with everything, there’s the whole goal is how do we reduce the likelihood and increase the chance of survival? Because it, there’s always a chance it’s going to happen?

Derrick Bovenkamp 0:55
Yeah, so I mean, the first thing is, and it’s one of those increase the chance to survive all is regular check dB, you know, as often as possible, I would like to say, as long as your storage can do it, once a day, minimum once a day, sometimes more often than that, but at least once a day, with the exception of some huge databases.

Steve Stedman 1:19
Yep. And I, I’ve heard the example of people have tried to explain that their database is so big, and it is so important to the business to be online and not bogged down at all that they can’t afford to run check DB on it. Well, that may be a scenario that you want to make a backup and run check DB on another server occasionally, or that you but the fact that it’s so important, you can’t do without it. That’s all the reason to be running check dB more often.

Derrick Bovenkamp 1:46
Yeah, for sure. There’s a few clients that we work with that have, you know, databases in the like two terabyte range in their their storage is slow, it takes, you know, 14 hours to run a check dB. That’s, that’s an example. Unfortunately, the customer and they get out there, they’ve already signed to buy new storage, you know, that’s an example of you need faster hardware.

Steve Stedman 2:10
Yep. And I think with that, if you can’t possibly do it every day, well, if you could do half of them every day and half of them or half of every other day and alternate days. That’s one way to do it, too. It’s not as good as every day. But it’s maybe good enough based off of that environment. And there’s even some that I run check DB three or four times a day on some clients, because it’s important data. And they’re small databases. And I want to know about it as soon as it happens. So we have more options to recover and recover quickly. Now, the other key thing with check DB is to make sure that you’re doing alerting on the check DB failures. I don’t know how many times we’ve talked to people and they said, Oh, yeah, I’ve got my check DB job running. And we look and yep, it’s been running for months. And yep, it’s been corrupt for months. And nobody’s looked at the success or failure of the job. And nobody’s nobody’s has any notification. So it could sit there for years and be corrupted, and they wouldn’t know about it. So it’s really important to get email and or text notification on this. Because even if it’s on a weekend, and you can find out about it on a Saturday, and know about it and be able to get it fixed over the weekend. That’s a much better situation than having to deal deal with it Monday morning when everybody’s screaming about it.

Derrick Bovenkamp 3:18
Yeah. And I was just gonna say, you know, the next thing on there is his immediate response, you know, that we put those two things on there, we act as if your job depends on it. And it could be a resume generating event. And, you know, we don’t mean that is, oh, there was corruption, you’re gonna get fired because there was corruption. We mean, that is, there are certain other clients that we’ve repaired corruption for that have told us they would have gone out of business, and we not been able to repair their corruption. Yep. So it may be that everybody in the company is losing their jobs, if you don’t respond to it quick enough.

Steve Stedman 3:53
Absolutely. So some of the alerting that we want to do besides just the failed check DB is on other errors in SQL Server, and you can set up alerting the SQL Server Agent for errors, 823 824 and 825. And these are usually really serious errors. And a lot of the times when we’ve had customers call us with corruption, we’ve looked back and seen that these errors have shown up in their error logs a couple of days or even more, before they got hit with the real problems. And these are usually an indication that something is starting to go really bad with the the communication with desk.

Derrick Bovenkamp 4:31
Yeah, and then beyond those, there’s there’s some other severity 20 through through 25. Alerts, or if you go 19 to 25 is what we typically wanted or you know, there’s a couple false positives in there that happen and false positive meaning like yes, the the alert or the air happened, but it may not be anything to worry about but I’d much rather get notified on on this stuff that I need to know about right away and have, you know, one or two kind of false positives in there, then not know about it.

Steve Stedman 5:09
Yep. And the big key to success at this is to be able to respond immediately when you see these alerts. And it’s one of those things that literally, if you start seeing them on a Friday night, and you say, I’ll deal with them on Monday morning, when I get back in the office, well, that database may be gone by Monday morning. So and the options you have to recover, it may be lesser at that point, as well. So get on it as soon as you can, when these when you see these kinds of problems.

Derrick Bovenkamp 5:33
Alright, the next thing is update. And you know, the old system administrator thing is don’t fix what isn’t broken. But you know, the unfortunate thing of today is, especially with security holes, and those security holes, leading to getting your data, Ransom wired, you know, updates are important. And, you know, we shared with you two examples of two clients that probably could have avoided corruption completely had their firmware on their I scuzzy switches been up to date. Now, that doesn’t mean you need to be on the bleeding edge and updating things day comes out. But it does mean that you need to be you know, get on those notifications, a lot of vendors have notifications, where they’ll email you when updates come out. And some of the really good vendors that only email you, but you can read through it right on the email and decide, is this something I need to do now soon, or I can wait till the next one. And so you know, do those firmware upgrades. That’s, that’s it, you know, your storage, servers, networking, and all of the above Windows updates, the same keep windows up to date that especially helps with ransomware. And SQL server updates, too. You know, there is at least one specific SQL Server version that Steven I will almost refuse to touch unless it’s upgraded from

Steve Stedman 6:55
version. Oh, you mean 10.5? dot 20 510?

Derrick Bovenkamp 6:59
Five dot 2500? Yeah. Yeah, if we ever see that one, then we really get shivers up your spine. And that’s because that that, that that version, that version of 2008 is known to have stability issues. Yep.

Steve Stedman 7:13
So Derek, if the client or the management team doesn’t want to do updates, because they don’t want to have the hassle of downtime? And they say, well, we’ll do updates once a year? How would you respond? Or how would you deal with that?

Derrick Bovenkamp 7:26
So, you know, it’s one of two things, it’s actually a few different things. The first thing I would tell them is that, that not only ups your risk of, you know, security and not getting those updates, that also ups your risk of when you go do the updates, are you going to have a problem? Because you were so far back. And a lot of times, it’s not once a year, it’s many years, right? So far back, you’re jumping so many versions, you may have a problem. And then it’s like, well, what version caused the problem? Oh, yeah, that’s one thing. Another thing is then that makes that downtime window much longer. And the client already doesn’t like downtime, and you’re telling them you need a mini our downtime, instead of a short downtime, I really try to push back. You know, I like to do updates once a month, it seems to be a good cut between, you know, like, every two weeks is too much. It’s too airing on staff, you know, updates once a month seems seems to work out? Well, unless there’s a really urgent one. And, you know, sometimes slipping to, you know, once every other month or once a quarter, depending on the client’s needs.

Steve Stedman 8:38
And you know, with that one thing that actually scares me more to be doing than than fixing corruption is applying updates on a SQL Server that it hasn’t had any windows or SQL updates in four or five years. Because, or maybe has never been rebooted in that time. Because, you know, you don’t know that it’s necessarily going to come back when you when you apply those updates. And that always scares me. So yeah, you know,

Derrick Bovenkamp 9:00
it’s it’s funny, we always used to say like, oh, man, that systems been on for a year, like, look how stable it is. Yeah, and now that’s more like, is it gonna reboot when you reboot it? Exactly. And I’ve certainly seen that in the working world. There was one time we it’s not not for Steadman solutions. But we we took over IT services as medical clinic and they had servers that had been up for, you know, many years, especially with Windows, you’re at that point where will it boot up? Because everybody’s everybody’s worried.

Steve Stedman 9:38
Alright, other ways to reduce impact are backups, good backups, frequent backups, and off site backups with that it’s not just SQL backups, but also virtual machine and system backups. Not just backups, but backups that have been confirmed and tested and tested recently. There’s one One system I work with where we’ve got a, all the backups go off to one server and then that server runs and does a random restore of those backups. And then does a check DB after does the random restore of those and confirms that things are good. And if it ever runs into an issue, it alerts and it even applies transaction logs when it’s doing that. So that’s an example of knowing the backups work, because it’s going to alert us if the backups are broken somehow.

Derrick Bovenkamp 10:23
Yeah, and you know, that’s another good point. And there’s a lot of software that will do that for you. No Veeam will do that for you these days. The other thing I want to drive alone is you don’t have different types of backups. Just because you have Veeam, or Zerto, or some other backup product or Commvault. Backing up your virtual machine, don’t think that that’s good enough for your business case, you don’t really explore it. In most instances, you want those virtual machine backups, and you want SQL Server backups separate on top of that. And that gives you options. And options is what you need anytime you experience, you know any kind of system outage corruption or not. So an

Steve Stedman 11:07
example of that would be if you just had like a daily full virtual machine backup, and then you had SQL Server transaction log backups. If your full backups were happening at like 11pm, but you had corruption at three o’clock in the afternoon, if all you had was the full virtual machine backup that 11pm might be your only option to recover. But with those transaction log backups, you could do a recovery to that 11pm point, and then replay, do the SQL full backup and then replay the transaction logs up to moments before the corruption was introduced. And that may make the difference between losing half a day’s worth of business or not. Now, the importance of that may vary depending on how important your data is. But most people consider their data pretty important.

Derrick Bovenkamp 11:48
Yeah, and part of that testing goes along with I’ll mention it real quick is you know, make sure that you don’t your backup systems aren’t fighting each other. And one of those is if you’re doing separate SQL backups with log backups, you want to make sure that your VM level backups are set to do a copy only and not truncate the transaction logs. Or you may very well feel like you’re getting great backups. And you got to restore them and realize that your transaction log has been stomped on, as Steve calls it, and there is no restoring except to the full.

Steve Stedman 12:20
Yep, so virtualization and storage. I mean, a lot of the times as a DBA, you’re thinking just about the SQL Server, but a lot of other factors go into this. And I know Derek, this is kind of your specialty, not mine.

Derrick Bovenkamp 12:34
Oh, yeah. So you know, we already talked about having notifications set up in SQL Server, you know, get those notifications set up on all your systems, get them set up on your storage, get them set up on your virtualization level, get them set up inside of Windows, and most of the server vendors out there, they’re not only are they have notifications, if you set them up, right where they notify you, if you’re under warranty support, they notify the vendor and you know, there’s one vendor in particular that I’m thinking about I had a hard drive failure a few months ago and drive failed in the middle of the night while I was sleeping automatically re automatically took action use the hot spare rebuild it rebuilt the the raid so there was no data loss and kept me an email. So I woke up to the email and then I even better woke up to another email from the vendor saying hey, your array home phone bad harddrive? Where do you want us to send the replacement drive to you within four hours.

Steve Stedman 13:39
That’s, that’s so much nicer than waking up to just your drive being gone. And data missing? Yeah,

Derrick Bovenkamp 13:45
databasing. So get those and the other thing is, there’s, there’s a lot of, especially at larger companies, we’re you know, maybe you’re not you’re the systems administrator and the DBAs are separate. So you may not have insight into the system. So you know, go take that system administrator out to lunch, you know, become friends with them, make sure that they have insight into the systems. And especially most of the customers I work with, the data is so important. They can’t afford not to have hardware and software support warranties on their system. So you know, there’s there’s use cases that maybe you don’t need the warranty because you have enough redundancy built in and you have other options. But you know if it’s if you’re relying on that system to keep the doors open in your business, and you go out of business, if it goes down have have good warranties on that and have support so you can call them when you are having a problem with that storage array or that server.

Steve Stedman 14:52
And you know, if you’re working with your management team or your finance or your budgeting team about paying for that support, you can use this some of these examples from our presentation here as maybe case points to prove how important it is to have that