- Host: Steve Stedman
- Guest: Derrick Bovenkamp
- Recording Date: November 27, 2024
- Topic: Database Corruption
Stedman SQL Podcast Episode 9 – Corruption
Steve Stedman and Derek Bovenkamp discuss SQL Server database corruption, its causes, symptoms, and prevention strategies. Corruption often results from IO issues, such as firmware problems with iSCSI or antivirus software misconfigurations. Symptoms include failed queries and check DB errors. They emphasize the importance of regular check DBs, ideally daily, and testing backups. Effective solutions include restoring from backups or engaging experts like Stedman Solutions. They highlight the risks of ignoring corruption, including potential data loss and compliance issues. Best practices include using OLA Halogen scripts, maintaining up-to-date hardware, and staying current with SQL Server updates.
Podcast Transcript
Steve Stedman 0:15 Hey everyone, welcome to the Stedman sequel podcast. This is episode number nine, and I’m your host, Steve Stedman, want to point out some places where the podcast is available, available on YouTube, and we’ve got a short link there. Stedman.us/podcast YouTube. We’re also available on Spotify. Stedman.us/podcast Spotify. Or you can visit our main podcast page at stedman.us/podcast now, the main podcast page now has a new section for guests. If you want to be a guest on the show, you can visit that page and fill out a form and pitch what you want to be a guest to talk about. We’ll consider it and consider having you on the podcast. One other thing we want to point out this is our last week that we’re actually doing the live stream while we while we do the recording going forward, we’re going to record ahead of time, edit it a little bit, and then publish it. We’ll still be publishing on about a weekly basis, but what this will do is will allow us to maybe edit out some of those goofs and errors that we make along the way, and they can sound a little bit better overall. Some news from Stedman solutions. There’s a new database health monitor release coming soon, in the next week or so, it has several new exciting features, including a really cool virtual log file visualization report. We also have a SQL schema search where you can search the contents of stored procedures and functions and tables and things like that to look for specific data and several other new reports. All right, this week’s podcast is SQL Server database corruption, and I’d like to welcome Derek Bovenkamp as our guest this week. Derek is someone who’s helped me with almost all the corruption repairs that we’ve done over the last eight years or so with Stedman solutions. Derek is the most knowledgeable person besides me on database corruption at Stedman solutions, and he’s always looking forward to that next repair of a corrupt database. Hey, Derek, welcome.
Derrick Bovenkamp 2:22 Hey, thanks for having me on Steve. I look forward to it because it’s always a good topic when we can teach people about corruption repair and how to prevent corruption in the first place.
Steve Stedman 2:34 Yep, Yep, absolutely. And I think that I would rather do what we can to help people not end up with a corrupt database, or have options to quickly fix it, versus having to do a full corruption repair. So all right, well, let’s jump into it. Then the first thing we wanted to cover was, what exactly is database corruption? What is it? What causes it?
Derrick Bovenkamp 2:59 Well, yeah, that’s a great question. I mean, I would say it’s when the pages that are written to disk either aren’t written or aren’t read back or stored on disk not in the way that they should be, and that can cause all kinds of issues from a database that doesn’t work at all, or more frequently, what you and I see is certain tables are information in certain tables that you can’t pull, or it gets pulled in a funny way because of the corruption.
Steve Stedman 3:31 Yep. And as you mentioned, pages there, your entire SQL Server database is made up of a bunch of 8k chunks, and these chunks are called pages, and everything that’s stored in every table is stored in 8k pages on disk. And when something goes bad on writing that page or reading that page, you end up with something that appears to be corrupt. So common causes of this, I think the biggest thing we see is that it’s related to IO, where something happens that impacts the IO. So what kind of things have you seen Derek, that have caused IO to go bad in some way?
Derrick Bovenkamp 4:11 Oh, you know, we’ve seen lots of things, and, you know, I think we’ll dig into it later in this discussion. But you and I were talking about this last night, and pretty much every time we’ve seen corruption, if not every time you and I have dealt with corruption, it’s always been something related to IO. And the one that comes to mind is one that we’ve actually seen three or four times together. Is something has gone bad with the i scuzzy config between the host i scuzzy in the sand that it’s talking to and has caused something to write bad to the sand. And that’s where we’ve seen it the first time. And I think in two of those instances, it was a firmware issue with the i scuzzy switches that we were able. Repair the corruption and then have them upgrade the firmware on their switch and not not have the issue again.
Steve Stedman 5:06 what’s what’s interesting on that is, if you have local storage, your database talks to your local storage controller, and then that talks directly to your discs and saves things locally. But when you’re using something like I scuzzy, that’s talking to a net drive interface that’s then mapped across the network, traveling through your local network card, your network wiring to a switch, and then from that switch off to a storage array, and then from that storage array to the specific disks in the race there’s a lot more steps in that process for something that could go wrong than if you have direct attached storage.
Derrick Bovenkamp 5:43 Yeah, and I’m not sure to share a quick story. I’m not, not sure if you remember, but we’ve actually seen it at two different customers, both i scuzzy, and it was, it was not actually corruption. The database would read back as corrupt. But every time we would check it in check dB, which we’ll talk about later, a different part of the database would be corrupt than it was five minutes ago. And if you remember that one, it wasn’t specifically corruption, but we were able to fix it. It was also a nice, busy issue. We’re able to fix it by moving the virtual machine to another host, and then all of a sudden, the database was fine,
Steve Stedman 6:22 yeah. And with that one, the disk itself was never corrupt, but it appeared corrupt because of the read errors we were getting at that point. So we were lucky. It was writing correctly.
Derrick Bovenkamp 6:34 it was writing correctly. And I think the key here that you’re going to hear is Steve and I hammer multiple times during this discussion, is IO? It all comes down to IO. And is your IO? Are your storage devices? Well, in hand is firmer up today, is it? Is it configured the way it’s supposed to be? That’s your most common cause of corruption. So Steve, as we move on here. What are the symptoms? What symptoms would people see when they have corruption?
Steve Stedman 7:09 gosh, that’s there’s a lot of different symptoms that people would see. I mean, sometimes people see it where they just rebooted their SQL server and the database doesn’t come back online. Other times we’ve seen it where they run a query and halfway through the result set, it hits a bad page and the query fails with some type of a message, usually about something related to IO or consistency. Sometimes, sometimes you don’t even notice it until you do a check dB and it shows that there’s something corrupt somewhere in that data file might be corrupt in some part of it, or some table that you don’t use very often, that you might not even notice for a few months.
Derrick Bovenkamp 7:49 Yeah, and I think you and I have also seen where the customer doesn’t even notice the corruption, because they would only notice it if they hit a certain record. So maybe it’s a payroll software, and it’s only if you pulled that very specific payroll from three years ago you notice the corruption, or you and I have dealt with multiple times electronic medical record software, and it’s only if you access that specific document for that specific patient you see the corruption, but then the rest of the time, everything looks like it’s fine when it really isn’t. And I think the big, big topic that that leads us into is is check dB, and is check DB being run differently, or, sorry, is check DB being run often?
Steve Stedman 8:44 Yeah, you know, that’s one of those, I don’t know, challenging conversations we sometimes have with clients where some some get it and they want to run check TV as frequently as we recommend. Others, they’re like, well, we we don’t have enough IO or we don’t have enough processor power to run check DB very often, and they want to run it like once a month, and it’s just, it’s just too risky to do it that way. I mean, we recommend when you can run check DB every single day. We even have a couple scenarios where we run it twice a day, but most of the time once a day is reasonable. Where we run check DB on all the databases on your SQL server once a day. But not only do we run check dB, but we make sure that notification is set up. If that job fails, that we get notified that there’s something wrong, and we have additional notifications set up so that if the database does not get checked, meaning it failed or crashed somewhere along the way and worded out without an error that we know about it.
Derrick Bovenkamp 9:38 and I’m not sure what’s worse, I know you and I have both had that sinking feeling where you either see that check DB has been running and failing and they didn’t notice it, or we also get the customers that you sign on and you see check DB has never been run, or it’s been run five years ago. So the real key there is to run check. DB often, ideally daily. There is some instances where the backup solution that’s in place, or the database is so big, you know, we talk about weekly, but I would say daily is certainly the gold standard that I like to do. I like to apply to customer servers and my own SQL servers,
Steve Stedman 10:20 and I know whether we’re starting out with a performance assessment or a SQL Server health assessment, or we’re doing managed service customer I know one of the first things you always do when we jump on a server before doing anything else, is we run check db to make sure that all those databases are clean before we ever start on it. And that way we know that the database is in good shape before we start trying to do anything to make it better.
Derrick Bovenkamp 10:45 exactly. Why don’t we dig into a little bit, you know, Steve, how do we like to run check DB on customer systems?
Steve Stedman 10:56 Well, we use the OLA halogen scripts to help with this. And by default, that’s set up with two different jobs, primarily, one is the user databases, and one is the system databases. And the system databases, one almost always runs quickly, and then we get in the user database one, and we want to make sure that we run that after hours, during a low load time. We try and avoid running that when things like index rebuilding is going on, or other things that would cause a lot of IO we try and do it in the slow time, and we try and check, I mean, we get every database checked every single day. Now, there’s some times that we’ll do that and find out that you might have 15 databases, and one of those databases is gigantic and the rest of them are small. Sometimes what we’ll do with that is break that one big database out into a check DB job of its own, and then exclude that one on the other check so that we can do a little bit more parallelism there.
Derrick Bovenkamp 11:51 but I think, I think the big key there is we like that user databases, or the user database is minus if you’re breaking it off into your own and that’s because when the customer adds another database two years down the road, it’s automatically going to get checked every night, versus some of the other ways to set you know, if you set it up manually, you built your own job. Maybe you’re specifying the database names, and you add another database, and you think you’re running check DBM on your databases, but you’re really not. You’re missing, missing the new one.
Steve Stedman 12:25 that’s one of those things that if you’re using the default, what’s the default maintenance tasks in SQL Server? It’s pretty common that people will set that up. You just pick, let’s check all these databases, then you add a database later, and it gets missed. So we really like to avoid the default maintenance tasks and either use the OLA halogen solution, which runs as a job, or if, in some scenarios where it’s too big to actually check every single day, we do a custom job that’ll just go and look at the ones that haven’t been checked recently and run those and try and get through about half of them every single day. We have several scenarios. We’ve done that, but to do it as a SQL Server job, as a SQL Server Agent job, rather than as a maintenance SQL Server schedule, maintenance task.
Derrick Bovenkamp 13:16 great.
Steve Stedman 13:20 So then, what do you think about backups as a protection for corruption?
Derrick Bovenkamp Oh, I mean, you know, back backups are the world, you know, in every instance you know, where we’ve ran into corruption, and we’ve had a lot of times we have good Well, I think you and I have had any database corruption job that we’ve actually taken on. We’ve had good outcomes, and every single one of them, but the best outcome has always come from being able to just go to your backups, and you avoid a very expensive, costly corruption repair by doing a restore your backups. And so I, I really, really believe that backups are important and making sure that your recovery point objectives and your recovery time objectives are met with your backup strategy.
Steve Stedman 14:15 Yep, and shameless plug here for a moment. Derrick just mentioned recovery point and recovery time objectives. Next week’s podcast is actually talking about recovery, point recovery time. So if you want more information on those, come back. That’s the end of my shameless plug. One of the things I want to point out here is that those backups, I mean, if you have corruption, oftentimes a full backup will back up that corruption. But typically, the log backups will not back up the corrupt part because it’s backing up the actual transaction that occurred. And we’ve been able to see that you I mean, some people just say, well, corruption happened. Let’s just go restore the last full backup and lose a day or two’s worth of data. It, and you’re back up and running. But off what we’ve done in a few scenarios is we’ve restored the last full backup. We’ve restored a differential backup on top of that before the corruption hit, and then replayed the transaction logs through to a point in time to go past where the corruption occurred, and we’ve been able to get a complete recovery done that way. That’s we’ve done that with some pretty big databases, greater than a terabyte in size.
Derrick Bovenkamp 15:23 yeah, I think you remember the same story that I I’m remembering, and think that database was pushing two terabytes, and we were, we were able to get the customer up within the matter of a couple of hours by taking determining there’s corruption, and then being able to do a restore, taking the full backup that occurred before the corruption, and then play those log backups all the way forward, because the log backup did not have the corruption in it. In that case, there was zero data lost. So perfect point in time recovery. It just took us. Took us a couple hours to get it done. But again, we did that in a matter of hours, and that was, you know, very low cost to the customer versus if that had been a full on corruption repair, you know, you’re probably talking at least overnight, and very expensive to the customer in addition to the outage.
Steve Stedman 16:27 yup and that’s one of those things that some of the more time consuming and the more expensive corruption repairs that we’ve done have been ones where somebody hits corruption, we talk to them about their backups, and we say, well, what? What’s your most recent backup before the corruption occurred. And I said, Well, we don’t even know when the corruption occurred, but our last full backup we had was eight months ago, and our backup system has been failing ever since. So we go check that full backup and we find out, Oh, gee, the corruption was there then it just didn’t get quite as bad or noticeable until now. And those are the ones where we have to repair with no backups. And that can be the that can be quite a challenge, although we’re very successful at it as well.
Derrick Bovenkamp 17:07 Yeah, so we’ve really had some of the most common causes of corruption already. You know, you’re going to hear me say again, it’s Io, Io, Io. And and the one that you that the story that we just talked about a second ago that was not i scuzzy, like we previously talked about, that was actually due to antivirus software that didn’t have the proper configuration in it. And there is a Microsoft has very specific recommendations for antivirus that almost nobody follows. But I really encourage people go look and you know, maybe we’ll get it we’ll get it linked. We’ll get it linked in the comments of this, the link to the Microsoft document that you should follow when you set up a SQL Server and make sure you get those exclusions in place, because in a very active SQL Server, it is possible for the antivirus to get in the way of how SQL Server is running the desk and reduce corruption.
Steve Stedman 18:13 Yep, and that’s one of those things that when we’ve seen that happen the first couple of times, we didn’t realize it was the antivirus right away, and we fixed it, and then we got it working again, and then it hit again the next week, I think, and we had to go.
Derrick Bovenkamp 18:26 we were able to fix it and not have a problem after that. And I really see that. I mean, Steve, what is the percentage of people you see that actually add those antivirus exclusion rules? Maybe 2% customers, almost nobody. So it’s not often, but we have seen it have a problem.
Steve Stedman 18:48 Absolutely, another cause has been power fluctuation that has impacted the IO in some way. Either a server gets unplugged, or we had one of my very first corruption repairs we ever did, I think might have been before you got involved, Eric, but the very first one was a lightning storm. Their SQL Server was running on, not even on a UPS of any kind of just running on a power strip underneath the desk. Lightning storm hit the power lines. Power Surge came through, fried the power strip shut SQL Server off midway through writing, I guess, and we ended up with a crop database because of that. So that’s that’s one thing that can cause it. I think, IO issues when we’re talking about network, attached storage of any kind, making sure that your network is solid, making sure that your switches and your hardware and all that are up to date on right patch level, and making sure that you’re using Enterprise quality storage rather than, I mean, don’t expect your database to be solid if you plug in a USB drive and expect that to be your database storage.
Derrick Bovenkamp 19:54 So Steve, one of the things that we see people do when they encounter Correct. Option, and they see an error in their SQL server log, or they realize they have corruption, they go to Google, and what’s one of the first things that comes up in Google to fix corruption? What does Google say?
Steve Stedman 20:12 Oh, well, you got a lot of pages where people are suggesting you run repair, allow data loss, and if you go, don’t you on your Derrick Bovenkamp blog page, have like a giant red alert button or something like that. Repair allow data loss.
Derrick Bovenkamp 20:30 I’ve got the Apollo 13 master caution alarm that’s shaking around a parallel data loss.
Steve Stedman 20:37 So what repair allow data loss does is it basically says, scan my database and any page or any 8k chunk of that database that looks corrupt, just throw it out. It’s sort of like if you spilt coffee on a book and you went through in every book page that had some coffee spilt on it, you’re going to throw away, and then you try and read the book later. It’s not going to make any sense. But what repair allow data loss does, and some people phrase it as well, it’ll get rid of data if it needs to no it’ll get rid of data if it sees corruption at all, it’ll just throw it away. And one of the things we see, I mean, how many times have we seen Derek where somebody says, Well, I ran repair allow data loss, it didn’t fix it, but it got rid of some of the the errors, and then now we need to recover the database. Not only do we have to fix the corruption, but we have to get back all those rows that were somehow thrown out.
Derrick Bovenkamp 21:30 I mean, I would say, like in the extreme, lucky it gets stuck and it can’t throw anything out and it and they’re okay, but a lot of times we, you and I have seen it, it’ll throw an entire table out, not just an 8k chunk. When there is maybe only a couple 8k chunks that are actually corrupt, it’ll just be like, throw that entire table out. And we’ve seen people, and I don’t think, and you know, to be fair to them, they’re in a stressful situation and they’re they’re reading what people say. We’ve seen people run that and not really realize that. Hey, maybe it says it’s not corrupt now, but I may have lost crucial data. And if you’re talking about payroll data or banking data or healthcare data, you know how many of you are willing to just, you know, go into a filing cabinet and pick a folder and throw it out and know you’re never going to need it again.
Steve Stedman 22:29 And you know, if, if this is my bank, and they’re running repair allowed data loss, and I suddenly looked and see that my account balance is not what I expect it to be, and they just happen to throw away a month worth of my transactions at the bank that would not make me very happy. So one of the things that we always tell people before ever using repair, allow data loss, call us and we can talk about options, and that’s one of the things we offer a free 30 minute consult. If you’ve got a corrupt database, call us, and sometimes we’ve been able to talk people through in 30 minutes and come up with a solution. They go and do it on their own. Didn’t cost them anything. Other times, it leads to a full corruption repair on our part. But either way, we’re going to use that 30 minutes to take the best stab we can at doing whatever we can to get that database repaired for you. Now with that, there’s also the repair rebuild. And from my experience, repair rebuild is okay to run if you’re running DVCC check table with repair rebuild and you’re just repairing an index in a table or something like that. But oftentimes, if it’s something that’s actually in the data pages for a table and not an index, it’s going to push you to run repair allow data loss. Oh, and that’s one of my peeves as well is that when you run check dB, it will often tell you repair, allow data loss is the only option to get this back and or to get your database back. And yeah, it’ll bring your database back, but it might be missing a whole lot of data after you run that, so do not run repair, allow data loss. And just as a reminder, because it Yeah, it throws Data database out. So Derek, I know we’ve done a lot of corruption repairs for people, and when is the right time to call in the experts for database corruption from from your experience,
Derrick Bovenkamp 24:14 I would say, you know, you call the experts in early as soon as you realize that you have corruption. I think it’s smart to call somebody in and, you know, if you determine that we’re just going to go back and do a restore, and it’s not that going to be that big of a deal, then you know, maybe you, maybe you do a quick consult, or you have a couple hour engagement and and you’re good, I think the riskiest thing to do, and a lot of us it, people do it. And I’ve been guilty of it, of myself is, you know, you go to Google and you start finding things, and that’s the repair a lot of data loss. Maybe Google suggests some other things. You and I have seen people buy software that promises to fix corruption and doesn’t fix the corruption. I. You know, there’s no software that I know of that you can buy and install on your computer, and it’s going to fix your database corruption. It’s going to require a human behind that. So I would say, you know, talk, talk to an expert. Steve said, you know, you were saying earlier, shameless plug, you know, call us. We’re, happy to talk with you and do a consultation for free. And you know, we’ll give you the best advice that we can give you. And you know, sometimes that is, we need to do a whole corruption repair. And then sometimes that is, we talked to you. We talked about your recovery point, objective, your recovery time objective, which I guess will be on next week’s episode. And and they just say, hey, just just go back and do a database restore, because you’re fine losing two hours of work today versus the cost to repair a database. So I think you and I, we take pride in that, that we’re always going to give you the best advice we can give you, even if that advice is don’t engage us to fix this corruption.
Steve Stedman 26:02 what’s interesting with that is I think we’ve got the most corruption repair calls on Thursdays. And I think the reason that we’ve got them on Thursdays, and usually it’s a Thursday afternoon, is that the local team is working on it. They’ve tried everything they can. The weekend is approaching. They know the business is in trouble, and they know they don’t have a hope of fixing it at this point. Now, we’ve seen that there’s been a number of corruption repairs where they’ve called us on a Thursday, that we’ve let them run on a Friday, and then immediately it ended business on a Friday. We started a corruption repair. You and I tag teamed it through the weekend. Basically worked 24/7 well, not 24/7 but 24 hours a day until we had that corruption repaired. And in several cases, in many cases, we had that repaired and ready for them to go by Monday morning.
Derrick Bovenkamp 26:56 And yeah, our one of our biggest wins, I and I think it was a Thursday night. We engaged, they engaged us, and we had a deal signed, and we did that repair overnight. We were actually able. Their server was fast enough, you and I did it overnight, and they were open for business the next day in the time of the year. It it’s, it’s interesting that we’re doing this close to Thanksgiving. I think you and I remember a over Thanksgiving corruption repair.
Steve Stedman 27:29 I think we’ve done two out of the last five years where you and I worked the entire Thanksgiving weekend on corruption repair. And, yeah, it’s today. Is today. We’re recording this in live streaming on Wednesday, the day before Thanksgiving in the United States. And yes, we have had two out of the last five Thanksgivings where Derek and I have had a very profitable weekend, but we’ve worked the entire Thanksgiving weekend in order to get someone’s database repaired and up and running before Monday morning came around, and we were successful.
Derrick Bovenkamp 28:00 So Steve, we’ve talked a little bit about, you know, who to call and what causes corruption and repairing corruption. You know, what are the hidden costs of ignoring corruption?
Steve Stedman 28:12 Oh, boy. Oh boy. That’s a That’s a tough one, because it’s one of those things that somebody may say, Hey, I’ve had a corrupt database for four years, and it really doesn’t impact my business. I just know that when I query this one table that I get errors once in a while. Well, that’s kind of like ignoring cancer. If you have cancer, it’s it’s not going to get better by ignoring it, and you might get lucky for a while if you ignore it, but ignoring it may lead to complete and catastrophic data loss, you may end up losing your entire database if you ignore it. Now, some people say, Well, if I ignore it, well, the database continue to rot. Will it continue to grow and spread? And technically, once you’ve got corruption, it’s not going to, like, leap from one table to another and like, like a virus or something like that would grow. But if you have corruption in an index, and you do rebuilds and things like that, it could lead to the corruption spreading throughout an existing table that’s already corrupt. But oftentimes your database will get worse, not because the corruption spreads, but because whatever cause of that corruption might still be there, like your i scuzzy example you mentioned earlier, Derek, if you’ve got an i scuzzy device that’s throwing errors and causing bad rights, your database is just going to get worse and worse and worse until until it doesn’t exist anymore, or until it fails so bad that you can’t bring it back online.
Derrick Bovenkamp 29:39 Yeah, and I know we’ve had a few instances over the past, you know, five or 10 years, however long, we’ve been doing this together that, you know, there’s been a few where we’ve really had to have a sit down conversation with a customer and be like, you have database corruption, and this needs a. To be addressed, and the one that I’m thinking about, and Steve, for those of you who don’t know, Steve is a volunteer firefighter, EMT and isn’t there a term that you have for the patient being really concerned about a minor injury and not realizing that they have a major injury.
Steve Stedman 30:22 Oh, yeah, that’s a distracting injury.
Derrick Bovenkamp 30:26 we have there, there. I don’t remember all the details, because it was a while ago, but there was a very specific one where I think the customer, I mean, I think maybe they called us because of a performance and their database was slow, and they were very concerned about these reports running slow and completely ignoring the fact that we were telling them that they had corruption, and their database was in a very unhealthy spot. So yep. And that’s like, corruption seriously, yeah.
Steve Stedman 30:55 back to your analogy of me being a firefighter, and EMT, I mean that would be like a patient that has a head injury, and if we don’t deal with that right away, they may have brain damage, versus I’ve cut my hand and it’s squirting blood everywhere. The blood squirting everywhere out of your hands is the distracting injury. In this case, it was their slow performing database that was a distracting injury, and the root issues they were having were because of corruption. So okay, well, I guess on to some of the horror stories that we’ve seen around database corruption, and I know well, first off, before we get into horror stories, I want to play out one difference between a ransomware, encrypted database versus a corrupt database. Now, we’ve seen a lot of horror stories around ransomware, but oftentimes, if a database becomes ransomware, encrypted, SQL server thinks it’s corrupt because it’s not the right format, of course, yeah, but it’s not something that can be fixed to any kind of a corruption repair when it’s ransomware, ransomware is that the whole database file has been encrypted, and that’s something we can’t typically help with. From a restore perspective on restoring the file, now we can help you with backups if those are available. But as far as actual corruption horror stories, I mean, I can think of one, well, I guess one. We got the call, I’ll jump into one, and then I’ll ensure but it was a medical clinic, and they had a a RAID array with a, I guess, in your rate array, you have a checksum drive. So you’re usually able to lose one drive in a raid five array without losing any data, as long as you get that drive replaced quickly. What had happened is they had lost two drives, and then they replaced them, and somehow the rate array was still online. But when we looked at the data file, and I think they had four, so basically four stripes of data across the five drives, effectively, and we looked at the data file and actually scan through it with the hex editor, and the data file looked fine. It looked fine. And as we scan down, then we hit a section that was just complete garbage, and then it was fine again. And then there was another section that was complete garbage, and what that was caused by was they had lost an entire stripe on their their rate array. Hopefully I’m using my terminology on this correctly here, where one whole disk, or roughly 25% of their file, was just gone. And that’s one where, okay, so the first thing I recognized was there’s no fixing that. There’s no fixing that file, because the data is just completely gone. So what we took a look at was what backups you have available. Well, they weren’t exactly concerned about backups prior to this, and they didn’t have any backups available. And I had to explain to this medical clinic that, I’m sorry, there is no way possible to recover this. Yeah. And I mean, that’s one of them where we had our 30 minute call. I probably ended up turning into an hour and a half call consulting them initially, and came back and said, Gee, we’d really love to help you with this, but there is no way possible to fix this because your data is gone and you don’t have any backups from prior prior to that.
Derrick Bovenkamp 34:14 and I remember that, I think we felt so bad for them, we don’t even think we build them for that call?
Steve Stedman 34:23 No, it was the free 30 minute consult.
Derrick Bovenkamp 34:25 It turned into an hour and a half, an hour and a half. And I remember, I think they were able to find a database from, like, I want to say it was like two years earlier, and that was all they had, and they had to rebuild from that they lost. Like, it might have not been quite two years, it might have been 18 months, but they lost over a year of data, and that, that’s all they had. That’s all they could go off. And that’s very close to the corruption repair. We’ve been on a lot of, sorry, the ransomware. We’ve been on a lot of calls where ransomware has taken effect. And that’s, you know, the usually. The data file is gone. I know there’s been a few where they paid the ransom and unencrypted the data file. And I think, you know, we looked it up and, you know, on the hex editor, and it was basically like 50 gigs of zeros. You know, there was no data there to repair. So I think that is separating a, you know, database corruption versus entire loss of the file or an entire loss of a stripe from a RAID set.
Steve Stedman 35:28 And speaking about the ransomware thing, I mean, we’ve dealt with enough people who got hit with ransomware and trying to help them after the fact that it’s pretty clear that the ransomware development teams, the people who build this ransomware, don’t have the best QA teams on testing their unencryption process or their decryption process after people fail. And oftentimes, because your database files are some of the biggest files you have in your entire system, they oftentimes will not decrypt even if you pay the ransom. And that’s I mean, that alone is a nightmare if you don’t have those immutable backups.
Derrick Bovenkamp 35:59 Yes, for sure. You know, I mean, we’ve talked about some horror stories, but really, we have a lot of success stories. And you know, we talked about the the over Thanksgiving we’ve done the overnight, the over weekend, I would say every, every corruption repair that we’ve taken on, we’ve turned out successful in the end, and we’ve either recovered 100% of their data, or we were able to tell them exactly what they lost. You know, there was a couple where we said, you know, you lost this record and this record, but the rest of the table is fine, versus before, you know, there was great parts of the table, or the entire table they couldn’t read.
Steve Stedman 36:43 Yep. So let’s talk about some of those wins then, and I’ll jump in with one. One of the big wins we had was where the client, their servers, were too slow to do the corruption repair. So what they did is they spun up to, I don’t remember, was Azure or AWS servers. I think was AWS servers. I think it was AWS Yeah, that were really hefty, high powered, really expensive servers, if they’d kept around for any amount of time, but it gave us super fast servers to work on for a weekend in order to do the corruption repair. And we probably that’s a corruption repair that probably would have taken us a week to do, but we were able to get it done in a weekend, because of those high powered AWS servers that we can work on and do the repair much faster than would have been possible on existing hardware. Yeah, I
Derrick Bovenkamp 37:31 remember that specific one, and those were like five to $10,000 a month servers, but they only needed them for a weekend, and we were able to do that repair, and that’s when they it would have taken a week or more to do on their existing hardware. It was slow. And this is something to think about when you know this is a reason to buy new hardware. When you need new hardware is to make sure you have enough overhead that you can run that check dB. You know that we’ve run into ones where, you know, it’s hard to do the corruption repair because the server’s slow, but that’s also why they weren’t doing the check dB. Is the server slow? So make sure that you have good hardware, and that’s also a reason you Steve, we’ve had a couple of customers where it wasn’t the only reason why they upgraded to SQL Server Enterprise, but there was multitude of reasons, but one of the reasons for upgrading to SQL Server Enterprise was the multi threaded check dB, which, oh yeah, took a six plus hour check DB and made it a two hour check DB because the hardware was fast enough to handle it. It just needed that multi threaded support that comes with SQL Server Enterprise.
Steve Stedman 38:48 Yeah, and that brings up an interesting question there, then, is the cost of SQL Server Enterprise worth paying for in order to get that? And I think that’s where it depends sometimes, sometimes it’s absolutely worth every penny if it means you’ve got a solid system that not going to knock your business out. Sometimes it might not be worth it, but it’s one of those things that varies for every business. So, yep. So how about testing backups? I mean, that’s one of those things that and I’m going to talk about this again in our RTO and RPO conversation. Is I don’t think a backup is a backup unless you’ve actually tested it. And so many times we’ve run into people who thought they were running backups and they never have ever tested that backup. And I think that we’ve seen some sad stories where people didn’t have an option for recovery because they hadn’t their backups that they thought were running were not actually working. Yeah.
Derrick Bovenkamp 39:47 I mean, I think, you know, the testing, the testing is important, and that’s also how you come up with your strategy. You know, I this applies to more than just SQL Server, but, you know. How often are you getting a full backup? And the thing that I think about is a lot of the times I like to get a full backup once a week. Differentials daily and in log backups every five to 15 minutes, some cases, every one minute, depending on the customer’s requirements for recovery point objective, but I, I’ve certainly seen less with SQL Server, but more with other backup software. You know, there’s some backup software that that they say, Well, we do, we do basically, like differentials forever. We only ever do, or not differential forever. But basically we, we only ever do what’s changed, and we never do a full backup at we only do a full backup once and anytime that I’ve used software like that, I always like, hey, let’s like, every week get a full backup, and then we’ll do our incrementals. Incremental forever was what I was looking for, which is not a SQL Server term, but you’ll see that with backup software and make sure that your I would not do an incremental forever. But what, what I’m thinking about, and you and I have seen this before, is backup software and internal SQL Server backups fighting with each other. Instead of one of them being set to copy only, they’re both truncating the logs, and in that case, because the customers didn’t test their backups, they didn’t realize it, and they didn’t have any monitoring in place to tell them that their backup software was fighting each other, and they only had full backups to go back to. They couldn’t go back to these log backups that they thought that they were getting because they were being basically truncated by the backup software.
Steve Stedman 41:46 yeah, we’ve seen that too many times. And in fact, that’s one of our checks that we have in our standard monitoring. So at this point, if someone’s listening to listening to this, and they’re just thinking, wow, I don’t ever want to be in a position that I have to recover from a corrupt database. What kind of things could you think of as best practices or ideas that might help minimize the risk of getting hit with corruption?
Derrick Bovenkamp 42:09 I would say, certainly, you know, no. The the free options out there is, you know, get the OLA hologram scripts. Set those up. Make sure that you’re doing check DB regularly. Make sure your notification set up so you are getting an email if it fails, make sure your backups are set up. Make sure your backups are tested. And you know it is okay to raise your hand and say you need help. And a lot of us in the IT industry, you know we want to know everything, and we want to do everything, but it’s totally fine to raise your hand and go, you know, we need to bring in an expert on this area. You know, I’ll handle everything else, but we need a SQL Server expert. We need a DBA and that that’s where the Stedman team, Stedman solutions comes in. We have multiple options from will take over your entire DBA role to smaller, more monitoring level options, or you maybe want to bring us in just for a single engagement to get your secret, get all your backup set up, set all this up for you and and then you’ll monitor it from there. So I think that the big thing is make sure that you’re doing your backups, make sure you’re doing your check dB, make sure you’re staying up to date with your hardware drivers versions. One of the things we didn’t talk about is, you know, some of the earlier versions of SQL Server were more known for issues. And you know, you see Microsoft, sometimes it’s not just security updates. When you see those Cumulative Updates, there’s there’s updates in those updates that fix data availability issues. So stay on top of your updates. And, oh yeah, don’t be years back.
Steve Stedman 44:04 so you’re on that topic. Derek, if I just throw out a random number like 10.5, dot 2500, what is that? What does that mean to you when you hear something like that?
Derrick Bovenkamp 44:16 was that the initial release of 2008
Steve Stedman 44:19 or two I think it was one of the first patches on
Derrick Bovenkamp 44:22 it was one of the first patches. And anytime I see that number, that’s the one that we we refuse to work on until it’s been patched to a higher level. But, you know, there is a lot of the times Microsoft’s gotten better with this. We’ve certainly, certainly seen, because there’s been some security issues, they’ve been pushing out Cumulative Updates and service packs inside of Windows Update. But before they did that often, and still, you and I see it sometimes, we’ll see a release to manufacturing version of SQL Server on a computer, and we’ll be like, hey, that’s great. You’re still running SQL Server 2016 It’s still in the Microsoft support, but you haven’t installed an update in five years, which means you’re open to security issues.
Steve Stedman 45:11 So one of the things, probably some of the most expensive work that we ever charge for is database corrupt repair and with our SQL Server managed service offering, one of the things we include at no extra charge, is database corruption repair. So how can we get away with giving away to our managed service customers our corruption repair without charging them anything extra for it?
Derrick Bovenkamp 45:40 Well, I would say that, Steve, that’s because, primarily, we have the backups in place. So we’ve worked with them to make sure they have the backups in place so we can do a restore. But the secondary, which is really close to a primary, is we have all the notifications in place. So one we’re doing check DB every night, but we also have notifications in place for disc right errors in discrete errors. So usually, if we run into corruption or possible corruption, we know about it pretty much instantaneously, and we can work with the customer to do a restore instead of a very expensive corruption repair.
Steve Stedman 46:28 absolutely, and that puts us in in the position of the absolute least downtime. Because, I mean, we’ve seen people, they get the downtime of three or four days because they have to do corruption repair, whereas with our managed service customers, we’re taking care of them so that we’re ready and we can react quickly when that happens.
Derrick Bovenkamp 46:47 so Steve, you know, how do we recommend customers handle like emergency response and strategies and prioritizing recovery steps with corruption during a major outage? Because a lot of the times when we run into these guys, run into these customers, it’s the system that they have to use, like they they don’t work without that system.
Steve Stedman 47:09 so earlier, you mentioned that I’m a volunteer firefighter in an EMT, and I’ve seen some pretty, pretty scary, pretty life threatening kind of things happen, life or death situations in in that when we’ve dealt with corrupt databases, I’ve seen some business issues that are pretty big, but I’ve never seen a life or death situation. So whatever is going on with your system, take a breath, relax for a minute and realize whatever goes on here. Nobody’s going to die. Okay, maybe somebody’s going to lose their job, but in the end, nobody’s going to die on this corrupt database you’re dealing with. And ask for help. Stay calm, ask for help, and we can get involved and help with that. And sometimes the help is just telling you what to do or how to handle it. Other times, we can jump in and do the full repair, but know when to ask for help. If your house is on fire, you’re generally not going to try and put it out yourself. You’re going to call the fire department. Hopefully think of a corrupt database that way. Think of it as ask for help, and we will do what we can, but also practice, practice, practice so that you’re in a position that this is not the first time you’ve ever dealt with a corrupt database. I mean, we have the database corruption challenge that I did, I think in 20, 2015, that you can get through our database corruption class, you can get and practice all of those corrupt database repairs. I mean, practice it ahead of time and be in a position where this is not your first time ever doing a corrupt database. When you talk to people, you squeak in your voice because you’re so panicked. I mean, that’s a sure giveaway to management that you don’t know what you’re doing so well, practice time so you can be calm and and take a good approach to it.
Derrick Bovenkamp 48:58 and one of the things, and I think, I think we should do a session on this. It’s not directly SQL Server related, but do a tabletop exercises. Tabletop exercises are becoming very popular in the ransomware area of sit down with management and go, Okay, we just lost our main operational system. How are we going to operate? How? How are we going to have continuity of business in think where you’re at before it happens, because then you Yeah, is it going to be pretty? No, it might not be pretty, but at least you have a you have a binder, and you have people that know the process that’s going to happen if that happens before it happens, and that way you can focus on staying calm and and working the problem instead of, you know, running around trying to fix things. And I think that’s one of the other reasons that you can bring in help you bring in firefighters. You know, firefighters usually, probably don’t get stressed. When there’s a building fire, because they’ve done it before, and they know exactly what they’re doing. And that would say the same thing with Steve and I, is when we hit a corruption repair, we’re usually very calm, and we’ve done it before, so we go through all the steps that we need to do to bring the customer back online, because we’ve done it before. So bring in the experts, is what I would say
Steve Stedman 50:24 bring in the experts. Practice on your own ahead of time, and just back to the Boy Scouts. Be prepared, right? Yeah, be prepared. Okay, so how about corruption when it comes to compliance? Like, let’s say you’re in a business where legally, you have to keep history of something, whether it’s banking transactions or medical records or things like that, and you hit a corrupt database that destroys your data and you didn’t do the right things to get it back. What’s What does that mean for your compliance?
Derrick Bovenkamp 51:00 it, can mean fines. It can mean, you know, may make major, major issues. It could mean the end of that business, depending on what those fines will be. I know in the banking industry, when certain things happen, the government will actually come in and take over the bank, you know. So I think we’ve seen stuff where it is, it is meant the end of the business. If they can’t get the corruption repaired, you know, they’ll sell out to somebody else. So, yeah, know what your regulatory requirements are, and just know what your requirements are to keep the business operational. And that’s where sometimes you and I have seen a disconnect with, you know, this is the most important thing to keeping the business running, and we’re not getting backups of it, or we’re not going to put any money into you, and we’re running on 10 year old hardware, and we refuse to put any money into the hardware when the doors will shut in the business if we don’t put The money in. You know, one of the another session that you and I should have, Steve and we’ve helped customers with this is sometimes we’ve been able to significantly reduce their SQL server licensing costs by getting them into new hardware. We’ve got them down from, you know, 3216 3264 core machines to eight core machines, or four core machines. Just, you know, just because the hardware was so old, they were spending so much money on licensing, if they just updated, they could cut their licensing costs,
Steve Stedman 52:33 yep. But back to some of the outcomes from corruptions. I mean, one of those things that I often have heard in the DBA world is a resume generating event, something happens so bad that your only fix for it was to work on your resume and go find a new job. Now we do everything we can to make sure that that never happens, and if somebody’s in that position, you should call us before working on your resume, and we’ll see what we can do to help you out. But being prepared, I think, is the best way to not generate a resume, generating event for sure. All right. Well, Derek, thanks for joining today. You’re welcome, Steve. At the end of the podcast, we normally have an ask Steve question, and the question that came in today was, what are the best practices for defragmenting indexes, and can this cause corruption? And since you’re the guest Derek, you want to, you want to take a shot at that,
Derrick Bovenkamp 53:30 first shot at it. And you know you, you actually touched on it. I think earlier in our discussion, is rebuilding indexes is not going to cause corruption. However, it can make a problem worse. So if you have that underlying IO problem, and you and I have seen this a few times, the it was caused during an index rebuild. The index rebuild didn’t it wasn’t what caused the problem, but the problem was already existing, and because of that, all extra IO that’s when the problem originated. So it won’t cause the problem, but it can bring the problem out, if that makes sense,
Steve Stedman 54:11 yeah, if you’re already having some issue, probably IO related, that’s causing the corruption, anything that’s changing, large amounts of data can make it worse. And whether you’re doing just an update on a table or you’re rebuilding indexes. Yes, that can make it worse. However, on a regular basis, I would never recommend staying away from defragmenting indexes prevent corruption. Now, if you do have corruption, you already know about it, one of the first things I would do is turn off some of those index rebuilding jobs, and some of the things that are going to put a lot of load on that database and cause a lot of IO because, yes, that would make it worse at that point. But no, doing good index and statistics maintenance is never going to be the cause of corruption. Something else may be failing on your system that causes it all right? Well, I think that wraps it up. Thanks, Derek. Well. One of the things I want to mention as Christmas is approaching, Stedman Solutions has a great Christmas gift or stocking stuffer, available to purchase for anyone you know that likes to write queries, or even if they don’t like to write queries, but they have to write queries. It’s a good gift for them. It’s our SQL Join type socks available stedman.us/join socks. And you can get socks with inner join, left outer, join, RIGHT OUTER, join, left outer, join with exclusion, all the different ways you can join tables on your socks. Great stocking stuffer for any SQL programmer or DBA and if you ever have to go for a job interview and someone asks you that question about, well, can you tell me the difference between inner join and outer join? If you’ve got these socks on in that interview, you will literally wow your interviewer at that point. All right, so speaking of the holidays, let me I put together a holiday whatever your plans for this holiday season. Just know, if you’re a Stedman solutions managed service customer, we’ve got your SQL Servers covered. Nothing to stress about. There. You can get away, do what you want to do for the weekend or the holiday, and just relax knowing our team will be there to take care of your SQL servers over the weekend. All is calm, relax and enjoy your Thanksgiving. All right. Well, join us next week as we talk about RTO and RPO, and remember, you can watch this in other episodes on YouTube and on Spotify, and links are available at stedman.us/podcast thanks for joining us. That wraps up this week’s episode. You Steve, thanks for watching our video. I’m Steve, and I hope you’ve enjoyed this. Please click the thumbs up if you liked it. And if you want more information, more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create you.
Contact Info
Stedman Solutions, LLC.PO Box 3175
Ferndale WA 98248
Phone: (360)610-7833