Master SQL Server Corruption - Stedman Podcast Tips

Host: Steve Stedman / Shannon Lindsay
Recording Date: February 18, 2025
Topic: Corruption
Listen on Spotify

Stedman SQL Podcast Season 2 Episode 6 Corruption

In this episode of the SQL Server Podcast, we dive deep into one of the most dreaded issues for database administrators—SQL Server database corruption. Corruption can strike at any time, causing data loss, downtime, and major headaches. But with the right knowledge and tools, you can detect, repair, and even prevent corruption before it becomes a disaster. Join Steve Stedman and Shannon as we discuss: – Common causes of database corruption – How to detect corruption using DBCC CHECKDB – Best practices for repairing corruption and recovering data – The role of backups and corruption-resistant strategies – Real-world stories from fixing corrupt databases Whether you’re an experienced DBA or new to SQL Server, this episode is packed with actionable insights to help you protect your data and keep your databases running smoothly.

Podcast Transcript

Steve Stedman 00:00

Welcome everyone to the Stedman SQL podcast, season two. Episode Number Six. I’m your host, Steve Stedman, and today, Shannon Lindsay joins me as my co host. Welcome Shannon. Some news. Just had a recent release of database health monitor with several new features, including six reports based around finding and diagnosing deadlocks, as well as some new alerting where we’ve got email alerts set up on a number of different things that you can schedule for you or your team around SQL Server events.

Welcome listeners to the podcast. Two episodes ago, our indexing episode was one of our most popular so far. So if you didn’t see that one, check it out. You can find that on YouTube or on Spotify. And this week’s topic is on database corruption. So onto the topic.

Shannon Lindsay 01:17

Well, I know you’ve worked with corruption for a while, so when did you start working with it, and how often do you think that that comes up?

Steve Stedman 01:27

Well, I think I started working, gosh, I worked with corruption prior to 2015, a few times, just working on it and kind of diagnosing specific issues and then learning and trained more about it. But then in 2015 after, well, right after I started Stedman solutions, I created the database corruption challenge, which was an online competition where that I hosted, where we had 10 corrupt databases over about 10 weeks. I say it wasn’t exactly 10 weeks, because sometimes it was every seven or eight or 10 days instead of every week, but we basically had a competition with a bunch of people online, competing on who can be the best fix this corrupt database. I would go out and create a corrupt database, and then I would release it and see who could fix it the quickest.

So yeah, so 2015 I got pretty serious about it, and then after that, we started taking on quite a few corruption repair projects through Stedman solutions. So for instance, it’s February now, and we’re recording this, and we’ve done two corruption repairs for clients so far this year. And with that, that’s kind of a typical pace, maybe one a month or there, but sometimes it speeds up, sometimes it slows down.

So but the thing with corruption is there must be something about Thursday afternoons, and because that’s what I found, is that when people call up asking for corruption help, if it’s a Thursday afternoon, we almost always end up going into the weekend to do a corruption repair for that client, or soon to be client, as soon as they soon as we get a contract signed. And I don’t know what it is, but if maybe you’ve got corruption, you’re working through the week, you realize the weekend’s coming and you don’t have a solution yet. Well, that’s when people reach out to us. And for some reason, more than half of the corruption repairs that we’ve done, for some reason, have come in on a Thursday afternoon, and usually the we have that initial contact with the 30 minute consult, we figure out if it’s something we can help with or not, and then from there, we get a contract in place, get initial payment, and then we usually get rolling on it as soon as the contract and payment are in place, which usually means later Thursday night or possibly Friday, sometimes end of day Friday, to work around the business schedule. And we’ve done several of those where we’ve been able to do the repair and get it done by Monday morning.

Shannon Lindsay 03:54

Yeah, what tends to be the cause of SQL Server database corruption? What tends to start that corruption?

Steve Stedman 04:05

Oh, that’s a good question. And I think there’s a lot of confusion or misconceptions around this. And with a recent repair we did the client asked me, Is there something that could have been done in the application, or something that their client could have done, or somehow something that the program could have done by running T SQL queries to somehow corrupt their database. And then the answer to that is generally no. I mean, if you’re just running straight T SQL statements, select, inserts, updates, deletes, merge, stuff like that, you’re not going to cause corruption.

Now, what it usually comes down to is some issue related to IO, and there’s a lot of different layers with IO, disk IO that start with SQL Server at the top level says, write this page to disk and a page chunk of data. And then from there it goes the operating system, and then the operating system goes to the disk subsystem drivers, whatever they may be where they’re low disk, or whether it’s Network Attached Storage or something like that, and then from there, it eventually goes through a number of different layers until your data actually ends up on byte AS bytes on disk somewhere. Now with that, any one of those layers, if there’s some kind of a glitch or problem or a hiccup, you could end up with corruption, and we’ve seen this where you had a client that we did a repair on, had a legging strike in your area that took out power to the building, but it got hit with a surge right as data was being written against that caused a small chunk of corruption that we ended up being able to repair and fill in the blanks. Sometimes it can be as simple as somebody’s power switch or the power cord on a server and that the IO was interrupted somewhere, right? And it can get completely written to disk. We’ve seen it where it’s been drivers. For instance, when we had a network attached storage situation with one client, it was drivers on the network switch that were causing the bytes being written to disk to be corrupt. But really what causes corruption is something wrong with the IO between when SQL server says, Put this on disk until when it ends up on disk.

So another example with that an IO issue is we had a client that had corruption several times just day after day after day on the same specific SQL Server running as virtual machine, and it was actually corrupting almost as fast as we repaired it. And what we found was, we’d run CheckDB, we get one set of corruption. We’d run CheckDB again, we’d get a different set of corruption. And we actually figured out with that one that there was some problem with reading from the disk, so it was erroneously reporting that there was corruption, and we ended up moving that virtual machine to a different host. And it’s been five years since that move happened, and there’s never been corruption since they did something to reprovision or rebuild that original host. But it had to do something with the virtual machine, a host server that was hosting that, and by just migrating that to a different host, the corruption went away. Never saw corruption on that one again. So, it’s something to do with writing a disc, and there’s a lot of layers that your data goes through before it actually ends up as bytes on disk. So anything along there that can corrupt it or mess with it can cause problems.

Shannon Lindsay 07:21

Is ransomware encrypted database the same thing as corruption or is that different?

Steve Stedman 07:28

You know, it’s different, but we get a lot of calls from people where they have a database that’s been encrypted with ransomware, and they’re getting similar errors to what you might get with corruption, but usually with ransomware encrypted databases, it’s the entire data file that’s been encrypted, or, from SQL Server’s perspective, it’s the entire data file is corrupt to the point that nothing can be read because it’s been encrypted by the ransomware. Now that’s a very different situation, and how we go about resolving that, or how you go about resolving that is very different when you’re working with a ransomware issue versus a corrupt database. Now, with the corrupt database, usually, most of your data file is fairly intact, and you’ve just got a few holes here and there that have been damaged when the specific corruption hit with ransomware, it’s the whole file that’s usually destroyed.

Shannon Lindsay 08:21

How can I know if my database is corrupt?

Steve Stedman 08:26

Well, the only way to really know for sure is to run CheckDB frequently. Now, what happens a lot of the times, if for people who don’t run CheckDB frequently, is they end up running the query and they get some messages about IO issues or integrity of there’s a handful of different messages that you get back from your query showing that something failed when it was trying to read data. But the only way to know for sure is to run CheckDB frequently and to make sure that that you’re clear.

Shannon Lindsay 09:03

How often would you say that you should run CheckDB?

Steve Stedman 09:08

Well, this, this is a topic of some great debate, and that there’s some people who never run it, and they end up trying to run it once a month, and they think that that’s an improvement, but really it’s not that much of an improvement. So my answer for this is, how important is your data and how much do you care about it? And if your data is important and you care about it a lot, then you should run CheckDB frequently. And by frequently I mean a minimum daily, or maybe twice daily or continuously. If your system load allows for it, the more corruption that I see, the more frequent I think CheckDB should be run. And I have one system that I work with that it’s got a real light load on it. It’s never overloaded on CPU or IO. So I run CheckDB on that server every four hours. Where others? I mean, the minimum standard that I want to go with is once a day. And I wish I could do every check TV, on every database, every four hours or less. But really, it comes back to the importance of your data. And if your data is not important, yeah, don’t run check TV. Don’t worry about it. Don’t do that. If your data is important and it’s vital to your job or your company or your business, well, run it frequently. And what we run into a lot is where you’ve got a system that’s maybe grown to the point that it’s a little bit overloaded, meaning you don’t have enough IO and CPU to be able to run CheckDB and do reindexing and do your backups and run your data migration, or your warehousing type jobs and stuff like that, all in a 24 hour period. And some people say, well, okay, we can slide CheckDB to once a week, but the thing with that is that running it once a week, there’s a whole lot of data you can lose before you find out that you’ve got a correct database. So, I would say, if you’re in that situation where you’re having a problem, I mean, the thing is, if your system’s overloaded, you’re having problems, and you don’t have enough of a window or a big enough time to run CheckDB. Well, perhaps you need a bigger server, faster server, or better server. And what happens a lot is people say, Well, I have to run my re-indexing jobs every night because that’s more important for performance. Well, sure, that’s probably true from a performance perspective, but I’d rather know that my database is solid and stable and not going to disappear next week because of corruption in the file, then have the database running as fast as possibly could. So, I guess the thing is, if you can’t run CheckDB daily, you need to look at your budget, figure out how to get a better server so you can run CheckDB daily. I didn’t used to be that pushy about it, but having seen more and more corruption where people run CheckDB infrequently, that they could have recovered much better had they been running it regularly. I just, I’m getting more and more pushy on that one, because it’s important,

Shannon Lindsay 12:12

What are some common DBCC, CheckDB problems that you’ve come across?

Steve Stedman 12:19

The biggest one, I think, is CheckDB not being run, or with that people thinking that they run CheckDB three years ago, they set up a job to run CheckDB, but they put it on the wrong schedule, and it ran like for a month, and then it hasn’t run into in the last three years since then, that’s a big issue where we ask people when we’re looking at encryption repair on systems that we don’t work on regularly, where we say, well, when was your last clean check? DB? Oh, well, 18 months ago, or 24 months ago, we had a clean CheckDB, but it hasn’t been checked since then. That’s, that’s a huge issue. I mean, that’s a gigantic issue that that shows that any time in that last 18 month window, corruption could have been introduced, and you don’t know when the other one is the I mean, not being run is bad. Being run occasionally, CheckDB, running occasionally, like weekly. Like I said earlier, if you’re only running CheckDB weekly. Well, there’s a lot that can go on and a lot of busy systems in a week that could be catastrophic to your business. And I think weekly is one of those that people kind of do as a compromise, but it’s not really that much better than I mean it’s not a good scenario to be running it weekly. And then the other situation with it is with CheckDB being run less frequently than the backup retention period. So, let’s say you’re kind of squeezed, and you’re in a bad position where you’ve only got three or four days’ worth of backups available, and you’re running CheckDB once every week, or once every two weeks. Well, that’s almost pointless, because you’re not going to have any backups available to be able to recover from that go as far back as when you discovered that corruption was introduced by because of the infrequency of CheckDB running there. So, I would say that. I mean, if you’ve got backups, I mean, whatever your backup retention is, you’d better be running CheckDB much more frequent than that protection period, and then the other one is not checking the results or alerting on CheckDB failures. So, let’s say you set up a job in SQL Server Agent that’s going to run CheckDB daily, or let’s say you’re really cool and you’re going to do it every three or four hours, but you don’t turn on alerting when that job fails. Well, it doesn’t really matter if you’re running CheckDB if you’re not going to get alerted when there’s a failure. I mean, that’s like having a smoke alarm in your house and your house catches fire and but you’ve taken the speaker out of it. Well, make all the noise. That’d be totally silly. Well, running CheckDB without notification on Fill. Is just as silly. And then the other one is people not turning on alerts. Is a problem I’ve seen around this where if you turn on alerts in SQL Server Agent for severity, 21 through 25 errors and errors 823, 824, and 825, these are all the kind of errors that will start being thrown by SQL server when you’re starting to see IO related issues or the things that cause corruption. So these things, those specific severity errors, 21 through 25 and errors 823, 824, and 825, are the kind of things that are like your smoke alarm. Those are the kind of things that they’re going to tell you there’s a fire before your whole house. Your whole house burns down. And if you know about those errors, 823-820-4825, or the alerts, 21 to 25 that’s the kind of thing that when you see those come in, you should stop what you’re doing and figure out how to deal with those right away, because that’s an indication that you have in pending Doom approaching very quickly with your single server. What

Shannon Lindsay 16:05

What is MSDB suspect pages table?

Steve Stedman 16:10

Oh, that’s a good one. The MSDB suspect pages table, I like to describe it is, it’s a complete distraction from reality and what this is, there’s, there’s a ton of misinformation about this, and some people believe that if I look at the suspect pages table and there’s nothing in there, well, I don’t have any corruption. Well, that’s not true. So I’ve dealt with many, many, many corrupt databases over the years, and the only time that I ever see anything that gets written to the corrupt page or the suspect pages table, is when you’re doing a restore of a database, and corruption is encountered when the Restore is occurring. It’s found a bad page in the restore that will certainly get written to the suspect pages table. It’s not a useful tool for detecting corruption. Might be a handy tool if you’re trying to tell if you’re restoring the database that’s corrupt, but beyond that, it’s almost completely worthless, in my opinion.

Shannon Lindsay 17:07

How do backups play into recovery from a corrupt database?

Steve Stedman 17:13

Backups, good backups, solid backups, are frankly the absolute most important thing that you can have to help with database corruption. So a lot of people will say, Well, I’ve got three or four days worth of backups, and that’s all I have space for. We can’t afford a bigger disk to store the backups on. Well, having good backups can help you fill in those holes during a corruption right there. So let’s say today’s Wednesday and on Monday you had corruption introduced that you lost some, some amount of data in your database because of that corruption. Well, if we could go back to Monday morning or Monday right before that corruption happened, and pull those missing rows out of that out of that backup, and put them back into your databases. Being repaired, you’ve got a much greater likelihood of being able to have 100% data recovery, but if you only have an occasional full backup, or you don’t have the ability to do a point in time restore, well, you’re really having your hands tied, and what can be done on that repair. A lot of the time, when we do repair, we’ll recover everything we can out of the corrupt database and then figure out where the holes are, what’s missing, and then bring those things in from one or more backups. And if you don’t have good backups, will we lose that ability to bring in and fill in those missing holes of data. So my recommendation on that is that, and this is whether you’re dealing with corruption or not, but it’s going to definitely going to help you with corruption, is that if you have full backups regularly, you have diff backups, which the diff backup is everything that’s changed since the last full and then you have log backups, and the log backup is everything’s changed since the last log backups. With a combination of those three, you can pretty much do a point in time recovery for any minute or even up to the second during your that window that you have backups for. Now, one of the key things that you need to do, though, is you need to make sure that those backups are on different storage than your SQL Server, data and log files, so that if data, if corruption is let’s just say you’ve got one of those basic setups where you’ve just got a C drive. And on your C drive, you’ve got your operating system, you’ve got your data files, you’ve got your log files, and you’re also writing your backups to the C drive. Those aren’t backups. And the reason I say that is that if something happens to that C drive, or even if it’s a C, D and E drive that are spread over the same spinning disk, or not spinning disk, but the same physical disk, well if something goes bad on that disk, all. Of those locations are going to be bad.

So, if you have local storage, we recommend your backups go to network storage somewhere. If you have network attached storage or a SAN, we recommend that your backups go to some other type of network attached storage that is different than that San. And some people say, Well, can I just write my backups to the same drive as my data files, and then copy those backups off to the network. Well, no, the problem is, if you have corruption that’s introduced by bad IO on that disk where your data is, if you’re putting your backups there, your backups are going to be broken or worthless as well. So you got to make sure those backups are on different storage. The other thing with backups is you really need to focus on the retention period. Now, lots of people I talk to when we’re dealing with corruption say, well, at least those that have backups. They say, Well, we’ve got three or four days worth of backups because that’s all the storage we have, or that’s all the disk space that we have storage for. Well, that’s not true, because usually, if you take a look at an alternate backup method, like, let’s say you’re backing up full backups daily for four days, and that’s taking up 100 gigs of storage. Let’s just make up some numbers here. Well, there’s different backup schedules that we can do that can take up that same 100 gigs of storage or less and give you a longer retention period. For instance, if we just turn on backup compression, if you don’t already have that turned on, well, that’ll give you faster backups and faster restores. And if you happen to say, well, I can’t do backup compression because I have transparent data encryption enabled, you need to look into this more, because if you have TDE on SQL Server, 2017, or newer, you can change some of your backup settings, so you will indeed get compression on your TDE encryption databases. So just by turning on backup compression, you may be able to get almost twice the amount of backup stored in the same space. But then you take it and you change it, instead of saying, Well, I’m going to do a full backup every single day, and those four full backups for all my databases equals 100 gigs. Well, what if I instead say I’m going to do a weekly full backup and I’m going to keep those for three weeks, so that’s going to take up less space than those daily fulls are going to take. And then we’re also going to compress on which is going to take up even less space. And then I want to do differential backups daily, and we’re going to keep those for, let’s say, two weeks. And depending on how much change is happening in your database, those two weeks worth of diff backups may be less than one full backup in space. And then we’re going to do log backups every five minutes, and we’re going to keep those for one week. Well, you might be able to do that three weeks of fulls, diffs and logs, and keep them for a much greater amount of time while still saving space overall, and can end up with three weeks worth of coverage on your backups rather than four days worth of coverage on your backups.

So the thing you really need to focus I mean, the question I always ask when we’re doing a corruption repair is, when was your last good backup before corruption was discovered? And it might be. I mean, the answers I’ve heard to that are the great ones are, well, corruption was introduced on Tuesday, and we have a backup from Tuesday morning before corruption was there. That’s about as good of an answer as you can get. Well. But the reality is, quite often I ask that question, and they say, well, one, we don’t know when corruption was discovered. I mean, we just found out it’s about it today, but it may have been there for a month because they don’t have CheckDB running regularly. Or they say, Well, we thought we were running daily backups, but those have been failing for four months because of whatever reason, and we don’t have any div backups, that’s kind of the worst case scenario. So with that, my recommendation, I mean backups are key in being able to get back to a good state. And we’ve even done corruption repair. Where the way we repaired it was we applied a full backup, we applied a diff backup, and then we just replayed the transaction logs up to the current point in time, and because all those backups were stored on different locations than the corrupt databases, we were able to get past the corruption and rebuild the entire database just by using those full dip and log backups. So yeah, backups are probably the most important thing in recovering from a corrupt database.

Shannon Lindsay 24:21

What role do indexes play in corruption repair?

Steve Stedman 24:25

Well, the interesting thing about indexes, and here’s the shameless plug, if you didn’t hear at the very beginning, I referenced our episode two before this, we talked about indexes as one of our most popular episodes yet. But the thing I want to point out with indexes is indexes change? Well, at least clustered indexes change the structure of the table. So if you don’t have a clustered index, your table is just in basically random order. It’s organized as a heap, and it just happens to be wherever data happened to fit when it was inserted into that table. Is where the table. That data exists on disk if you have a heap without a clustered index, and there’s problems with that, it’s much more difficult to recover the good rows out of that table if there is no clustered index on it, because oftentimes the heap will be if there’s corruption found, it will abort earlier through pulling the data out of the table than it will if you had a clustered index on it in a sorted order. Now, if you have a clustered index, you could say, give me all the data from the top down, and then give me all the data from the bottom up, and somehow find in the middle where the holes are, where corruption was introduced or it was missing. You can’t do that if you don’t have a clustered index. So a clustered index on a table generally will make corruption repair a little bit easier if we’re getting into an actual hands on repair. Now, the thing is, it’s a good practice, a best recommended practice, to have a clustered index on a table anyway. So from a performance perspective, everybody should have a cluster index on almost every single table with a very rare, rare exception. So what I would do is go look at how many tables in your database don’t have a clustered index, and figure out, well, how do you get the right clustered index on those tables? Now I would do that from a performance impact, not just for corruption repair, but if you have a cluster index on there, it will make corruption repair a little bit easier one day, typically. Now, the other thing with non clustered indexes. Non clustered indexes are super helpful, because a non clustered index is generally a copy of a number of columns that are often a different table. And that gives if we can query, just querying the non clustered index without ever hitting the corrupt table, we can get an alternate view of the table to determine how many rows might be missing or might be damaged. And those non clustered indexes are super useful for being able to figure out what’s missing and figure out what we need to fill in on the damage table in order to be able to fix it. Now with that, I wouldn’t recommend going adding a bunch of non clustered indexes just for corruption repair, but they’re great to have for performance, and if you’ve got a few non clustered indexes on your table, it’s going to give a few more options when we get into corruption repair, generally.

Shannon Lindsay 27:16

Yeah, like you said, there’s been a couple so far, even just middle of February, corruption repair emails that have come in and requests for help. What I think one of the major questions that comes in from those potential customers, or what are my options for repairing my SQL Server, corrupt database, like they come to you and they say, Alright, here’s the situation. What are my options?

Steve Stedman 27:46

That’s where, in a perfect world where there’s unicorns jumping around and rainbows and fairy tale, kind of environment, people might believe that all I have to do is run CheckDB and I’m going to get my database repaired. Well, if that was the case, nobody would ever call me in order to help with those corrupt database repairs, because it would just happen automatically. But the reality is, it doesn’t work that way. So with CheckDB, there’s really two different things CheckDB can do to repair it, to help repair your database. One of them is repair rebuild, where which is a parameter on CheckDB. But I generally don’t like to use CheckDB to run a repair. I will use check table to run a repair on a table by table basis. And repair rebuild is a great way to fix index, or, sorry, to fix corruption in a non clustered index, for instance. So, but you could also fix the corruption in that non clustered index, if that’s the only place on a table where corruption was encountered, by dropping the table and adding it. And then there’s the repair allow data loss option, which is, if you start Googling and it’s your first time ever doing a corruption repair, you’ll find somebody that tells you, Oh, just run CheckDB. Don’t do that. Run check table instead. But if you run it with repair, allow data loss, it’s just magically going to fix your data. Well, the reality is, and I also like to say Danger, danger, danger, when you start talking about repair, allow data loss. What repair allow data loss does if it’s successful, and it’s not even always successful, is it scans through your table and says, Hmm, here’s a data page, 8k chunk of data that has corruption in it. Let’s just throw it out. Oh, here’s another one. Let’s just throw it out. And it scans through your entire table and figures out, oh, here’s maybe 20% of your table that had corruption in it, and it just threw it all away. Well, what rose did it throw away? Gee, I don’t know. It doesn’t tell you. I mean, it just threw away a big chunk of your table, and you don’t even know how much of your table is done. Now, most business cases can’t deal with that type of uncertainty with just randomly deleted data. So we really recommend that you never run repair allow data loss unless you have. A good backup, a good copy of your data, and you’re just running it to see what could what? What could help fix that table? Now we’ve done probably over 100 corrupt databases that we repaired, and only one time out of those 100 plus repairs done has repair allowed data loss actually had a good outcome. And I was totally shocked when it happened, but it’s repair. Allow data loss is the worst advice 99% of the time. Now, if you can make a backup copy of your corrupt database, try it out and see well, maybe you’ll find out what can happen there and you can compare it. But we generally stay away from repair allow data loss, except for unless we have a copy of the database that we can work off with. Now, part of the thing we look at when we’re looking at, what are the options for returning corrupt databases, where is the corruption? And this is really important, because is the corruption in a user table, or is it in a system table, or some part of the database structure itself? Now, if the corruption is in a user table, well, you could simply drop that user table, add it back in, and pull the data back into that table from backups, and you’d probably be in pretty good shape. But if it’s in a system table, well there’s not a lot of ways to repair system tables. So usually, if the corruption is in a in a system table, that means we have to take on a very different repair strategy than if it’s in a user table, because system tables, it’s unlikely that you’re going to repair that database, and you need to usually start out by creating a new empty database and migrating all of your data to that new database, or going off of a backup and adding the missing data in, or different options like that. The other thing we look at is, when we’re looking at options for repairs, how many tables are corrupt? If you’ve got 4000 tables in your database and all 4000 of them are corrupt, that’s going to be a very different problem than if you’ve got 4000 tables and only three of them are corrupt, and that part of that comes into like, what’s it going to take to repair it? And we have tools that we’ve built that will help with that corruption, and it will help when there’s many tables that are corrupt. But still, if you’ve got one table that’s corrupt versus 4000 it’s a heck of a lot more work to repair those 4000 tables. The other thing we’re looking at is where exactly back to where is the corruption? If it is in user tables, is the corruption in non clustered indexes, or is it in the clustered index? And if the best case scenario is that all of your corruption is in non clustered indexes. Well, that’s easy. Just drop those non clustered indexes. Re add them, do CheckDB, and your corruption should be gone in that case. So, but the one thing the options here are, reach out to me, reach out to Stedman solutions, and we can help you with that corruption repair. And we offer a free 30 minute consult. If you’ve got a corrupt database and you just want my time for 30 minutes to talk about options, reach out. We’ll give you that. And there’s been even a couple of times where we found like corruption and non clustered indexes that we were able to do the repair with the client in that free 30 minute consult. But at the end of that 30 minutes, we’re generally going to know here’s where the problem is. Is this going to be a big issue to get it repaired? Is this going to be a small thing? Is this something we can repair? Is this something that you’re that’s hopeless and you should just look for a new job? I mean, I don’t mean to be too snarky on that, but sometimes, if your database is too corrupt, well sometimes the only way to fix it is with a new resume and go get a new job. Hopefully it doesn’t come to that, and we’ll do everything we can to prevent it from coming to that for you. But sometimes, if people don’t have the right backups, they don’t have the right check TV strategy and all that, it can lead to a game over situation for that database. But we will do everything we can to help before it comes to that,

Shannon Lindsay 34:02

Does TDE change anything in dealing with a corrupt database?

Steve Stedman 34:08

Oh, yeah, absolutely. So TDE is transparent data encryption, and I want to make sure when I talk before I tell you about the things that it changes in corruption repair, that I want to make it clear we recommend TDE for almost all databases, unless you don’t care about people stealing your data, and for anything in medical or finance or anything that’s confidential or HR related, you always want to make sure you have your databases TDE encrypted. Now, however, when you’re working with TDE, the way it works is that, in it encrypts everything that gets written to disk. So what that means is, if you’re loading a data page, a data page just an 8k chunk of data, and let’s say that data page has 50 rows in it, well, with TDE encrypted, you’re either going to get that. Page back, or are you going to get nothing? Because if that encryption has been messed up through corruption, well, it’s not going to be able to unencrypt any part of that page. The whole 8k chunk will be gone or unrecoverable in that case. Now, if you didn’t have TDE, which we generally don’t recommend, but if you didn’t have TDE, well, it’s possible that we would be able to pull back some of the data out of that 8k data page. If there were 50 rows in it, we might be able to pull back 49 or even 50 of them out of the page by reading it. But with TDE encryption, it’s an all or nothing thing. When you’re reading a data page, it gives a few less options to walk through in the corrupt page data to help us understand the corruption. However, that being said, if your question is, Should I be TD encrypted, or should I not? Because it might make corruption repair a little bit harder, always do TD encryption. If you’re asking that, because it’s like, well, it’s kind of like the question of, well, if you live in a place where you’re going to get bringing up burglarized, if you don’t lock your house? Well, I mean, you’re going to lock your house, and I think TDE is kind of like making sure you always keep your house locked. So, yes, TDE makes the repair a little bit more complicated. It does eliminate some of the options that we have without TDE, but we still would not recommend turning off TDE simply if you expect corruption instead, I would recommend focus on more backups. Focus on better backups, focus on a longer backup retention. Focus on how you can run CheckDB more often and have a learning on it, rather than the problems that will arise out of TDE.

Shannon Lindsay 36:37

You said you’ve been working with corruption for several years now. So within that, I’m sure you’ve come across some pretty interesting situations. What’s the craziest corruption repair that you’ve come across?

Steve Stedman 36:53

Well, okay, so first off, yes, we’ve dealt with a lot of seriously corrupt databases that we’ve helped people with we’ve had pretty good success on many of them, and however, with politicians and judges and police and a number of things like that, the term corruption has been a little bit, I don’t know, taken over and overused, perhaps, by the media and those people who don’t like The police, or those people who don’t like our legal system or our politicians or whatever. And so unfortunately, the term corruption as it applies to databases is very different to corruption as it may happen in the government or the legal system. Okay, now to be clear, we only do database corruption. We don’t do government repair. That’s Elon Musk out there chasing that one today, but it was so the craziest corruption we came across was it’s 2am on a Sunday morning, I get a call from someone, and we’re on the West Coast. This person have to be on the east coast. So it was about 5am their time on a Sunday morning, and they needed help with corruption. They’re in a situation where they need my help right now with fixing a corrupt database. And I don’t know if they didn’t know enough to understand what my help page was describing or how we do corruption repair, but it turned out that they had just been arrested by, in their terms, corrupt police officer, and that they were going to have to go before a corrupt judge for sentencing, and they needed me to get into the police computer system and action with the police and the judge. Well, whether or not the police or the judge were actually corrupt, is debatable. I mean, I don’t even know what location this was coming from, but even if they were completely corrupt, that’s not the kind of thing I had, that I do. I had to explain to them that that’s a different kind of corruption than we work on. And I just felt so bad for this guy, because he thought he got me on the line and I was able to get him out of all the trouble he was having here. And no, there was no hope in being able to well my perspective, and being able to fix this. And but the craziest part of it is it took me almost five minutes of his one phone call after being arrested to explain to them exactly why we couldn’t help with that, because what the difference is between what we do with corruption repair versus trying to fix our system. So crazy things happen, and that’s one of those that I’ll never forget, that even though it was one of those groggy 2am calls that someone got a hold of me to help them with their corruption. Yeah,

Shannon Lindsay 39:41

What is your overall recommendation in order to, I guess, avoid getting corruption?

Steve Stedman 39:52

Well, that’s a little bit the way you ask the question, though, to avoid getting corruption, well, there’s no way to truly. Avoid getting corruption. The question is, how prepared will you be when it happens? Now you can go and buy the best hardware and have the most redundant power and the best disks and the best support and whatever, and you still may come across a case where you end up with corruption. So my recommendation is, don’t pretend that you’re not going to get corruption. Be prepared for when you do get hit with corruption, starting with frequent backups, full logs and diffs, frequent, CheckDB daily, or twice daily, or even more frequently, if possible, get help early. If you go to Google and you find things like repair, allow data loss. Don’t do that, because that might reduce your options for repair. Turn on alerts for SQL server failure, specifically severity, 2223 24 and 25 as well as errors 823, 824, and 825, and that last three, errors 823, and 824, and 825, those will be the most kind of indicative that if you see those that’s like you’re on the Titanic and you just hit an iceberg, you better stand up and pay attention and figure out what you’re going to do to fix those right away. Otherwise you’re going to sink. Take these very seriously and run CheckDB as soon as possible to find out if your database is okay. After seeing those kind of errors, and make sure your backups are good when you see those as well. So the other final thing I want to say on that, as far as overall recommendations, ask for help early. I mentioned our free 30 minute assessment or consult. Reach out to us, and even if you call me after hours and say, I’ve got a corrupt database, I need help with it. I’ll figure out what’s the soonest time we can get that 30 minute, 30 minute analysis in to figure out what your options are. And a couple of times we’ve actually completely fixed the corruption in that 30 minutes. So the sooner you get us involved, the more options you have for recovery.

Shannon Lindsay 41:56

Yeah. How can I practice corruption repair and learn more about it?

Steve Stedman 42:03

That’s a good question. So as I mentioned the very beginning, in 2015 I created a database corruption challenge. And then a couple years ago, I took a lot of the content from the corruption the database corruption challenge, all 10 weeks of it, and repurposed it and updated it and added documentation and instructions and all that, and put it into a class called our database corruption repair course, available at stedman.us/corruption and for $19.99 Gosh, sounds like a late night infomercial, but you can, you can sign up for that class. You can download corrupt databases. You can work through it. You can compare your solutions to what other people have done, and just get take a stab at here’s a corrupt database. If this was your database, what would you figure out how to fix it? So I think that that wraps it up. Thanks for Thanks for CO hosting today, Shannon, and thanks everyone for watching, please join us next week or next week, we’re going to discuss the detecting and tracking of changes in your database schema with our schema drift database diff tool. And keep in mind that all of our episodes are available on YouTube. You can get their shortcut with stedman.us/podcast YouTube, and they’re also available on Spotify, with stedman.us/podcast Spotify, however, Spotify is audio only on YouTube is full video, so thanks for listening. Have a great day, and hopefully you’ll learn something about corruption today that you can take away and make your next corruption repair easier. Thanks for watching our video. I’m Steve, and I hope you’ve enjoyed this. Please click the thumbs up if you liked it. And if you want more information more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create.

Podcast Season 2 Ep. 6 Corruption

Stedman SQL Podcast Season 2 Episode 6 Corruption

Podcast Transcript