Grab a blanket and get comfortable. It’s story time with Steve Stedman and Derrick Bovenkamp!
In this 10-minute video they share database corruption stories. From these stories, hopefully you'll learn what not to do and what to watch out for to avoid database corruption.
Transcription:
Steve Stedman 0:04
Let’s roll on to the corruption stories. So the first one we’re going to talk about here was one of the first really big corruption repairs that I did after Derrick joined the team. And for this one, we’re going to call this one Twice Fixed, because we really ended up having to do things twice on this. And what we ran into was that they had 80 tables that were corrupt. And all of the corruption was in recently updated areas, meaning it wasn’t in old records, it was all in records that had been put into the database recently.
Derrick Bovenkamp 0:36
You know, this was, I think, actually my first or one of my first corruption repairs with Steve. And one of the things that we ran into is, you know, we met with the customer and we started working, and about close to 36 hours in, we thought we were repaired. We ran another CHECKDB. And unfortunately for this customer, the database was so big that I believe CHECKDB was taking 12 to 18 hours to run. So we ran another CHECKDB expecting it to be mostly repaired, and noticed that new tables were corrupt that weren’t corrupt before, and corruption that we hadn’t fixed was back, but different. Something else was going on with this one.
Steve Stedman 1:24
Yep. And with this, we started on a Thursday night. And the customer spun up two virtual machines on Amazon AWS, so we had two really, really high-end, fast servers to work on to do the work, which really helped a lot. We did all the repair there. And then once we did it, we had scripts to reproduce the rebuild, and we reran them to fix everything in production. And that’s when we ran those that we found out that, yes, things were becoming corrupt again. And we started tracking it down as to why this was happening. And it was pretty bad. I mean, another table was becoming corrupt every few minutes. What did we end up finding was the root cause there, Derrick?
Derrick Bovenkamp 2:05
So I mean, this was one of the first times we worked together on corruption. And it was one of the things that really cemented us as a team, because I was able to take my systems administrator experience and start working directly with the customer. While Steve was still repairing some of the original corruption, we started going, you know, system by system. We went to their virtualization layer, we went to their storage, we went to their iSCSI, and we actually tracked it down to, I don’t think we’re 100% sure, either a bad twinax cable or a bad switch. A little bit more to that story later; I’m convinced there was something with the switch. They were able to replace the cable and upgrade the firmware on the switches, and the new corruption stopped happening. And we were able to finish the initial repair, plus the new tables that had become corrupt, and get them up and running that Sunday. It was quite a long haul.
Steve Stedman 3:01
Yep, it was Sunday night. It was late enough Sunday night that I missed dinner with the family, but it was early enough that I didn’t have to stay up late on Sunday night after being up most of the night Thursday night, Friday night, and Saturday night working on this. The takeaway on that is really to get a good baseline, and understand that when you do have corruption, don’t just run CHECKDB or CHECKTABLE once; run it multiple times, so that you know whether things are changing or getting worse over time. Our next one we’ll call Thanksgiving. This was a corruption repair that was reported on the Wednesday of Thanksgiving week, and it was unknown how long it had been corrupt. And of course, when people say it’s unknown how long it’s been corrupt, oftentimes we find out, well, there are no backups as well. Because if we did have backups, we could go look at a backup from a week ago or two weeks ago or 10 weeks ago and see whether the corruption was there at that point. But we didn’t have any information as to how long it had been corrupt.
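As a rough illustration of that baseline takeaway, here is a minimal T-SQL sketch of the kind of repeated integrity checks being described; the database and table names are placeholders, not from the actual repair:

```sql
-- Run the full integrity check and capture every error message, then repeat later
-- and compare the output to see whether the corruption is stable or spreading.
DBCC CHECKDB ('YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;

-- On a very large database, re-checking just the suspect tables between full runs
-- can be much faster than waiting 12 to 18 hours for another full CHECKDB.
USE YourDatabase;
DBCC CHECKTABLE ('dbo.YourTable') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```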
Derrick Bovenkamp 4:00
Yeah. And the unfortunate thing there for this customer is that it had actually gotten bad enough to the point where it had shut down their production and assembly. They were really at a standstill when they called us, and we were able to jump in and actually work over Thanksgiving to get them back up and ready to go.
Steve Stedman 4:20
Yep. And we were able to get it 100% fixed with zero data loss, get it back up to a clean system, and get it ready to go by Monday morning. And really, if this hadn’t been fixed, they would not have been able to run their production line, as I understand it, on Monday when things opened up after Thanksgiving weekend. Now, one of the things on this that was a big takeaway was that the corruption had happened at some point in the past and they didn’t know how long it had been there. So the key thing here is early detection, and understanding that the earlier you catch corruption, the more options you have for the recovery.
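One simple, hedged example of early detection: SQL Server keeps a record of damaged pages it has encountered in msdb.dbo.suspect_pages, so reviewing or alerting on that table regularly, alongside scheduled CHECKDB jobs, can surface corruption long before anyone notices it in the application:

```sql
-- List any pages SQL Server has flagged as suspect, most recent activity last.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,        -- 1-3 = I/O, checksum, or torn-page errors; 4-5 = restored/repaired
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date;
```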
Derrick Bovenkamp 4:54
Yep. The next one here we’re going to call Healthcare. And I don’t know what it is with Steve’s and my luck, but it seems like these corruption repairs always come around holidays. This one was just before New Year’s Eve, and the actual corruption had happened on Christmas Eve. They did have some notifications in place, but because it was over Christmas they didn’t have people checking them, and by the time they had figured out that something was really, really wrong, it was already a few days since the corruption happened.
Steve Stedman 5:27
Yep. And because of that, they could not run backups, and the backup process they had in place had removed their older backups. They were in a position where CHECKDB crashed real quick when we ran it. We were able to run CHECKTABLE, but they didn’t have any backups. And this was a situation where, as a medical clinic, they needed this information the next day in order to be able to admit patients and do patient care in their clinics. So we started this on Tuesday night at 6pm. Derrick and I tag-teamed it through the night, trading off on different tasks, and by 9am the next morning, I think they had a couple of clinics that started a little bit late because of this, but we had it up and running for them, ready to go the next morning. And this was one of those where the database was so bad off that we had to create a new, empty database and move everything from that corrupt database over to the new, clean database.
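A very simplified sketch of that “new empty database” approach follows; all of the names here (CleanCopy, CorruptDb, dbo.Patients) are placeholders, and a real repair also has to script out indexes, constraints, identity values, and permissions, and salvage tables where reads fail partway through:

```sql
-- Create a fresh database and pull whatever is still readable out of the corrupt one.
CREATE DATABASE CleanCopy;
GO

SELECT *
INTO CleanCopy.dbo.Patients       -- SELECT INTO creates the table in the new database
FROM CorruptDb.dbo.Patients;      -- reads the rows still readable from the corrupt copy
GO
```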
Derrick Bovenkamp 6:17
You know, this one was really rewarding for Steve and me, because the client wanted it by the next morning, and obviously we couldn’t promise them that. We told them we’d give it our best effort, but we figured it would be 48 hours. I forget what time it was at night, it might have been like midnight or 1am, when we thought 9am would become possible. But Steve and I made it our personal mission to hit that deadline, and I think we hit it within a few minutes, and they were up and running.
Steve Stedman 6:52
And this client had pretty good storage, and I think that if they hadn’t had as fast of storage as they did, there’s no way we would have made it by that 9am deadline that they had. So that was a big win for them. This next one we called VM Host Issues, and what we ran into here was that there were several, like a couple dozen, tables that were corrupt. Basically we went through and used the normal processes to find the problems, track them down, and bring back all the corrupt data. We had to pull some stuff in from backups and whatnot, and we did the repair. And after doing the repair, we ran CHECKDB and started to see corruption in other tables.
Derrick Bovenkamp 7:36
Yeah, and this one was a small enough database that we were able to run CHECKDB multiple times, and every time it ran it was giving us back something different. And one of the things that was really, really hard working on this one, and I think Steve and I use it as a good example, and it’s something for you all to take back if you ever experience corruption, when you’re working with your management: corruption can put your database in a really precarious place, and just because it’s working, or mostly working, doesn’t mean that you can keep operating like that. And, you know, the client here really didn’t want downtime, and Steve and I had to work really hard to convince them that it was something that needed to be fixed, and needed to be fixed quickly.
Steve Stedman 8:27
And on this there were two types of things that we were seeing. One was what I’d call real corruption, where the data was definitely bad on disk and we had to go figure out what it was and get it fixed. And then the other was what I’d call false, or fake, corruption, where when you ran CHECKDB and it read from disk, it was reporting corruption, but it was really having read errors, so what was on disk was actually fine. It was because of the issues that the virtual machine was having, and that’s what we eventually tracked it down to: the VM was having difficulty on the Hyper-V host. We shut that VM down and moved it to a different Hyper-V host, and the storage was network attached storage, and immediately after moving it, everything started working just fine. So it was really an issue with that VM being able to read from the shared storage, or the network attached storage, correctly, and by moving it to a different host, the issue went away. But then we thought, well, was it simply because we shut it down and restarted it that it went away? So what we did is we took one of the test SQL Servers and moved it back over to that Hyper-V host where the issue was happening originally, and when we moved that over there, the test server immediately started having the exact same issues. So it was something to do with that Hyper-V host. We never found out what was going on with that host, but we do know that running it on other hosts took care of the problem completely.
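As a hedged illustration of telling real corruption apart from I/O trouble like that, the SQL Server error log is one place to look: error 825 means a read only succeeded after retries, and 823/824 are hard I/O errors. xp_readerrorlog is undocumented, so the parameters shown here (log number, log type, search string) are simply the commonly used ones:

```sql
-- Search the current SQL Server error log for read-retry (825) and hard I/O (824) errors.
EXEC master.dbo.xp_readerrorlog 0, 1, N'825';
EXEC master.dbo.xp_readerrorlog 0, 1, N'824';
```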
Derrick Bovenkamp 9:48
Yeah, and I believe we did eventually get it fixed by getting the firmware on the switches upgraded and the host rebooted. This was when the host had been up for 600-some-odd days. It also, coincidentally, had the same make and model of iSCSI switch as the other customer, the Twice Fixed customer. That will lead into one of our suggestions for avoiding corruption later in the presentation.
Steve Stedman 10:16
Yep. So these are really just four of the many corruption repairs that we’ve done over the last several years. But these are four that kind of point out specific issues. Hopefully this never happens to you, but just keep these in mind if it does.