Podcast Episode 10 - RPO and RTO - StedmanSolutions, LLC

Hosts: Steve Stedman / Shannon Lindsay
Recording Date: November 27, 2024
Topic: RPO and RTO

Stedman SQL Podcast Episode 10 – RPO and RTO

Steve Stedman and Shannon Lindsay discuss the transition of their podcast from live streaming to pre-recorded episodes, available on YouTube, Spotify, and their website. They announce the upcoming release of a new database health monitor with features like VLF visualization reports and SQL schema search. The episode focuses on RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for SQL Server backups, emphasizing the importance of aligning these metrics with business needs. They highlight the role of transaction log backups, log shipping, and immutable backups in improving recovery times and points. The conversation also touches on the necessity of testing backups and maintaining open communication with management.

Podcast Transcript

Steve Stedman 00:15 Hey everyone, and welcome to this week’s live stream. Or no, it’s not live stream anymore. It’s the podcast. Welcome to this week’s Stedman SQL Podcast, episode number 10. I’m your host, Steve Stedman, and my co host this week is Shannon Lindsay, hello. Hey, Shannon, hello. So just to start with, one of the things I want to share is that our podcast is available on YouTube, on Spotify, and you can go to the main podcast page on Stedman solutions and find it there as well with links to these locations and all of the episodes. Also want to point out, this is one of the first podcasts that we’re doing without the live stream, meaning we’re recording it and then releasing it after we’ve done a little bit of editing. As we get to Episode 10, we’ve learned along the way that there’s a few things that work well and don’t work well, and things we can do to improve it. So we’ve decided, instead of doing the live stream, which we don’t get that many listeners on, we’re just going to release it on these locations on about a weekly basis and see how it goes. Seem to get a lot more viewers on YouTube and Spotify than we are getting on the live stream edition of it. So that’ll be you’ll get a better podcast that way, because we’ll build you a little bit of editing as well. Also keep in mind that for the full video presentation that’s available on YouTube, and that Spotify is only the audio version of the podcast. Some news we have. We’ve got database, health monitor release coming soon, with several new and exciting features. In a week or two, we’re going to have a podcast that talks about some of these features. The developers have been working very busy over the last month, and there’s some really cool new visualization, VLF visualization reports, so you can see how your log files are actually laid out. We’ve got a new SQL schema search as part of database health monitor that’ll let you search your entire schema, your store procedures, your tables, functions, everything like that for specific terms and keywords, makes finding things a lot quicker in the database and several other new reports. So that release should be out around the time that this podcast airs, and then we’ll be doing a database health monitor podcast about a week later on that so also want to throw out that if you want to be a guest on your podcast, if you have some SQL Server topic you want to share with our listeners, can visit our podcast page at stedman.us/podcast, and click on the podcast guest page. There’s information there and links so that you can schedule for one of our standard recording times. We used to have you email Shannon. You can still email Shannon at Stedman solutions if you want to be be a guest. We’ve got it set up with links now, so you can just schedule and pick a time that works for you, that’s on our schedule, and we’ll do a podcast together. Keep in mind, though, that we do have the final review on who gets to be on the show. So if you’re going to pitch something, pitch something good, and hopefully you’ll make it on the show. So this week’s topic is understanding RTO and RPO and for backups in your SQL Server.

Shannon Lindsay 03:27 What exactly is RTO and RPO?

Steve Stedman 03:32 Well, these are two terms that have come up. They’ve really come from, I think they evolved as a way to explain your recovery time objective and your recovery point objective. These are two really important things in managing systems, managing databases, things like that. So it’s interesting, because if you talk to like upper management, oftentimes what you’ll hear is, Well, we want to be able to restore any point in time from the database, and we want it to happen immediately. And you might talk to someone who’s actually doing the work, and they say, Oh yeah, we can get you a restorer of last Saturday. And that’s all we’ve got available. And it’s going to take us a week to get that back online if you need it. And there may be anything in between those so what recovery? What RTO, or recovery time. Objective talks about is how long it’s going to take to bring your system back online with the specific backup we’re talking about. And the RPO is, at what point do we want to bring that back online? And if you’re in one of those positions where you might be responsible for backups and doing the restores, and you discover that your management team maybe has a different view of how backups are going to work, or how they’re not going to work. This is a great way to communicate with them, where you can share the this is what we have and this is what we’d like to get to. It’s basically a way to be able to communicate. Status. It’s better to communicate this ahead of time, rather than after a failure, and find out that the RTO or the RPO does not meet the business objectives after the failure. It’s easier to find it out ahead of time. So what recovery point objective or RPO is, it’s basically when the system fails. What’s the worst case scenario for a point that you can restore up to. So it’s the point you can recover from. So let’s say you have a full weekly backup that runs on Saturday nights, and that’s your only backup that you have on your entire system. Well, you are your RPO, if you fail on a Monday, is going to be about a day and a half your recovery point objective, if you fail on midday on a Saturday, which is or late in the afternoon on a Saturday before that weekly full backup runs, well, you’re almost seven days of recovery point objective. Now, if you have log backups that are running every two minutes, for instance, or even every one minute, well then your recovery point objective, or your actually recovery point is one minute you can restore to any one minute increment, and then even sub minute, depending on whether you do a point in time, restore with that. And the objective is just really the goal that you want to have associated with that. And then that really, I mean, the main thing that you have around that is the type of backups and the frequency of backups that you can control what’s happening with that recovery point. What also associates with that is how much you can keep around. Now, you may have a recovery point objective or a recovery point of one minute, but how far back can you go on that? I mean, one minute might be you have 24 hours of one minute backups you can go back to, but you might not be able to go back to a one minute interval from three weeks ago.

Shannon Lindsay 06:49 So what examples, would you say for would be a good description of somebody that might meet might need a every minute.

Steve Stedman 06:59 Oh, so, I guess the thing I would look at that is, who or what systems do you have that the business cannot survive losing data, even a minute worth of data. So for instance, if it’s your bank or your doctor’s office or something like that, where, if I just do a deposit at the bank, and they have recovery point objective of a week and their system fails. Will they lose a whole week worth of transactions and a whole week’s worth of my deposits in that bank? That’s going to make me really unhappy. But if they have recovery point objective of 30 seconds or one minute, or something like that, and their system fails, well, they’re only going to lose those transactions that happened in the last 30 seconds to a minute with that, and that’s usually a close enough point in time that they can check with whatever clerk or teller took that deposit and get it sorted out. Think about like a health care facility where they’re prescribing medications and taking orders or getting test data back. Now, if you lose a whole day’s worth of data on who you prescribed medications to, that’s going to be a pretty bad scenario from a legal perspective, I believe. Now, if you lose 30 seconds or a minute of that, well that’s a short enough time frame that the doctor or the nurse or whoever’s doing that those medications will know that they can redo that, that data, and they don’t have to think back for the entire day. They only have to think back the last few minutes. Yeah. But then let’s roll into the recovery point of or recovery time objective, and these two work together. Now it might be that your recovery point objective is one minute, but if it’s going to take you seven weeks to get back to that one minute point in time, that’s not going to that’s not a very good scenario. So the recovery time objective is the worst case scenario for how long it will take the system to get restored and running again. And there’s some scenarios where you might have to go get data out of cloud storage, and you have to take an hour to download it, or do things like that to get backups from cloud storage, and that would slow down your recovery time. Objective. But if you have like what we do with a lot of our clients, where we have pre built restore scripts, where if there’s a failure, all we have to do is open up a pre built restore script that does the Restore, the full backup, the restore the differential backup, and the restore of all the logs to a point in time? Well, that’s a really quick way. And at that point, we’re not waiting for someone to go to Google and search and find out, how do I do a restore, and all those kind of things, we’re able to get the backup going really quick. So the recovery time objective is really the goal of how long it’s going to take to get it recovered, and the recovery point is how much data you’re going to lose. So an example, I mean, let’s go back to an example on the bank. For instance. I mean, let’s say that your bank had a failure and it had to restore from backup. Well, you can’t close the bank down for a week and wait. For your database to restore. If that’s what it’s going to take, you need to be back up and running quickly. And yeah, it might be that you have to keep manual transactions for a while in order to cover an hour or a half hour, or whatever the restore time is, but making sure that that recovery time objective meets the business needs is really important. Let’s jump into the relationship between RTO and RPO, and kind of keeping that like all in mind. So one of the things there that we want to consider is what type of backups you’re doing now, some people try and get an RPO and they do a full backup every six hours for a recovery point of every six hours? Well, me, I’d rather have a full backup once a week with a differential backup every six hours, because that’s going to run a lot faster. And then do things like transaction logs every minute, or every five minutes, or whatever the window is there that’s appropriate for you. Other things you can combine with this around your RTO and your RPO are things like high availability solutions and things like always on availability groups or log shipping. Now an always on availability group that’ll help you if your primary server fails, the other one, the secondary one in the group, takes over at that point. That’s a way to keep have almost instantaneous recovery point. Now that doesn’t help you if someone deletes all the data in a database, or if somebody drops a database, things like that, but other tools, like log shipping can help with that too. I mean, what log shipping is effectively doing is taking all your transaction logs and keeping them in a state of being replayed on a secondary server, so that in the event that that primary server fails, all you have to do is just say, turn it on and restore the database with recovery, and it brings that database back online. So you don’t have to wait for those full backups and log backups all to be restored. It’s log shipping and standing by there and ready, so that you don’t have to wait to do the restores. And that’s one of those. I mean, we’ve seen scenarios where with clients using log shipping, that they have a failure on their primary system, and we’re able to get them back up and running in a matter of minutes on the secondary system, rather than even looking for where do we go and restore backups they’re already being replayed on that secondary system. And then you want to keep in mind things like off site and cloud backups. Those are great for disaster recovery, but they’re usually not a really quick way to get data back. Oftentimes, when you put data in the cloud, like AWS cold storage or different Azure Blob Storage, it takes a while to bring it back to somewhere that you can use it. I know you can do a restore directly from there, but you’re looking at your internet speeds that get in the way of how long it takes to bring those back. So that’s an important factor. So you really need to look at, how do you balance the cost and complexity with the business requirements.

Shannon Lindsay 12:51 Yeah, everything costs money, right?

Steve Stedman 12:55 Yeah, absolutely, and that everything costs money. That’s an important thing, and that often times discussing different recovery point and recovery time, objective goals with management and discussing the cost with them that allows them to make different decisions on how much money are you going to spend on things so on to some of the tools and techniques. I mean, like I mentioned already, with transaction log backups for minimizing the recovery point objectives, but also log shipping, like I mentioned, and being able to do log shipping to move though to already be doing a restore of those logs. So all you have to do is activate the database if something happens, the primary is a great way to reduce that recovery, recovery time. And then keep in mind with transaction logs, even if you’re doing a transaction log every five minutes or every one minute, you can do a point in time recovery with those and say, recover up to 1041, and 36 seconds right before some catastrophic event failed. If you need to using those transaction logs, another important thing is to consider third party tools and monitoring solutions, and that’s one of the things that we do with database health monitor and with our managed services, is that we track when backups aren’t happening. We track when something breaks the restore chain, like an out of context full backup that gets moved and is not available, things like that, so that we can make sure the systems stay running. But the really most important thing is whatever you’re using for your backup and restore process is to be able to test it to make sure that you really can meet those goals. I mean, if you’re saying that it’s going to be a 30 minute recovery time, and you try it and it’s going to take you six hours to copy a file. Well, it doesn’t matter what you told people about that 30 minutes you’re stuck waiting for that file to copy.

Shannon Lindsay 14:50 Can you give me a couple examples of like, actual companies that you’ve worked with, not naming names, but just real life examples?

Steve Stedman 14:59 Yeah. What we’ve run into a lot of the time, and I’m going to take a medical system that we worked with where they had in place a daily full backup, and they thought that that was going to be good enough for a recovery point for their management and we started talking about it, and as soon as the management found out that they would lose an entire day’s worth of medical information, they needed something better than that. So what we did is we went to a recovery point objective that was almost zero. We went down to, I think, 30 seconds on our log backups, and then that allowed us to be able to have those log backups in a position where they could be restored quickly, and then having things scripted out fast storage. And more importantly, backups that were in a position that they had fast access from a secondary system allowed us to get a recovery time objective down to be very small. And I think where we’re at on that was less than 15 minutes in a catastrophic failure. We could do a full restore of the database and bring it back online to any point in time. Now, one of the important things in that too, is knowing who’s going to be doing that restore and what the process is, I mean, with that recovery time and making sure that people are trained and ready to be able to react to that. Yeah.

Shannon Lindsay 16:27 Do you think there were some lessons that they took away from that?

Steve Stedman 16:31 Oh, yeah absolutely. I think that the lesson that I mean, the biggest lesson that I’ve seen over and over again with different clients is communication, communicating what the reality is, what we can do right now, what our current recovery time and recovery points are, and then discussing those with management to understand what their objectives are and what they’d really like to have, and then figuring out, how do we take what we have today, which might not meet those objectives, and how do we get that to the point that we can meet those objectives, or at least get closer to those objectives.

Shannon Lindsay 17:04 What do you think are, what are some common issues and pitfalls, and how would you avoid them?

Steve Stedman 17:12 Well, I think a big pitfall on the recovery time is alerting, meaning if and a way to activate an emergency response plan there, meaning that, let’s say it’s Saturday, well, let’s just say it’s the holidays, and there’s several people out of office, and something fails on one of your core systems, but it might sit there for three or Four or five hours, being down before somebody notices it, if it’s around the holidays. So the biggest thing is kind of knowing that something is actually failing so that you can get people on it right away. Because even if you’ve got the best restore planning, you can know you can get that restored in five minutes. It doesn’t help you if it takes three hours before someone tells you it needs to be restored. Another one I’ve seen a lot of times is people who rely on backups without ever testing them. And I like to say that a backup it’s not about backups, it’s about restores. And unless you know you can do a restore those backup files, the backup file is worthless. And I’ve seen people where they’ve lost their entire system. And of course, this is not our customer, our regular customers, but they call us up after the fact and they say, Gee, something happened catastrophically destroying my database. And it turns out that my backups haven’t actually been running for the last eight months. Well, if you had a process that regularly tested those backups, well you’d know that they were failing, and that’s an important thing to do. The other thing is not accounting. I mean, a big pitfall we’ve run into is not accounting, accounting for hardware or network bottlenecks. And take the example of you’re doing great. You’ve got two data centers, you’ve got a data center in Seattle and you’ve got a data center in Phoenix, Arizona, and all your backups are sitting in Seattle. Your Seattle database failed, and now you have to copy all those backup files from Seattle to Phoenix in order to be able to do the Restore. If your network pipe across, your VPN or whatever you have between the two sites is not fast enough to transfer those backup files, you could be in a world of hurt at the time you have to do that restore. So that’s where we like to make sure that the backups are being written to, or at least copied to the location where you have to do the Restore, so that that’s using that bandwidth ahead of time, rather than at the emergency issue where you might be stuck waiting on that network bottleneck there. And really, I mean, the biggest pitfall, really, is just having something that’s not working in any part of the process, because you’ve never tested it. So Testing, testing, testing is the most important thing I can think of there. One of the areas that we like to get into is. Partnering, and we help a lot of clients with this through our managed services offering. And we also do other things, which I’ll mention in a minute. But this is a complex process in understanding your recovery time and your recovery point, and if this is your first time ever doing it, well, you might make some mistakes along the way, where, when you have someone who’s done it, whether it’s a seasoned DBA or a team like we have at Stedman solutions, where we can jump in and work with you, we might be able to help you get to that recovery time or recovery point objective, or at least assess where you’re at in a much, a much faster process than if you were trying to do it on your own for the very first time, and that’s one of those things that when we take on a managed service client, we want to understand what their objectives are up front, and in the first month, we want to make sure that we have backups in place to make sure we’re meeting those things. But we also offer, if you’re not a managed service customer, we can come and do a backup assessment, or we also have our mentoring service, which we offer, where you can buy four hours of time and from us at a time, and we just get in and work with you and help you through these kind of things. But we also have tools like, Database Health Monitor in order to assist in proactive planning, where we can go look at what the backup history is and things like that, so you can, under better understand what’s currently happening, so that you can come up with at least your current recovery time or recovery points.

Shannon Lindsay 21:27 What would you say is your biggest call to action out of all of it?

Steve Stedman 21:32 Oh, just figure out, go, go, evaluate and determine what your current recovery time and recovery points are, and then take those and determine, do they meet the business needs and not maybe it’s if you’re the person responsible for that, the person who’s going to actually be doing the restores, make sure you can test them and that you know what those recovery times are and what those recovery points are. If you’re higher up the chain and you’re the one making the decisions on what they should be, well, you should understand what the reality is of what they are today, and then figure out, how do you get closer to what the goal is that you really want? And sometimes that’s change in processes. Sometimes it’s changes in hardware. Sometimes it’s building out a second server or having a larger location for where you keep backups, things like that. But also it’s one of those things that if you need help, you can talk to us at Stedman solutions, and we can help with that. I think that we’ve got a lot of different I guess there’s a lot of resources out there in evaluating your recovery time and your recovery point, but really the best way to get to it is to just go try it. I mean, practice scenarios. I mean maybe with one of the clients we worked with, we would have a weekly disaster recovery planning meeting, and in that meeting, we would throw out scenarios. We throw out a scenario of today, at 10:57am, there was a power spike that somehow blew out the power of your primary data center, and because of that, you now have to go and recover at your secondary data center, work through the stereo. We’re not actually going to go do it, but we’re going to work through what would you do in that scenario? Is kind of and make sure that everyone in the room, everyone who’s going to be part of that process, whether it’s your most senior person on the team, or as the junior person that might be covering over the holidays. Make sure that everyone knows what’s involved, so that when it does happen, it’s not your first time ever doing it. So at this point, I think I just really want to emphasize how important this is, aligning these things with your business goals, and it’s always better to find out that maybe your recovery point and your recovery time doesn’t match the business objectives before that failure happens, because when that failure does happen, you can’t you can’t negotiate, you can’t work out better options. At that point, you’re just stuck with whatever you’ve got planned, and if you haven’t planned where you’re stuck with probably a pretty bad situation. So planning and testing are the absolute best way to avoid data loss in these type of scenarios. And I think at this point there’s, there’s a lot of different things we can look at. We have, I mean, we can help with this, but we also have a backup and recovery class, I forgot to even put that in the agenda. I’ve built a backup and recovery class that we have available, that you can purchase, and that takes you through a lot of the processes for doing these things, and it has a whole section about recovery time, objective and recovery point objectives. But the biggest thing I would focus on there is what type of backups you have and what’s your best way for getting them back any other thoughts or any other questions on recovery point or recovery time? Shannon,

Shannon Lindsay 24:48 No, I think you covered a lot of it. I think you’re right on communication, just in general is such a huge part of a. Success of any company, and just being able to have a plan in place, have everybody on the same page, and know there’s a lot of people that don’t handle emergencies very well, and I think that knowing the plan ahead of time and who to call, who to communicate with is the number one plan to have,

Steve Stedman 25:27

yep, yep, absolutely. And I think that making sure that when you communicate that you’re that everybody is completely honest about it as well. I mean, I’ve seen people where we talk about their objectives here, and they know quite certainly that their actual reality does not match the goals, but they just play along with it, hoping it’ll never happen. And that’s a real dangerous place to be in. So honest communication in a way that it’s designed to improve the process is the way to go there, all right. Well, I think that wraps up the RPO and RTO discussion. Let’s move into the ask Steve SQL questions that we have every week on the podcast, and the one that that’s this week that fortunately, is related to backups and recovery is what is an immutable backup. Now this is one of those things that I don’t know 1015, years ago, people didn’t really care too much about immutable backups, because there weren’t as many bad actors out there, like ransomware and things like that. And what an immutable backup is, is a backup that’s being written somewhere that cannot be changed for a certain amount of time. And what that means is, if I just have a network drive, let’s just call it the X Drive, and I’m writing all of our backups to the X Drive, and your computer gets hit with ransomware. One of the thing the ransomware will do is go out and encrypt a lot of those backups, or destroy those backups in a way that you can’t get them back without paying the ransom. Now what an immutable backup is, is a backup location where your backups are being written to that once they’re written, they’re locked, they can’t be changed, they can’t be they can be read, they can be used, they can be used for restore, but they can’t be modified or destroyed for a certain amount of time. So for instance, we have backups of some of my servers here where we back up to Azure, Azure Blob Storage. It’s set up as immutable backup, where everything that gets written there is stuck and cannot be changed for 60 days. What that means is any of the backups that are written there, even if we get hit with ransomware, we can go get those back, and we know they’re not going to be modified or changed in any way. I think immutable backups is one of those things that is critical in your backup and recovery process, because without immutable backups, there’s no way you can guarantee that you have an actual backup that hasn’t been destroyed in some way by ransomware seen too many times where somebody will say, Well, we had backups, but The ransomware hit those and encrypted those before it actually took out our our live server, and we can’t get any of our backups to even work. So immutable backups, oftentimes they’re associated with off site storage, but they can also be on site storage. There’s a number of backup appliances that you can have local, immutable backups as well, or immutable storage in a way that it gets written to and it cannot be changed for a certain amount of time, should be a critical part of anyone’s backup and recovery process. So hopefully we covered that one on to the holidays. Christmas is coming, and if you’re having trouble shopping for that SQL Server expert, or SQL Server programmer, we’ve got a great option here. It’s our join type socks, where you can go buy a pair of socks that has Inner Joins, left, outer joins, right, outer joins, FULL OUTER JOINs, cross joins, cross supplies unions, intersects, all those kind of things for joining tables on your socks, not only will they keep you warm at the holidays, but they’ll keep you informed. And if you’re on that next interview where somebody asks you, can you explain the difference to me between an inner join and an outer join? You can look at your socks and show them pictures. Kind of a nice cheat sheet there, but it’s one of those things would be great in somebody’s Christmas stocking. So that wraps up this week’s podcast. Join us next week. Next week, we’ll be talking about database, health monitor and the new features that the development team has just released. This is going to be an exciting one. You can watch the podcast on YouTube and Spotify, and you can go to stedman.us/podcast is a short link to get to our main podcast page. And also, if you want to be a guest, there’s a link there to fill out the form to be a guest. So thanks for watching that wraps up this week’s podcast. Have a great day.

Steve Stedman 30:11 Thanks for watching our video. I’m Steve, and I hope you’ve enjoyed this. Please click the thumbs up if you liked it. And if you want more information, more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create you.