Watch / Listen Now – The Business Continuity episode
- Host: Steve Stedman
- Topic: The importance of having a business continuity plan
- Recording date: November 20, 2024
- Watch on YouTube: https://youtu.be/NwC3JYJelb0
- Watch on Spotify: https://open.spotify.com/episode/5y9X6I1OF8LJgXT3QBiicJ
Steve Stedman discusses the availability of the Stedman Solutions SQL podcast on YouTube and Spotify, and announces November class promotions with discounts up to 60% off. He emphasizes the importance of business continuity planning, especially in light of recent storms in the Pacific Northwest. Stedman outlines potential threats, such as hardware failures, natural disasters, cybersecurity threats, and human error, and stresses the need for risk assessments, backup strategies, and regular testing. He also highlights the role of managed services in ensuring business continuity and shares real-world examples of disaster recovery and consultation stories. The episode concludes with a discussion on best practices for shrinking SQL Server files.
Podcast Transcript: Business Continuity
Steve Stedman 0:16 Hey everyone, welcome to this week’s Stedman Solutions SQL podcast. This is episode number eight, and I am your host, Steve Stedman. One thing I wanted to share as we start the podcast is that we now have the podcast available on both YouTube and Spotify. However, on Spotify it’s audio only, so there’s no video. We’ve got a couple of short links you can use to get to those, as well as our main podcast page at Stedman Solutions. The links are stedman.us/podcast YouTube, stedman.us/podcast Spotify, or stedman.us/podcast. Any of those will get you to where you can view or listen to our podcast, depending on the platform.
All right. So, some news: this month, November, as I mentioned last week, we have our crazy class promotions at Stedman SQL School. If you visit SteveStedman.com, you can see some blog posts with the latest class promotions, with 50 to 60% off on some of the classes. Next week being Thanksgiving week, we have some crazy deals happening on Thanksgiving, Black Friday, and Cyber Monday. Some of those deals will be 30% off the Database Health Monitor bundle, 25% off our SQL DBA and developer interview class, 40% off learning common table expressions in SQL Server, and even bigger deals on Black Friday and Cyber Monday. So if you’re looking for a really deep discount on a class, check in on Black Friday and Cyber Monday. That’s when we’ll have some of the biggest discounts. Again, you can go to SteveStedman.com and look for the latest blog post talking about that.
Also, during this live stream, feel free to ask questions, and at the end of the session we’ll do our best to answer them. Just put your questions into the YouTube chat. For those who have questions after the live stream is over, you can always reach out to us at Stedman Solutions or post a comment on YouTube, and we’ll see what we can do to follow up.
Next, do you want to be a guest on our podcast? Do you have some SQL Server topic you want to share with our listeners? You can reach out to Shannon at StedmanSolutions.com. Shannon is my assistant, and she can help get that scheduled.
This week’s topic is business continuity: what is your business continuity plan, specifically relating to SQL Server? This is one of those things that’s near and dear to me because it’s so important. We’ve seen businesses that have failed, and we’ve seen businesses that have really succeeded with this. The importance of business continuity is being able to keep your business running, whatever it is that you do, whether it’s manufacturing, some website system, or something in finance, during different types of outages, from disasters to other things that may happen along the way. Part of it is minimizing downtime, ensuring data availability, and protecting operations.
Now today, it’s a pretty stormy day here in western Washington, where we’re broadcasting from. Last night and early this morning, a big storm came through that the weather people called a cyclone bomb. I think it’s just a Pacific hurricane, but there were winds south of here around 80 miles an hour sustained. It knocked out power for more than half a million people around the greater Seattle area. I bet a lot of those people today are figuring out that they may not have the best business continuity plan if they relied on the power grid to keep their systems running, keep their office running, and do business, whatever it may be.
Business continuity really comes down to figuring out what the potential threats are and how you address them. I work out of my home office, and yeah, there are a lot of weather, storm, or flooding related threats here. If we lose power, that’s an issue for me, but I have a generator. I can go start it up and get power back on if I need to. For internet, I have a cable modem here. If I lose internet, I’ve got a backup: my T-Mobile hotspot.
I can use that business hotspot, so I’ve got a backup plan there. If I lose a computer or something, I’ve got backup plans for all of that. Now, I used to have a really great business continuity plan, which was my boat. The boat has a diesel generator on board. It has room to work, it has food, everything. If something happened to my house, I could go and work off my boat, with Starlink internet as well as T-Mobile internet on the boat. But I put my boat up for sale, and it’s now in a lot for sale, so I can’t really use that option.
Everyone has different business continuity things that they have to worry about. My boat solution, or me firing up a generator at my house, is great for getting me back online so I can connect and work with my clients, but it doesn’t help with servers. And yeah, I’ve got a bunch of test servers here at my home office, and if those go down, it doesn’t really impact business. But if you had a production system, you wouldn’t want to be running it out of something with power as flaky as a home or a normal office building.
So that’s where it comes down to addressing potential threats to business continuity, looking for things that could happen to anybody, anywhere. Take hardware failures: you could have a disk go bad in your server, a server could crash, or you could have something as simple as a power supply for your server going out. How do you plan for those kinds of things? For hardware failures, you can plan with things like redundant hardware, and we’ll get into other stuff like that, but the idea is more than one server to do every job, so that if one server fails, you’re not out of business. Good backups are important for that too. Then there are natural disasters: floods, earthquakes, fires, storms, things like that.
Right now, looking out, I’m in a position where, if this storm picks up, I could lose power here, and I need to be ready to handle that.
Cybersecurity threats can be threats to your business continuity too: ransomware attacks, data breaches, and hacking attempts. I’ve seen it too many times where I get contacted by somebody who has already been hit by a ransomware attack. The first contact with us is, “Help, I got hit with a ransomware attack, and I’ve lost my database. How can I fix that?” And sometimes the answer is, bummer, you can’t fix it. But we work with all of our clients to help make sure that they’re never in that position.
Then there’s also human error: things like accidental deletion, configuration mistakes, or untested changes. A while ago, I was on my boat for the weekend, and I got a call from a client, and they said, “Help, somebody accidentally updated a table and they forgot the WHERE clause.” That’s one of those typical things that you hear about but don’t actually see very often. They had updated a table and changed some value, like a price on a product, on every single row in that table, and they didn’t know how to fix it. I got on, I was able to restore from backup and get that one table back, and then I updated just the one column that they had changed and got them back up and running again. Human error is as big a threat as hardware failures and things like that. No matter how good the humans working on your system are, there’s always a chance that somebody can make a mistake.
And then there are SQL Server specific risks that you want to consider when you’re talking about business continuity, things like database corruption. And hey, here’s the plug: next week, Derrick Bovenkamp is going to join me. Derrick and I are the team at Stedman Solutions that does all of the database corruption repair.
So next week we’ll be talking about database corruption. But keep in mind, when you’re planning for business continuity, that database corruption is one of those things that can knock your business down. There are other things you might not think of, like unoptimized queries leading to performance bottlenecks. Think about this: what if your SQL Server is crashing to the point that you have to reboot it because of poorly performing queries? That’s going to really impact your business. Or patching gone wrong: what happens if you do an update and it knocks out your SQL Server, and you have to reinstall SQL Server or undo that update? Those kinds of things can all impact your overall business continuity. Keep in mind, business continuity is doing what you can to minimize downtime, ensure data availability, and protect operations.
Now, some of the components of building your business continuity plan involve a risk assessment. The risk assessment is one of those things that you have to be really honest about. You can’t lie or fake it on your risk assessment and then expect to come up with any kind of realistic business continuity plan. Part of the risk assessment is making sure that everybody involved understands that this is private and confidential, and that it’s being done in order to fix your business. You have to be in a position where you can identify those critical systems and processes and figure out how you can make them redundant, or things like that. You need to evaluate threats and vulnerabilities, and you can do a business impact analysis, where you go through and estimate the financial and operational impacts of downtime.
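A business impact analysis of this kind can be sketched as a small calculation. The Python below is only an illustration under stated assumptions: the server names, the hourly figures, the outage lengths, and the flat cost-per-hour model are all hypothetical and are not numbers from the episode. Real downtime costs are rarely linear, so treat it as a starting point for the conversation rather than a model to rely on.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Server:
    name: str
    revenue_per_hour: float     # hypothetical: revenue that stops while the server is down
    staff_cost_per_hour: float  # hypothetical: cost of staff who cannot work meanwhile


def downtime_cost(server: Server, hours: float) -> float:
    """Estimated cost of an outage, assuming a flat hourly rate (a simplification)."""
    return (server.revenue_per_hour + server.staff_cost_per_hour) * hours


def prioritize(servers: list[Server], hours: float = 24.0) -> list[Server]:
    """Order servers from most to least costly for an outage of the given length."""
    return sorted(servers, key=lambda s: downtime_cost(s, hours), reverse=True)


# Illustrative outage lengths, expressed in hours.
OUTAGES = {
    "10 minutes": 10 / 60,
    "1 hour": 1,
    "5 hours": 5,
    "1 day": 24,
    "1 week": 24 * 7,
}

# Made-up inventory: one server that earns the money, one that could wait.
servers = [
    Server("reporting", revenue_per_hour=0.0, staff_cost_per_hour=150.0),
    Server("orders", revenue_per_hour=4000.0, staff_cost_per_hour=500.0),
]

for server in prioritize(servers):
    for label, hours in OUTAGES.items():
        cost = downtime_cost(server, hours)
        print(f"{server.name:10} down for {label:10} ~ ${cost:,.0f}")
```

Sorting by the cost of a one-day outage is an arbitrary choice; ranking by a different outage length, or by non-financial impact, may change which server comes first.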
Now let’s say you’ve got a business with a handful of SQL Servers, and you’re doing whatever you do in your business. The thing you need to evaluate is this: pick any one of those SQL Servers, or even all of them, and ask, if those SQL Servers were down for 10 minutes, how would that impact the operations of your business, both financially and in other ways? What if they were down for an hour? What if they were down for five hours? What if they were down for a whole day? What if something happened to all those SQL Servers and your other systems, and they were down for a week? What would that mean for your business, financially and in terms of the other impacts of the downtime?
Once you’ve identified those things, you also need to go through and prioritize systems based on the business needs. Let’s say you’ve got six SQL Servers, and you look at them, and for five of them you say, “Okay, we could run the business for a week without those,” but there’s that one important SQL Server that’s needed to do everything related to making money in your business. Well, you’d better make sure that you focus on that one SQL Server rather than the five others.
The other component of that assessment is, what are your recovery strategies? There were some people just across the county from us where, with the wind storm, some trees came down and destroyed their whole house. What if that was your office building, and something happened that destroyed it to the point that your servers were gone? What are your backup and recovery options? Is that something where you can restore from off-site backups and get your system running again in a few hours? Or is that something that would knock you out for a month or more?
We’ve seen that happen with businesses where they don’t have the right backup and recovery options and they aren’t able to get things running again. You also want to look at what your high availability solutions are, such as Always On availability groups, failover clustering, replication, log shipping, things like that. What do you have in place so that if you have a small disaster, like a single server failing, you can fail over and run on another server? That’s different than if your whole data center or your whole building is gone.
And then the most important part here is conducting testing and training, with regular disaster recovery drills. What happens if you’ve got the greatest disaster recovery plan, the greatest business continuity plan, and nobody knows how to use it because nobody’s ever practiced it? One of the things I like to do when we have regular team meetings with IT groups or a SQL Server team is to discuss a scenario right there in the meeting. Just make something up. Say Server X is gone. The whole server is fried, and all the hard drives are toast. What are you going to do to resolve that? Let the team talk through it and work through it, using those business continuity plans or disaster recovery plans, to make sure they’re well tested and well practiced. Then the first time your team is actually faced with a disaster, you’re in better shape, because you’ve practiced it and you’ve practiced it and you’ve practiced it.
So, the importance of SQL Server in your business continuity. This is our Stedman Solutions SQL Server podcast, and we’re in the business of working with SQL Server, so that’s our focus on business continuity. We like to look at which SQL Server is the most mission critical for an organization, and that might be one or more servers.
Then we look at how we make sure that SQL Server, if it is that important, is always operational, or as close to always operational as possible. Do we have backups so that if that main location or that server were gone, you could go and restore them to a SQL Server in the Azure cloud temporarily, or Amazon RDS, or something like that?
There are a lot of impacts that we’ve seen from different server outages. I’ve seen prospective clients that have called me up and said, “Hey, this happened, and we lost all of our SQL Server data, and we don’t have a good backup. What can we do to get it back?” And the answer is, gee, well, you’re not going to get it back. That’s a pretty big business impact. Then there are the clients that we take care of with our managed services, where we make sure they have good, solid backup plans. They have a disaster, and they call us and say, “This is what happened,” or we automatically detect it through our systems, and we have something in place to get that back as quickly as possible. Sometimes that means you’ve got an outage of five minutes when something catastrophic happens. Sometimes it’s a little bit longer, depending on the specific needs of that client and of those servers.
Now, the solutions that we look at for SQL Server business continuity all relate to this question: if something happens to your primary location, what are we going to do to flip over and use some other server at some other location? An example of that is off-site and cloud-based backup storage, but not just the backups off site: a complete other server off site. We use things like log shipping in order to have a server running on standby that’s within a couple of minutes of being as up to date as the primary.
So if you lose your primary, you’re able to turn that secondary server on at the other location. A lot of people look at log shipping as old and outdated, but in a lot of environments, it’s a really good way to have a server on standby. With one of our clients, we got them set up with log shipping, and before we had even tested the whole log shipping environment, they had an outage at their primary site. We were able to turn on their secondary site using that log shipping in a very short amount of time and get them back up and running. That’s the type of disaster where, had it happened prior to log shipping being in place, they would have been down for a couple of days while they built new servers, reconfigured them, reinstalled SQL Server, restored backups, those kinds of things. But with log shipping, we got them up and running very quickly. We’ve done that multiple times with multiple clients.
There are also other high availability solutions that take care of minor disasters and smaller things, like Always On availability groups. That’s one that we love working on with clients, because it allows us to take a server out of the group and do maintenance or repairs on it, and even if one of those servers fails, you’re able to keep running your SQL Server in that availability group without downtime. Our clients that have Always On availability groups have very, very little downtime compared to people using SQL Server without availability groups. There are a lot of other disaster recovery strategies we can work on with that too, things like geo-redundant databases and failover plans to cover regional outages.
I want to share a story here. Well, we’ll get to stories in a minute, but first, one of the things that we want to talk about is the role of managed services in ensuring business continuity.
Now, for most companies out there, unless you already have a full-time database administrator who specializes in business continuity, you’re probably not in a position where you have the right expertise in house to build out a business continuity solution cost effectively. So the benefit of managed services for this, and that’s one of the things we help with, is partnering with SQL Server experts. We’re not people who are taking our first attempt at doing a business continuity plan. We’ve done this many times with many clients, and we have tools in place for continuous monitoring, proactive issue resolution, and things like that. We have the expertise so that when things go wrong, it’s not, “Gee, let’s go to Google and search for how to fix this problem.” These are problems we’ve seen before and that we’ve dealt with.
Specifically, one is database corruption, which, again, will be next week’s podcast, with Derrick Bovenkamp and me talking about database corruption and our experience doing repairs on corrupt databases. That’s one of those things where not everybody has that experience. Making sure that you know how to resolve it if you do get a corrupt database is an important part of your business continuity plan, and calling us to introduce yourself once you’ve already got a corrupt database is not the most optimal way to do it. We’ve seen, not clients, but people who’ve called us up after they had a corrupt database, where they’ve lost everything and haven’t been able to get it back. Compare that to clients that we’ve worked with, where they hit corruption somewhere along the way and we were able to restore databases larger than a terabyte and get them back online with less than an hour of downtime, because we had the right tools in place. So just think about that when you’re looking at your business continuity plan.
Stedman Solutions managed services is a great example of how we can help with that.
Now, an important part is testing and updating your business continuity plan. Test it regularly. It might be quarterly or monthly, but make sure you’re testing that plan regularly and reviewing it. And when things change in your business (“Oh, we got rid of this server, we added that server, now this new server is more important”), you want to make sure that the business continuity plan reflects those changes. Use those tests to simulate disaster scenarios. Put on your evil IT hat and figure out the worst thing you could do to impact this disaster recovery plan or this business continuity plan, then throw that at your team and let them take a look at what they can do to resolve it.
Now, there are some real-world examples I’m going to use here. One of them is a former employer that I worked with 15 to 20 years ago. This was a company that was growing. They were doing really well, but upper management didn’t know a lot about IT and managing SQL Server, or even managing data centers. They had a single data center in the building they were located in, and it was what I learned later to refer to as a Fisher-Price data center, kind of like “my first data center,” where it’s just kind of a joke of a data center. They try to put some computers together, they try to put some power together. But what is it that a data center is supposed to provide for you? Well, obviously, security, power, internet, cooling, those kinds of things. Everybody thinks, “What happens if I lose internet? What happens if I lose power?” and they cover that really well. But this place never covered what happens if we lose air conditioning. We started getting alerts on our servers that they were overheating. It was in the same building, so we ran downstairs one floor and into the data center.
It was over 100 degrees in the data center, and they had no way to control the heat. The only thing that we could do to keep our servers from frying, because they were overheating at this point, was shut them down. So what did we do? We shut all of our servers down. We took the core primary servers we needed to keep the system running out of the data center, we threw them on a fold-out table in our office, and we got them running as quickly as we could. It took four or five more days for that data center to get their cooling fixed. In the meantime, a lot of their clients had fried hard drives and motherboards because they were so hot. Servers aren’t made to run in 100 to 120 degree temperatures for extended amounts of time. That was one where we didn’t have a disaster recovery plan or a business continuity plan at that point. Our plan was: grab the servers, move them, plug them in somewhere else, change our DNS, stuff like that, and get the system running again.
Okay, so at that point, that company decided that this Fisher-Price data center, as I referred to it, was not good enough. So they went and found a higher-end data center in Seattle, just down the road, about 90 miles away from where we’re at, and moved everything over to that data center at significant expense. This new data center had great cooling and redundant power. They had power from two different substations coming into the building. They had multiple massive generators. They had massive battery banks and all that. We were in that new data center for probably about four or five months, and then they had a problem. They were doing a test where they were switching from grid power over to generator power. There’s this big switch that moves when they do that, and something arced, and that arc welded the switch in the open position.
That meant this data center was not able to take any power from the grid, and because of how everything had failed, they were not able to take any power from their multiple generators either. They had a battery bank, and they didn’t bother telling the customers that something had failed until their battery bank was worn out and all the servers basically just shut down. Everything powered off in the data center. That was, again, another one where we ran. We loaded our primary servers into a truck, brought them back to the office, and plugged them in. That was a horrible business continuity plan.
That was the point when I was able to convince the management of the company that there is no scenario where a single data center will meet all the needs of business continuity for what they needed at that point in time. So we went and found a data center in Denver and a data center in Phoenix. It took us about six months to get this all done, but we built out multiple data centers with redundant servers. At that point we were using log shipping, and in the event of a single server failure, or, sorry, a complete data center failure, we were able to get the whole system running at the second data center in about five to eight minutes, because we had built it out correctly. We didn’t have to depend on the reliability of any one site. I mean, we got good data centers, but we knew that no matter how good your data center is, you can lose one of them at any point in time, and now we could still keep the system running.
That was very expensive, and it took adequate planning. But in today’s world, there are a lot more solutions available that don’t require a data center in different cities. You can use Azure or RDS, where you can get geo-redundant servers, or servers that have backups or copies in multiple locations, in order to build out those business continuity plans.
So at this point, that rounds out my overall business continuity discussion. I have a call to action: I have built a survey form. Let me get to the next slide here. If you go to stedman.us/continuity, there’s a free assessment of your business continuity status. Or you can go to StedmanSolutions.com, drop down the Home menu, and there’s an option there for the business continuity survey. What this survey does is go through and ask a bunch of the hard questions about you, your business, and your plans for what you’re going to do in a disaster. You then get a report at the end by email that summarizes, scores, and ranks your situation, and gives you ideas for things you can proceed with in order to be in a better position on your business continuity.
All right, that wraps up our section on business continuity. I don’t see any questions now, but I’ll come back and look for questions on the live stream here in a moment.
What I’m going to jump into now that we’re done with business continuity is the Ask Steve SQL topics. Keep in mind you can email Shannon at StedmanSolutions.com with any topic related to SQL Server, and we’ll see what we can do to answer it on the podcast. The question this week is: what are the best practices for shrinking files on SQL Server?
Now, this is one of those things that depends on a lot of things. There are some people out there who will say, “No, never shrink files, ever.” Well, the fact is, there are situations where you may have to shrink files on your SQL Server. But the thing you want to do is only shrink files as a one-time operation after some significant event. Let’s say you’ve got a database file that’s grown large, you delete half the data out of there, and you don’t expect that to refill anytime soon. That might be a good reason to shrink your data file. Now, if you clear up 10% of your file and you say, “I want to shrink it just to get that 10% back,” but your database is growing, well, you’re probably going to fill that up again anyway, so it seems kind of wasteful to shrink it.
The other thing you want to do is target specific files. Data files and log files are very different. Although the commands for shrinking them are the same, there are different processes to use. With a log file, the way it grows is that if it’s in use, it just grows and grows and grows, and you can only shrink things off the end of that file. With a data file, you may have to move things around in order to shrink the end of the file. So you want to make sure that you’re taking the right approach for each, using DBCC SHRINKFILE and not DBCC SHRINKDATABASE.
We had an example just a couple of days ago, where a client had a couple dozen databases on their server. They did a migration that changed a lot of data, and that caused their log files to all get bloated to the point that they were starting to run out of disk space on that server. They didn’t have an option to quickly add disk space to that server, so shrinking the log files was the right thing to do, because they had grown for one specific event that’s not likely to happen again. I say that because I’ve given them some ideas on how to keep the log files from growing so big in the future, and by shrinking these extremely bloated log files they’re freeing up space that’s not needed for regular runtime.
Now, if you do shrink your data files, which generally we’re going to push you not to do unless you’ve got a good reason, like a significant amount of space having been cleaned out of them, keep in mind that fragmentation can occur. You want to make sure that you fix that fragmentation by rebuilding or reorganizing indexes after the shrink.
Another thing to do, so that you don’t have to shrink files so often, is to adjust the auto-growth settings on your database so that the files grow at an appropriate rate. One of the things added to Database Health Monitor in the last update or so is a way to look at your data and log files and see at what point in time they actually grew. It’s a really interesting thing to take a look at, to see how those files look over time.
If you are at the point where you do have to shrink a file, make sure you shrink it during a downtime or after hours when there’s low traffic.
It’s easier to shrink the files when there’s less going on in them. And if you’re going to shrink them, sometimes it’s better to do it in a loop where you only shrink a small amount at a time, maybe 25 megabytes in a chunk, rather than trying to shrink off a gigabyte at a time. Depending on the size or layout of the file, with data files that might be the way it has to be done.
Also consider alternatives. You don’t necessarily have to shrink your files. You can address storage issues by expanding disk capacity or implementing different archive strategies so that those databases don’t get so bloated over time. And test first: always make sure you’ve got a way to test this in a non-production environment, to make sure that it’s not going to have a significant impact. Only shrink your files when necessary. Shrinking is not a routine maintenance task; it’s a last resort when something unusual has happened.
All right, that wraps up the Ask Steve questions.
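The chunked shrink described in this answer can be sketched as a short script that only generates the statements, so they can be reviewed and tried in a non-production environment first. This is a hedged illustration, not the tooling used on the podcast: the logical file name `SalesDB_log` and the sizes are hypothetical, and the only SQL Server command assumed is `DBCC SHRINKFILE (file_name, target_size)`, where the target size is given in megabytes.

```python
def shrink_steps(logical_name, current_mb, target_mb, chunk_mb=25):
    """Yield DBCC SHRINKFILE statements that step a file down chunk_mb at a time.

    Nothing is executed here; the caller decides where and when to run the output.
    """
    if chunk_mb <= 0:
        raise ValueError("chunk_mb must be positive")
    safe_name = logical_name.replace("'", "''")  # escape quotes for the T-SQL literal
    size = current_mb
    while size > target_mb:
        # Never step below the requested target size.
        size = max(size - chunk_mb, target_mb)
        yield f"DBCC SHRINKFILE (N'{safe_name}', {size});"


# Hypothetical example: a log file bloated to 1100 MB, brought back to 1024 MB.
for statement in shrink_steps("SalesDB_log", current_mb=1100, target_mb=1024):
    print(statement)

# After shrinking a data file (not a log file), indexes may be fragmented,
# so an index rebuild or reorganize would normally follow as a separate step.
```

Generating the statements instead of running them keeps the example free of any database driver and leaves room to pause between chunks or stop early if the server gets busy.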
Hopefully everyone knows a little bit more about when and why you should shrink files on SQL Server. Next, I’m looking, and I do not see any questions in the live chat, so feel free to post questions on YouTube once this is posted, and we’ll see what we can do to answer those. Join us next week, as Derrick Bovenkamp joins me to discuss database corruption, some stories of our experiences in repairing corruption, the good, the bad, and the ugly, and things we can do to help reduce the impact of corruption. And remember, you can watch this episode and our other episodes on YouTube and on Spotify. All right. Thanks for watching. Have a great day.
Thanks for watching our video. I’m Steve, and I hope you’ve enjoyed this. Please click the thumbs up if you liked it. And if you want more information and more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create.