Avoid Common SQL Mistakes for Better Performance

Host: Steve Stedman
Topic: Common SQL Mistakes and How to Avoid Them
Recording Date: April 1, 2025
Database Health Monitor: https://databasehealth.com

Stedman SQL Podcast Season 2 Episode 11 Common SQL Mistakes

This week’s episode discusses 4 common mistakes that may be affecting your SQL Server performance. We will cover the misuse of correlated subqueries, cross apply and how it can be over used, missing or overlooked indexes and overly wide tables. Steve covers each of this potential issues in detail with real world examples and how to avoid misusing them. You won’t want to miss this one!

Podcast Transcript

Steve Stedman 00:00

Hi, I’m Steve, and welcome to the Stedman SQL podcast. This is season two, Episode 11, and I’m your host, Steve Stedman, today we’ll be taking a look at some common SQL Server mistakes and how to avoid them. These are specifically mistakes that negatively impact performance. But before we get into that, just a reminder that April is JOINs month at Stedman Solutions. So, if you have not downloaded our free JOINs type poster, please take a look at that. It’s free, and it’s something that may help you better understand Inner JOINs, left JOINs, outer JOINs, right JOIN stuff like that. So the poster is available for free at https://Stedman.us/poster and we’re also offering 35% off our JOINs class for the month of April. Get the class for just under $20 using the short URL of https://stedman.us/35offApril.

Other news, we just passed the 10 year anniversary mark at Stedman Solutions. Check out our 10 year anniversary podcast episode that we released on April 1 of this year. Yeah, I know April 1 is not the best day to release a podcast or to announce your 10 year anniversary, but I wish I’d realized that when I started the company on April 1, 10 years ago, we’d like to welcome our new and regular listeners. This is our 25th episode since we officially turned our streaming into a podcast last year.

So have you ever written a SQL query that took forever to run and then realized it was your own code or the only way, the way you wrote it that was holding it back? So today, we’re going to dive into some of those common mistakes and take a look at how to fix them. So SQL is a powerful but tricky language, and some small missteps or misunderstandings can lead to some pretty big performance disasters. So four key mistakes that we’re going to talk about in today’s podcast are misuse of and I should say misuse on all of these, because these are all things that, if used correctly, they may be okay, but misuse of correlated subqueries, overusing or misuse of cross apply, neglecting indexes or bad table structure, like one giant table instead of a relational design. So really, what we’re going to go through here is some practical and actionable advice for beginners and seasoned developers and Database Administrators alike. So first, let’s jump into correlated subqueries. What is a correlated subquery? Well, a correlated subquery is when you have a SELECT statement inside of another SELECT statement. But to make it a correlated subquery, that SELECT statement has to reference something from the outer query or the outer SELECT. So if you put in your list of columns that you’re selecting a query that references something from the outer query, inside of parentheses, and you have a SELECT there, that’s a correlated subquery. If you have a subquery in your where clause, where some column value is in, and then there’s a SELECT where it’s referencing something based off of the outer query that’s also a correlated subquery. Now, what makes a correlated subquery different than a regular subquery is that they reference the outer query inside of the subquery, and what that means is that when it references the outer query, it means that that is going to be executed repeatedly for every row. And when I say every row, if it’s in your SELECT statement, for the columns, it’ll be executed in every row that’s in your result set.

So if you’re selecting out 10,000 rows, well, that correlated subquery gets run 10,000 times. Where it’s different is in the where clause. It gets run once for every row. But every row doesn’t mean just those in your result set. Every row means those rows that have to be compared as part of that where clause. So if you’re joining two tables with 50 million rows in each table, and you’re using that correlated subquery in the where clause, or even, well specifically there, then that could be run as many as n times where n is the number of rows in that entire table that you’re joining to. So it could be run 50 million times that. So let’s say you’ve done the best thing you can to performance tune that correlated subquery, and it turns out that you’ve got it down to like 20 or 30 milliseconds. It’s really good. But then you take that 20 or 30 milliseconds and you multiply it times 10 million or 50 million, or however, many times it’s being run. Well, it’s not good at that point. And any amount of time you’ve got if you’re multiplying it by that many in that many times that it’s being run, it can’t possibly be fast at that point. So the thing with correlated subqueries is that they’re easy to do, and the common mistake that happens is they get overused. Basically, it’s a way to go and get what you need without maybe looking at other options of how it could be done. Maybe instead of doing a correlated subquery, maybe you do a JOIN to a table that has some grouping on it, or something like that.

The problem here is that performance can tank as the data set really grows. Think of N squared complexity, meaning that if you’ve got N being the number of rows and the more rows you have in there, it’s taking on basically an N squared, because it’s having to be run, perhaps, on how you’re doing a JOIN that many times for each row in the queries that are being joined. Now, a lot of the time, people look at it and think, oh, well, that’s just one line in my query. How could that be run that many times? So what happens is, the only way SQL Server can evaluate that correlated subquery that’s referencing something from the outer table in the JOIN is to run that once for every single row in that set. So the problem here is that you try it out in a small server, or a new server or a development environment, something that doesn’t quite have the same data that you have in production. And then over time, your production system grows, and you get more and more data there, and what happens is that you end up with something that used to run really fast with a small amount of data that’s now taking minutes or even hours to run, and it may tank the entire performance on your SQL Server.

Now think of this as someone built here’s a real world example that I was able to see with it. Somebody had a reporting query. This was going into an SSRS report, and in the where clause, it said where column A, I’m just making up names here, was in and then a SELECT from some other table where some value equaled something from the original table. And it worked great for years. And then they became more and more successful. They got more and more and more data in their production system, and that query just completely ground to a halt. And it was one of those things that it got to the point that somebody would run it and it would take down their entire it wouldn’t take down, but it would block up their entire report server. And then people would hit refresh and refresh and refresh again and again, and you end up with the same query running two times or three times or 10 times, depending on how many times they hit refresh, in order to try and get it to run faster. Well, one of the ways that you can avoid this is to rewrite your query as a JOIN and I mentioned earlier that this is JOIN months at Stedman Solutions, and go check out our JOINs class or download our SQL JOIN types poster for free, and that poster just extra plug here is available at https://Stedman.us/poster, you just fill out a form and you can download the PDF. But the way you can avoid this is to rewrite your query as a JOIN where possible, instead of doing like an in with a subquery or even an equals with a subquery in your work, in your where clause. And what you want to do is you want to be able to test this often, test your execution plans, to spot the issue early. And the way you can see this is to mouse over in your execution plan the different operators and see how many times it’s being run. And in most queries, things should be run once or a couple of times, depending on how it’s being done. But when you mouse over it and you see that a query is being run 50 times or 1000 times or 10,000 times a specific operator in that plan, well that’s a clue that maybe this isn’t going to scale, and maybe it isn’t going to work well for you in the long run. So the key here is, if your subquery mentions the outer table, stop and ask the question of, could this be a JOIN instead almost every subquery that I’ve seen could be written as a JOIN of some type, and it might be a JOIN to another subquery, but as long as it’s not a correlated subquery, then it’s only going to be run Once, instead of N times, where N is the number of rows being evaluated.

Okay, so that’s one that I have seen time and time again, that those correlated subqueries can really kill performance overall in a system. But one of them that gets even worse is excessive or misuse of the cross apply. Now the cross apply, which I should I didn’t realize I was going to plug my SQL JOINs poster, so much as I was going into this, the cross apply is also on my SQL JOIN types poster you can download and cross apply is basically cross apply or outer apply. Similar thing is a way of doing between like a JOIN to a function or to a subquery. And what happens when you JOIN to a function or a subquery is that you’re passing in some value from the outer query, and you’re ending ended up like the same thing we run into with a correlated subquery. But what’s different is that that may be going inside of a function, and a lot of the overhead there may be hidden when you’re looking at the execution plan. So let’s say you have a query that selecting back 10,000 rows, and then you’re using a cross apply to run some query against each one of those rows. Well, that cross apply is going to be run 10,000 times whether the cross apply is cross apply to a function, or a cross apply to a subquery. It’s a great feature to use if you know it’s not going to be called on lots and lots of rows. The next thing that I see that happens with this is somebody will first do a cross apply. You have a SELECT or a JOIN from a table. From there you go to cross apply, and then you take the result from that first cross apply and run it with another cross apply. So you’re taking one query that may be running the first cross apply 10,000 times, and then for each row that comes out of that first cross apply, you’re doing it 10,000 times, N with number of rows out of there being pushed, pumped into the second cross apply. And you may end up with some function or specific query that’s part of that cross apply being run hundreds of 1000s or millions of times. And no matter how performant you make that thing that’s being called the cross apply, if it’s being called too many times, it’s going to rack up a lot of time. And you’re going to wonder, why is your query slow?

One of the common mistakes is to use it everywhere, because it’s super flexible, and if there’s a way that you could do it with a simple JOIN or a where clause, that’s a better way to do things. So one of the reasons that this is a problem is that it really over complicates queries and can really bloat your execution plan to the point that parts of it are being run way more times than you would ever expect. Real world example, applying a function to every row when a filtered subquery or case statement would suffice. So think of it this this way that if you have a result set, and you’re using a cross apply to run some query or function against it. If there’s another way to do the same thing with a case statement, or maybe as just a filtered subquery, instead of that correlated subquery or correlated cross apply that can really, really slow things down. Again, this is one of those that on a system I regularly work on, we’re looking at a query that had three or four levels of cross apply deep, and it’s just like doing multiplication on how long your query is going to run, based on how long those subqueries run and how many rows are being considered in your result set. So basically, my recommendation is that you reserve cross apply for scenarios where you really need row by row function calls, those should be very rare. There should typically, be other ways to do it, and you should be able to compare what you would normally be doing with the cross apply to other ways of doing INNER JOIN or exist type queries in order to improve the overall speed and performance. Also look at your SQL Server execution plan for that query and profile the query. And don’t assume that the outer apply or cross apply is always the answer. Avoid it when you can quick tip. Think of it this way, cross apply is like a is a scalpel, not a sledge hammer, so you should use it precisely next.

One of those things that we commonly come across when we’re looking at different queries or different scenarios with clients that we’re working on, what we run into is that missing indexes really impact performance, and a lot of the time, people may not even know what indexes are missing, or may not know what indexes they need, and it may be different between a development environment and their production environment, depending on the amount of data that you’re using there. So basically, indexes speed up data retrieval, but they’re oftentimes overlooked. Now if you’re wondering, well, what is an index? We’ll go check out our earlier podcast, which I think was published on February 5 of this year. February 5 2025 that on what are indexes and the basics around indexes, but the common mistake here is forgetting to index frequently queried columns in your JOIN predicates or your where clause.

So the reason that this is a problem not having those indexes is that you end up with full table scans that ends up being slow queries, especially on really large data sets. Now let’s assume that you’ve got a dev environment with a substantial amount of data, but then your production environment has 10 times or 100 times or 1000 times the amount of data as your dev environment does well. You may have queries that work great in that dev environment. But perform completely different in that production environment because there’s so much more data there. If you’re looking at the difference between a full table scan versus an index seek, if you’re running with a very small data set, that may not make much difference at all, but it may be a difference between 10s of 1000s or hundreds of 1000s of page reads. So what you want to do is look at your execution plan, figure out if there’s any recommended indexes, and never go with just the recommended index that they give. You always go and verify it and make sure there’s not something that it’s a duplicate of, or there’s something close to it, but figure out what indexes are missing.

And this is one of those real world example we ran across with this was working with a client. Their system was grinding to a halt regularly, typically on some of their busiest days of the week. And after working with them, we went in and we found some indexes that helped with that, and we were able to apply those right on the spot, get them fixed, get them in place, and it was faster to add the index than it was for each query that was being run without that index. So we were able to make significant impact with that. So if you’re working on queries, you need to understand what indexes are being used, what indexes may be needed, and what you can do to improve the indexing to help with your specific queries. Now you want to look at the query patterns and index columns that are typically in the where or in the JOIN or in the order by clauses. Those are things that are oftentimes used, and we can get big performance boost by indexing them properly. Use composite indexes for multi column filters, meaning in your where clause. If you’re saying where, Column A equals something and Column B equals something else, we’ll put both column A and column B into that index and see, and oftentimes, I’ll experiment with if we add column A to the index and Column B, or we add column B first and then Column A or one versus the other and see how they perform. But don’t take this as a license to over index. Over indexing can be, well, usually as bad as not having indexes, but I would rather have a few too many indexes than not having enough. But over indexing, what it does is it makes it so every time you change or insert data, insert, update, delete things like that, that it takes a little bit longer because there are more copies of that data around in those indexes that need to be updated. So it’s a balance.

One of the things that I recommend is taking a look at database health monitor and the database health monitor performance monitor to track those missing indexes. There are so many indexing reports that we have inside of database health monitor to be able to show you what’s missing, what’s available. We I think next week, we’re talking about new features in database health monitor, and one of them is our cardinality report that will help you better understand what’s going on with some of those indexes. But the quick tip is take a look at the execution plan, and it will oftentimes be screaming at you about missing indexes when it really needs an index.

All right, next on the list, we’re going to go into another performance killer. Item four is giant tables versus smaller relational tables. So SQL Server is a relational database management system. What that means is that it was built to handle queries around tables with relations like foreign keys and primary keys and things like that. Tables with relations really well. But what it was, what SQL Server wasn’t built for, was a tool to handle giant, massively wide tables. So let’s say you’ve got a table with 150 columns in it, and they’re all varchar 150 or even worse, and varchar 150 and you’ve got a bunch of stuff in those tables, everything from dates to ages and numbers and strings, and I mean, all kinds of stuff that fits into appropriate data types, but it’s all been put into just one giant table because that was what was easy bringing into the system. Well, what ends up happening in that case is that anything you do to query that table, it may take a whole lot more time than you would ever expect, because it’s got to parse through all of that data. So for instance, if you’ve got a date field stored in that giant table as a varchar, and you’re trying to find out where a date is greater than a specific date yesterday, or something like that. Well, it’s got to go through and convert all of those either with a implicit or an explicit conversion. Either way, converting a bunch of bar charts into dates is not the most optimal way to query data. So what you end up with here is really poor availability. You end up with a ton of wasted storage, and you end up with sluggish queries, especially when a lot of those columns may be null.

So real world example table working with 150 columns, like I mentioned earlier, most of the data types are and varchar 150 and the table filled in at about 45 gigabytes in size, looking at that and looking at how we can restructure it, we could probably get that table and all of those indexes to be less than 15 gig if it was structured in a way where was broken up into lookup tables and things like that. So that gets us in, how into how do we avoid it, or how do we improve it? So this is where, instead of one giant table that would make me always think of like a big spreadsheet, you break up the table into smaller pieces. And this is called normalization. Normalization is breaking up the data into smaller pieces that can then be joined to in order to be able to bring in the things you need. Denormalization is going the other way, where you’re taking a bunch of smaller tables and making them into a big, massive table. And what we’re talking about here is normalization, taking your one massive table and figuring out, how do we split it up into smaller pieces. So for instance, let’s say you’ve got addresses in there, and one of the addresses is a city name. And if you’ve got 10 million rows in that table, you probably don’t have 10 million city names and city names. Let’s say it’s a varchar or an nvarchar 100 and typically, a city name might be, let’s just say 10 characters. So 10 characters and then varchar is going to take up 20 bytes, and 20 bytes times. Let’s say 10 million rows in that table, is a huge amount of data just to store those city names. So what you do is you make a city name table, which, in this example, or you could do a whole address or city state table, or something like that. But in this case, you make a city name table that has a list of all the cities, and instead of your main table containing the actual city name, you have it have a foreign key that links off to the primary key in that city name table. So instead of taking possibly 20 bytes for an average of maybe 10 characters per city name and putting that in your main big table, you could replace it with a foreign key, which probably be an integer, would be a fine foreign key for that which could come in at four bytes and save you a whole lot of space on every single row in that table. So repeat that for all the columns that have a lot of repeating values, even things like first names and last names. I mean, there’s only there’s a lot of wild ones out there, but there’s only so many first names and only so many last names out there. Eventually, if you have enough rows in your data, you can have repeats. And what that means is that they could be foreign keyed to a lookup table. When you do those look up tables, you want to make sure that you use foreign keys in order to maintain relationships and data integrity there. So you can’t go and delete from one while it’s still being referenced from the other. And when you’re working on these, you want to standardize on the right data types. So for instance, instead of a varchar 100 or a nvarchar 100 to store data that may be a Boolean. And I saw this one where we had a column that was basically a boolean value, but it contained the values of yes, no, true, false and on or off. So it contains six different values. Oh, and it had upper and lowercase versions of those as well, just to represent a yes, no Boolean type answer. So, had that been something that could have just been stored as a bit instead of a nvarchar? Well, that could have saved a whole lot of space, and it would have helped with the data analysis there, if you’re trying to use this data and that column has yes, no true and false in it. Well, how are you going to compare against that are you going to read the text string? Figure out? Is it a yes, a true, or an on? And then it just doesn’t make any sense to it that way. So normalize the data, standardize on the right data types, and then only denormalize when performance demands it. And if you’re denormalizing for some reason, make sure you specify why. So people know why. So the quick tip here is that if your table has more columns than your last spreadsheet, it’s time to rethink your design.

All right. Well, that wraps up my four tips around some common pitfalls or common things that we run into around performance with correlated sub queries and the misuse of them, cross apply and the overuse of it missing or overlooked indexes and giant tables. Now these are tables, query, sub queries, cross apply, they’re all good things until they end up. Being misused or used in a problem scenario. So the key takeaway here is that small tweaks in the design and query writing can save hours and hours of frustration later when it comes to performance. So my recommendation here is go try and find one slow query this week using these tips and figure out how to perform it soon it and let us know how it goes. I’d love to hear some feedback on what you’ve tried or what’s worked in order to help with these common issues. So as I mentioned earlier, a great related episode is the one on our indexing episode that was a podcast from February 5 of 2025 and that is one of our most popular episodes to date. So check that out. Thanks for watching and listening. Join us next week, where we discuss new features released in Database Health Monitor with Database Health Monitor developer Mitchell, and then if you missed it, also check out our 10 year anniversary episode that we released on April 1, not it. I mean, it was April Fool’s Day, but it’s not an April Fool’s joke. But we did release our 10 year anniversary episode on April 1, and that’s 10 years of me being full time at Stedman Solutions and growing it, from me starting a company with no income whatsoever from the business, all the way up to 10 years later, nine employees and doing quite well. So remember, all of our episodes are available on YouTube at Stedman.us/podcastYouTube and on Spotify. And the short link for that is stedman.us/podcastSpotify. Thanks for listening, and have a great day. Thanks for watching our video. I’m Steve, and I hope you enjoyed this. Please click the thumbs up if you liked it. And if you want more information, more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create.

Stedman SQL Podcast Sn 2 Ep 11 Common SQL Mistakes

Stedman SQL Podcast Season 2 Episode 11 Common SQL Mistakes

Podcast Transcript