Podcast Season 2 Ep. 4 Indexing

Host: Steve Stedman
Recording Date: January 15th, 2025
Topic: Indexing

Stedman SQL Podcast Sn 2 Ep. Indexing

In this episode of the SQL Server Podcast, Steve Stedman breaks down the fundamentals of indexing in SQL Server and why it’s essential for database performance. Steve starts by explaining the key role of indexes in speeding up query performance. Using a phone book analogy, he illustrates the difference between clustered and non-clustered indexes. A clustered index organizes data physically based on a specific column, making searches on that column highly efficient. Meanwhile, a non-clustered index acts as a separate lookup table, helping speed up queries on other columns, such as names or phone numbers.

Podcast Transcript

Steve Stedman 00:00 Welcome everyone to this week’s Stedman SQL podcast. This is season two, episode four, and I am your host, Steve Stedman. Today’s topic, we’re going to be talking about the basics of indexing on SQL Server, and we’ll get into that in just a minute. First a little bit of news being the first week of February. February is our performance month at Stedman solutions, where many of our blog posts and podcast topics are related to SQL server performance issues on that topic, I’d like to mention our free SQL server performance email course. It’s delivered once a week, about once a week, sometimes every seven or eight days for a whole year of SQL server performance lessons. Some of them are text based, some of them are video. Some of them are different tips and tricks. Basically, it’s free and you can learn. You can learn a lot in this course. And it’s one of those things. It’s absolutely free that I give away. And if you visit stedman.us/perf, P, E, R, F, you can get more details to sign up for that free year long email course. So welcome everyone to this week’s episode. We’re going to be covering the basic fundamentals of indexing, and there’s a lot of stuff I love around indexes that love to talk about, but we’re not going to be going into that super deep dive at this point. We’re just going to be starting with the basics.

I was on a call client a couple weeks or less than that, a week or so ago, and I asked the question. I said, because we were talking about indexes and performance, I asked the question of, does everybody on the call really understand how indexes work in SQL Server? And there were a couple people who were on the call that the assumption was they knew that, but it turned out that they didn’t. So I built this quick tutorial as a way to help sort of explain the basics around indexes, so that nobody will ever have to answer the question of, do you really understand the basics of indexes? And they never have to answer that as No.

So, what we’re going to cover is some basic things around how tables work in SQL Server, around clustered indexes and around non clustered indexes. Like I said, it’s an introduction, not a deep dive, but it’s intended to kind of get the point across of how indexes work. Now, one of the examples I like to use is a phone book. Now I don’t know we’ve moved on to the internet over the last 30 years, and phone books are not a printed. Phone Book is not as popular as it was 30 years ago, where you might have one sitting next to your phone on the on the table, but the idea is of a phone book is it lists everybody by name, order and their phone number, and oftentimes it had addresses and things like that too. So if you wanted to look me up Steve Stedman, you would look alphabetically, find the S’s and then find Stedman, and then search for Steve, and you’d be able to find my phone number and address that way. Well, imagine if that phone book was not alphabetically sorted. Imagine if it was just randomly sorted by wherever someone ended up when there was an open page when they bought their phone and they were put in the list there, you would have to flip through every single page of that phone book in order to find my phone number. And the only way to make sure you found all my phone numbers, if I had more than one, would be to look at every single page of that phone book to find me. Obviously, that would be a really inefficient way to organize a phone book so it’s ordered, or we’ll use the term clustered by last name and first name and middle initial so that people can quickly look things up. Well, databases are the same way picture a database table is being a lot like that phone book, and if you don’t have the right indexes on it. When SQL server goes to query your data, it may have to do the equivalent of looking at every single page in that phone book in order to be able to find your phone number. So I’ve built this demo here in Microsoft Excel as a way to kind of show some of the things around what’s going on with tables and indexes in SQL Server. So what I want to start out with is here’s just a basic table, and this table is called people, and it is the create statement. Is down at the bottom there, CREATE TABLE people. It has an ID, a name, a phone number and an address, and this is what is considered a unclustered table, or a heap in SQL Server.

Now, over time, rows get inserted to this table. Rows get deleted out of this table. Things get changed around at. This table and the ordering of what’s this tape, what’s in this table just happens to be where people were inserted, where the opening was found when they were inserted. So here’s an example. A few people have been deleted out of it, as happens over time, and then some people have been updated or changed or added back in later. And you can see what we’ve ended up here is this table representing people and phone numbers as an unclustered heap. Is just a random order of all of these people with no specific sort to it at all. So if you want to search through here and find Derek and all of the Derek’s, you’d have to start at the top and scan through every single row in the list until you found Derek. Okay, and then you’d have to keep going until you got to the very end of the list to make sure that there was not a second or a third Derek in the list. Now that being said, Imagine, okay, that’s easy to do on this list here that has like 16 rows in it, but imagine if this table had 15 million rows in it, and it’s just in a whatever random order of however it ended up when the rows were inserted in the table over time, and you write a query against that table to say, select star from the table where name equals Mitchell. Well, the only way that SQL server is going to be able to to get the answer to that and find all the Mitchells is to be is to scan all 15 million rows in that table, one by one and say, Well, is this row Mitchell? Is that row Mitchell? If it is, let’s throw that into our result set. Not a very efficient way to find data, and that’s the equivalent of having a phone book that’s just in random order, where the only way that you have to find someone in the phone book is to look through every single page of the phone book and read every single page until you find that so like what phone books did, where they sorted by something useful, a SQL Server had a concept, or brought in a concept of clustered indexes. And I say, brought it’s been there since, since I started working with SQL Server in 1990 but the idea is a clustered index on a table changes the ordering of how it exists. Now, just keep in mind that a table is a collection of data pages, and those data pages are 8k chunks of data that contain a handful of rows. So when we look at this, depending on how big the rows are, some of these rows might be on the first page, some of these rows might be on the second page, some of these rows might be on the third page, and some of these rows might be on the fourth page. Now, considering that this is a, theoretically a 15 million row table, well there might be a whole lot of data pages that have to be searched through.

So what SQL Server does when you create a cluster index, and I’m going to do that now, and the script for the basic script down below is create clustered index. I named it CL, people ID on the table people and cluster by the ID. That’s what we’ve got down the bottom there. And in doing that, it took everything in the table and rearranged the order based off of the ID here. So what this does is it makes it so if we want to search by ID, it’s very fast. So if we want to search and find out, okay, who is ID 19? Well, the index has a B plus tree, I guess they call it, or a tree structure that allows for quick searching based off of the things that are in that cluster and index mean. To do the ID, it goes through a tree structure. And instead of scanning every page in that 15 million row table, it scans through maybe a couple, two or three, maybe five pages in the index structure. To figure out, okay, to find position 19, it has to jump to this row on this page, and there it is, and one is position 19. So that adding the clustered index gave us a great way to search by ID and to quickly look up things by ID, to quickly identify any row by that ID, that’s great if we’re joining to other tables using that ID, for instance, it makes it quick for that join to happen, but it didn’t really help us if we’re searching by name. So if we want to find, if this is 15 million rows again, and we want to find all the all the Marcia is, well, we’d have to start at the top, scan through every single row and say, Okay, there’s one. And we keep scanning through all the rows, all the way down to the end, and find out that it took a while to scan through all 15 million of those rows, searching by name. Well, the thing with the clustered index is you can only have one clustered index on a table, because what it does is it rearranges the order of the entire table itself by the thing it’s clustered. It on. So we can’t say, Take this one table and sort it by ID and sort it by name. I mean, you could have a two parts in your index, but you couldn’t have it sorted both by ID and sorted by name, because the table they only exist in one place. It can only be sorted by one thing. So what SQL Server has is this concept of a non clustered index, and what a non clustered index does that says, let’s take those things that we want to put in the non clustered index, and we’re going to make a copy of them. Now let’s back up this 15 million row table with 8k data pages is going to take up some chunk of 8k pages depending on how big each row is, that might be hundreds of megabytes or even gigabytes of data in the clustered index or the table itself. So then what we’re going to do is we’re going to say, I want to add a non clustered index. And what that’s going to do is it’s going to take the data that you want to cluster on and make a copy of it. And conceptually, this is like making a copy of it into another table, but it’s a thing that’s hidden. It doesn’t look like a table, but it’s called an index, and effectively, it works like a table. And what this is doing is it’s saying we’re going to take the thing that you want to have in your non clustered index. And here we’re going to say it’s the column name, and it’s going to put all of those into a copy and sort them. And given the B tree on this on this index, it’s going to give us a fast way to search and find anyone by name, very quickly. But then this index is also going to contain a copy of whatever value is in our clustered index as a way to link back and join it back to our original table over here. So now, if we wanted to search on someone’s name like Fred, instead of having to search in the original table or clustered index the entire list and say, Are you Fred? Are you Fred? Are you Fred? And go all the way through and find that out of the 15 million rows, instead, it can quickly jump and find, here is the row with Fred on it through the beach restructure. So we’re doing a couple of 8k pages in case there very quickly. And if there were 50 friends in the list, they’d all be listed here, and you’d have a list of ids to go back to. So it can go a couple of page reads, it finds spread very fast. And then from there, we know Fred is clustered index ID of 18, which up over here there’s 18, maybe one or two page reads, or three or four page reads in a really big table to find Fred. And there we go. We’ve located Fred’s row. And what that non clustered index does is it allows the SQL Server query optimizer to do all the things that I just described there with do it very quickly to find your data without having to search every single row in a table, because that’s slow. Okay, great. So now we’ve got that let’s say we want to search on something else. Let’s say we want to search on phone number, or you want to search on address. Well, if we given that we have the table and it’s clustered on the ID, and we have a non clustered index that’s clustered on name, well if we write a query that says, Select star from people where phone number equals whatever, 1238877, in this case, the only way that SQL Server can get that is to scan through every single row of this clustered index and say, Is this one? That is this one, that is this one that and it’s back to in the example of 15 million rows searching all 15 million rows to say, is this the phone number we’re looking for this non clustered index over here doesn’t help us because it doesn’t have phone number in it at all. And the sorting of the clustered index over here doesn’t help us because it doesn’t have phone number at all. We’re going to add another non clustered index, and this time it’s going to be a non clustered index on phone number. So now if we want to look up a phone number that is 3601238877, it can quickly be found in this non clustered index because it’s in sorted order that. And this is the equivalent of what the SQL Server query optimizer is going to do when it runs it, and it’ll say, okay, the phone number is number 11. And if we’re saying whose name is associated with that phone number. It takes the 11, and then we’ll go back and do a clustered index seek to find out where 11 is, and it can get all the data very quickly there for Mitchell, based off of this, but given that we’ve added these two non clustered indexes, we can now quickly search on name because we have non clustered index on that. Quickly search on phone number because we have non clustered index on that, but we cannot quickly search on address that still takes a full table scan. And that’s a clustered index full table scan there. And I want to clarify a couple of terms here. If we talk about seek, that’s a way of. Going and quickly finding a row going through the V tree structure to locate it by just accessing the tree and then accessing the rows that need to be done. Now that’s what SQL Server is doing. If you look at the execution plan, it’ll show that a clustered index seek is happening, or non clustered index seek is happening. That’s fast. If it shows a scan is happening, that means that it’s having to go through every single row of the result set to find what you’re looking for. Now a few things. We’re gonna go into a little bit more depth here with another index. But first, keep in mind, our original table has taken up some amount of space. Let’s say, theoretically, it was 50 gigs of space taken up for the people table to begin with. Then we added a non clustered index which copied some data. It’s much less than the whole table. So instead of 50 gigs, maybe this non clustered index might be five gigs. And then we added another non clustered index which, again, copied a bunch of data into another location, and this might be another five gigs as well over here, if we’re talking about 15 million rows. Again, just making up numbers here for theoretical example. But so we’ve now taken what was originally 50 gigs, and we’ve turned it into 60 gigs, because 50 plus five plus five, and that’s where it’s now taking up more space. We can search faster because of that, but there’s more things in your database that need to be backed up. There’s indexes that need to be maintained, things like that, over time.

So at this point, some people say, Well, can I just go and, let’s say my column has or my table has 30 columns in it. Can I just go and add a single index for every column to make it quick for searching? Well, you could, but it wouldn’t really do what you think it’s going to do there. Part of it is the SQL Server query optimizer. When it’s running in query, it can really only use one index on your where or your join one non clustered index to find what you’re looking for. So if we added another non clustered index on just address, and we were searching on address and phone number or address and name, it wouldn’t really help us. It would help us if we were searching on just address, as we add those and they take up more space. That also means that there’s more work. The backups take longer. Your daily maintenance is hopefully running to do index defragmentation and statistics updating, which tracks what’s in your indexes so they can be used appropriately. That’s going to be slower too, because it’s got more indexes to go through and maintain. But the other thing that as you add more and more indexes, it’s going to be slow, is anything that’s changing the data, anytime you’re inserting, updating or deleting a row, it has to update the table itself or the clustered index, and it has to update all of the non clustered indexes to reflect whatever change happened in that original table. So if you have 50 non clustered indexes in different combinations, it’s going to be a lot slower to do those inserts, updates and deletes. So it’s a balance you have to come up with where you have enough non clustered indexes to do what you need to do to maintain your system, keep your queries running fast, but not so many of them that it slows things down and makes it too big. Okay, so both of these non clustered indexes that we’re looking at here happen to be single column, meaning we with the Create script down here, we just added a single column of name or phone number, and it also included the cluster index in there. You don’t really see that when you look at an index, but it’s there, so it can go back and look it up. But what if we wanted to add an index on multiple columns down here at the bottom, we can see create non clustered index on people, and in this case, we added the name and the address. So previously, if we wanted to search where a name started with s and that there was a wild card or something on like the address, it would require searching through here to find the name, and then going over here and searching through all of the addresses that match that name. And you’d end up with probably a non clustered index seek here and then a clustered index scanned over here to find based off of that address. But now that we’ve added an address table, if we wanted to search and say where name starts with S, well, it can quickly get there to find these two down at the bottom here, rather than searching through all 15 million rows. But we can do that before by using these two over here and where the address equals North 32nd street. Well, in that case, it can quickly determine from this index here that there is only one row that matches that. And then it that row is ID of 17. And the SQL Server query optimizer or query engine would look back over here and determine, Okay, here’s the whole row for this person named sky that’s at North. 32nd Street with this phone number. So this is where you need to be careful and not add too many indexes, because adding too many of them adds more work. But having no indexes is even worse, because if we go back to the very beginning, where the table was ordered in the fashion of just whatever row ended up as that is a really inefficient way in order to be able to access data. And a lot of times, what happens is, when a client comes to us and says, Hey, this query is slow. It’s at the point, it’s clogging up our system, it’s blocking things. I need a way to I mean, I need this query to run faster, and we’ll look and see, oh, well, the table is missing clustered index. Well, maybe we should put a clustered index on it so we can join quickly to that ID, or look things up quickly based on that ID. Or it might be if they’re filtering on name, and we have to go put in an index on name, it might be that we have to put on a bigger composite index that has three or four or five columns in it, rather than just one or two. So that kind of wraps it up at this point, it’s the basic, the fundamentals around how tables are structured. An unclustered table is a heap. A clustered table is ordered in the order of whatever you’ve decided to cluster it on. In this case, it’s the ID, and oftentimes it’s the primary key, but it doesn’t have to be non clustered. Indexes are copies of the data that are ordered by whatever thing you put in that list. And each one of them is like creating another table structure that you can go and quickly look things up for. You don’t do it, but the SQL Server Query engine does. And then all of these come together to make it so that SQL Server, through the query engine, can quickly find the data you’re looking for, rather than scanning through millions and millions of data pages. And if you look at the execution plan or the actual query statistics, a table with 15 million rows like I’ve been talking about, this example here could be millions of page reads to find what you’re looking for, if there’s no indexes on it, but it could be like three or four or five page reads to find what you’re looking for, a very small, finite number if it has the right indexes on it. So quick overview of indexes there. Hopefully this makes sense to everyone. Hopefully it puts it in a way that everyone can use it or and you know that every table should have, say, most every table should have a clustered index on it, and non clustered indexes are good for helping with the searching. Too many non clustered indexes can lead to performance issues, or adding non clustered indexes that that aren’t needed may lead to performance issues too, because you’ve got more work to maintain them, the performance issues would be on the INSERT, UPDATE and DELETE, but they help a lot with performance issues, because they give you a fast way to search or join on data. So that wraps up my indexing episode.

Hopefully everyone has learned something new from this, and I’d like to say thanks for watching the Stedman SQL podcast. Come back next week for another episode. We’ll see you then. Have a great day. You. Steve, thanks for watching our video. I’m Steve, and I hope you’ve enjoyed this. Please click the thumbs up if you liked it. And if you want more information, more videos like this, click the subscribe button and hit the bell icon so that you can get notified of future videos that we create so.