This is turning into a more-general introduction to indexing, but I suspect you'll still find it useful. The first two paragraphs especially speak to your question.
Clustered vs Non-clustered
This refers to how the table is physically arranged on disk. A clustered index works by sorting the physical pages and rows of a table on disk according to the index definition. Non-clustered indexes use a separate location on disk to store a copy of the columns in the index (and only those columns), plus a pointer back to the source records. For this reason, a clustered index is often faster, because it always covers any data the query needs. However, you only get one of them, because otherwise you'd be duplicating the entire table. It's also important to know that adding non-clustered indexes to a table actually slows down write operations like inserts and updates, because the database has to update each index, or at least certain pages in it, on every write.
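To see the basic lookup difference an index makes, here is a small sketch using Python's built-in sqlite3 module. SQLite is standing in purely for illustration (clustered indexes are an engine-level feature of databases like SQL Server, not SQLite), and the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports how SQLite intends to run the query
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# With no index on name, the only option is to read every row
before = plan("SELECT * FROM users WHERE name = 'user500'")
print(before)   # e.g. "SCAN users"

conn.execute("CREATE INDEX ix_users_name ON users (name)")

# With the index, the planner switches to a search on the sorted copy
after = plan("SELECT * FROM users WHERE name = 'user500'")
print(after)    # e.g. "SEARCH users USING INDEX ix_users_name (name=?)"
```

The second plan is the index at work: the engine searches the sorted index for the name, then follows the stored pointer back to the full row.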
Composite indexes

An index on (A,B) is not the same as an index on (B,A). In the first case, records in the index are ordered by column A first, and column B only affects the index order when there are duplicate values for A. Searching that index with only a value for B won't help you, because you still have to scan through every record in the index to find all your matching values in B. In the second case, the reverse happens: records are ordered by column B first, and column A only affects the order when there are duplicate values for B. Searching that index with only a value for A won't help you.
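The column-order point shows up directly in a query planner. This sketch again uses SQLite for illustration, with invented names; the planner reports SEARCH when it can seek into the index and SCAN when it has to read everything:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i % 10, i) for i in range(1000)])
conn.execute("CREATE INDEX ix_ab ON t (a, b)")  # index on (A,B)

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Filtering on the leading column A can seek into the index...
a_plan = plan("SELECT * FROM t WHERE a = 3")
print(a_plan)   # e.g. "SEARCH t USING INDEX ix_ab (a=?)"

# ...but filtering on B alone cannot: every entry must be examined
b_plan = plan("SELECT * FROM t WHERE b = 3")
print(b_plan)   # e.g. "SCAN t" (a full scan of the table or the whole index)
```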
Covering indexes

Sometimes a database can fulfill the requirements of a query entirely from an index. In this case, the index is said to be a "covering" index for that query. This is advantageous because indexes are often cached in memory, and so the database may not have to go to disk at all. To understand this, an index on (A,B) where there are very few duplicate values for A might seem wasteful, unless you have a frequently-run query that looks for a particular value of A and also needs B. That index will now save a lot of work, because the database never has to go back to the original table to retrieve B.
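SQLite's planner even labels this case explicitly, which makes it easy to demonstrate (again a sketch with invented names; SQLite is only standing in for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(i, i * 2, "x") for i in range(1000)])
conn.execute("CREATE INDEX ix_ab ON t (a, b)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# The index holds both A and B, so this query never touches the table:
covered = plan("SELECT b FROM t WHERE a = 42")
print(covered)      # e.g. "SEARCH t USING COVERING INDEX ix_ab (a=?)"

# Asking for C as well forces a lookup back into the table for each match:
not_covered = plan("SELECT b, c FROM t WHERE a = 42")
print(not_covered)  # e.g. "SEARCH t USING INDEX ix_ab (a=?)"
```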
Selectivity

Selectivity is a value from 0 to 1 (often expressed as a percentage) that tells you how unique the values in an index are. A selectivity of 1, or 100%, means there are no duplicates. A selectivity approaching 0 means nearly every row holds the same value. Generally, a higher selectivity (approaching 1) is better for indexes.
To demonstrate this, think about what would happen with a low-selectivity index. For example, say you try to speed up a query by adding an index to a bit column in a table with 10,000 records. With only two possible values, the selectivity of that index is about as low as it gets. You run your query, and (assuming a uniform distribution) the index returns 5,000 records. But each of those records still has to be looked up in the original table, and since the index order doesn't match the table order, that means a lot of separate look-ups. Instead, it's likely faster to just scan the entire table from start to finish to retrieve the needed data.
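The bit-column example can be worked through in a few lines of Python. Computing distinct values divided by total rows is one common way to express selectivity (some texts instead quote the fraction of rows a single value matches); the uniform 50/50 split is assumed:

```python
def selectivity(values):
    # one common formulation: distinct values / total rows
    # 1.0 means no duplicates at all
    return len(set(values)) / len(values)

# A unique identity column: perfect selectivity
ids = list(range(10_000))
print(selectivity(ids))     # 1.0

# A bit column over the same 10,000 rows: only two distinct values
bits = [i % 2 for i in range(10_000)]
print(selectivity(bits))    # 0.0002

# Looking up one bit value still matches half the table,
# so the index saves almost nothing:
matches = sum(1 for b in bits if b == 1)
print(matches)              # 5000
```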
Selectivity explains why you would want to cluster on the primary key. Since the clustered index tells the database how to order the table, anything less than 100% selectivity here means queries will have to scan the table more often. Clustering on the primary key gives you perfect selectivity. And since the primary key is often used as the record pointer in other indexes, you want to keep it as small as possible (i.e., an integer identity column).
There's a good article on selectivity and indexing here:
Sargability

This refers to whether the database is able to use a particular filter with an index ("sargable" is short for "Search ARGument-able"). As we have shown, indexes normally work by first sorting the data into a specific order, so that lookups into that index can use something efficient like a tree-based search rather than a slower linear scan. Anything that can't be effectively compared against sorted data can't be used with an index. A good example is the LIKE operator. This is sargable:
SELECT * FROM [Table] WHERE [Column] LIKE @Value + '%'
but this is not sargable:
SELECT * FROM [Table] WHERE [Column] LIKE '%' + @Value + '%'
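The two patterns produce visibly different plans. Here is a SQLite sketch (invented names; note that SQLite's LIKE is case-insensitive by default, so a pragma is needed before its prefix optimization applies; SQL Server behaves differently in the details but the principle is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [(f"user{i}",) for i in range(1000)])
conn.execute("CREATE INDEX ix_name ON t (name)")
# required for SQLite to rewrite a prefix LIKE as an index range search
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable: a fixed prefix becomes a range search on the sorted index
sargable = plan("SELECT * FROM t WHERE name LIKE 'user1%'")
print(sargable)    # e.g. "SEARCH t USING COVERING INDEX ix_name (name>? AND name<?)"

# Not sargable: a leading wildcard defeats the sort order entirely
unsargable = plan("SELECT * FROM t WHERE name LIKE '%ser1%'")
print(unsargable)  # e.g. "SCAN t" (every entry must be tested)
```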
Some other things that can make a filter un-sargable are wrapping the filtered column in a function or calculation, and non-deterministic functions (and there are more of those than you think).
A common mistake I've seen is to create a separate index for each column in a table. For example, someone will take a table with columns (A,B,C,D) and create four separate indexes, one each for A, B, C, and D, believing they have now indexed every column and so every query should be fast. In fact, this is rarely helpful, for reasons I hope I've already explained, and it will often make things worse rather than better, because the database now has to update every one of those indexes on every change to the data.
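The write penalty is easy to observe. A rough sketch, again with SQLite standing in and invented names; absolute timings depend entirely on the machine, so treat the numbers as relative only:

```python
import sqlite3
import time

def insert_rows(extra_indexes):
    """Insert 50,000 rows into a four-column table with N single-column indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
    for i in range(extra_indexes):
        col = "abcd"[i]
        conn.execute(f"CREATE INDEX ix_{col} ON t ({col})")
    rows = [(i, i, i, i) for i in range(50_000)]
    start = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    return elapsed, count

plain, n1 = insert_rows(0)   # no indexes to maintain
heavy, n2 = insert_rows(4)   # four indexes updated on every insert
print(f"no indexes: {plain:.3f}s, four indexes: {heavy:.3f}s")
```

On a typical run the four-index table is noticeably slower to load, because every insert has to be written into the table plus each of the four sorted structures.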