One 400GB table, One query - Need Tuning Ideas (SQL2005)

0 votes
asked Dec 8, 2008 by mr-flibble

I have a single large table which I would like to optimize. I'm using MS-SQL 2005 server. I'll try to describe how it is used and if anyone has any suggestions I would appreciate it very much.

The table is about 400GB, has 100 million rows and 1 million rows are inserted each day. The table has 8 columns, 1 data col and 7 columns used for lookups/ordering.

 k1 k2 k3 k4 k5 k6 k7 d1

where

 k1: varchar(3), primary key - clustered index, 10 possible values
 k2: bigint, primary key - clustered index, total rows/10 possible values
 k3: int, 10 possible values
 k4: money, 100 possible values
 k5: bool
 k6: bool
 k7: DateTime

Only one select query is run which looks like this:

 SELECT TOP(g) d1 FROM table WITH(NOLOCK)
  WHERE k1 = a
  AND k3 = c
  AND k4 = d
  AND k5 = e
  AND k6 = f
  ORDER BY k7

where g = circa 1 million. This query is run about 10 times per day (often while inserts are happening) and takes about 5-30 minutes.

So I currently only have a clustered index on the two primary key columns. My question is: what indexes should I add to improve this query's performance?

Would separate indexes on every column be a good choice? I think a single index would take up about 5-8GB. The DB server has 8GB RAM total.

Please do not say that the best thing is to experiment. This is akin to 'I don't know, work it out yourself' :)

Any tips much appreciated!


EDIT by doofledorfer--

You've caused an outbreak of premature optimization here, if not outright suggestions that "the best thing is to experiment". You need to clarify a number of issues if you want useful help.

-- doofledorfer


EDIT: Comments on posts to date are now posted below along with query plan - Mr. Flibble

You are probably I/O bound

Yes, it is not CPU bound. Disk access is high. All available RAM seems to be used. Whether it is used wisely or not remains to be seen.

You say you can't split the data because all the data is used: IMPOSSIBLE

I mean that all data is used at some point - not that all data is used by each user in each query. I can certainly split the data but, so far, I don't understand why partitioning the table is any better than using a clustered index.

Why did you choose these types?

The VARCHAR probably should have been an INT as it can only take a few values. The rest are sensible enough: money represents a money value in real life, bigint is an ID, and the bools are on/off type things :)

By any chance could we have a look at the insert statement, or the TSQL, or the bulk insert?

TSQL. It's basically INSERT INTO table VALUES (k1,k2,k3,k4,k5,k6,d1). The only thing that is in any way interesting is that many duplicate inserts are attempted, and the k1 & k2 PK constraint is used to prevent duplicate data entering the database. I believed at design time (and still do) that this was as quick a way as any to filter out duplicate data.
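
For illustration only, a minimal sketch of that kind of duplicate-filtering insert, assuming a table named dbo.Hands and @-parameters (both are placeholders of mine, not the original code); the NOT EXISTS guard shown here is a common alternative to simply letting the PK constraint reject the duplicate:

 -- Sketch: skip rows whose (k1, k2) key is already present,
 -- rather than relying on the PK violation to reject them.
 INSERT INTO dbo.Hands (k1, k2, k3, k4, k5, k6, d1)
 SELECT @k1, @k2, @k3, @k4, @k5, @k6, @d1
 WHERE NOT EXISTS (
     SELECT 1 FROM dbo.Hands WHERE k1 = @k1 AND k2 = @k2
 );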

Can you tell how often your insert happens?

Every 10 minutes or so inserts run (via ADO.NET), maybe 10K rows at a time, and take a few minutes. I estimate that a full day's inserts currently take 40% of the time in the day.

Does the DateTime field contain the date of insert?

No. There is actually another DateTime column which does, but it is not retrieved in any SELECT query, so I didn't mention it for the sake of simplicity.

How did you come to this?

More like one man-day of thinking.

if you're interested only in the last data, deleting/archiving the useless data could make sense (start from scratch every morning)

I am not interested in recent data only. A query may select some of the very first data that was inserted into the table all the way up to data inserted minutes ago. But as the data is filtered this does not mean that all the data in the DB is requested in that query.

if there is only one "inserter" and only one "reader", you may want to switch to a specialised type (hashmap/list/deque/stack) or something more elaborate, in a programming language.

I will probably stick with MSSQL for the moment. It's not broke yet, just a little slow.

liggett78, do you suggest a clustered index on columns k1,k4,k5,k6,k3 or a non-clustered index on those columns?


My main question right now is: should I extend the current clustered index to contain k4 as well (this is the column with the next most possible values), or should I just add a non-clustered index on k4?

Would adding all of k1-k6 to a clustered index be an option? Then have a separate non-clustered index on the DateTime column for the ORDER BY? Am I correct in thinking that this would not cause any major increase in DB size but would only affect insert times? Can anyone guesstimate the effect this will have on inserts?
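
For concreteness, the two options under consideration might look like this in DDL (a sketch only; the table and constraint names are assumed):

 -- Option A: widen the clustering key to include k4, keeping (k1, k2)
 -- as the uniqueness guarantee via a non-clustered PK. On a 400GB table
 -- this rebuild is a very large offline operation.
 ALTER TABLE dbo.Hands DROP CONSTRAINT PK_Hands;
 CREATE CLUSTERED INDEX CIX_Hands_k1_k2_k4 ON dbo.Hands (k1, k2, k4);
 ALTER TABLE dbo.Hands ADD CONSTRAINT PK_Hands
     PRIMARY KEY NONCLUSTERED (k1, k2);

 -- Option B: leave the PK alone and add a separate non-clustered index.
 CREATE NONCLUSTERED INDEX IX_Hands_k4 ON dbo.Hands (k4);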

I think that if adding indexes to all the columns will double the DB size then it is not viable without large (ie. hardware) changes.


The following plan was run with a non-clustered index on the DATE column.

EDIT: Not sure if you can see the XML below so here is a link to it: http://conormccarthy.com/box/queryplan.sqlplan.txt

<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.0" Build="9.00.1399.06" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementCompId="1" StatementEstRows="11111" StatementId="1" StatementOptmLevel="FULL" StatementSubTreeCost="625.754" StatementText="SELECT TOP(11111) d1 FROM hands WITH (NOLOCK) &#xD;&#xA;                                WHERE k4 = '10' &#xD;&#xA;                                AND k6 = 1 &#xD;&#xA;                                AND k5 = 1  &#xD;&#xA;                                AND k1 = 'IPN'  &#xD;&#xA;                                AND k3 BETWEEN 2 AND 10  &#xD;&#xA;                                ORDER BY k7 DESC&#xD;&#xA;&#xD;&#xA;" StatementType="SELECT">
          <StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
          <QueryPlan DegreeOfParallelism="1" CachedPlanSize="36">
            <MissingIndexes>
              <MissingIndexGroup Impact="81.7837">
                <MissingIndex Database="[MYDB]" Schema="[dbo]" Table="[Hands]">
                  <ColumnGroup Usage="EQUALITY">
                    <Column Name="[k1]" ColumnId="1" />
                    <Column Name="[k4]" ColumnId="7" />
                    <Column Name="[k5]" ColumnId="9" />
                    <Column Name="[k6]" ColumnId="10" />
                  </ColumnGroup>
                  <ColumnGroup Usage="INEQUALITY">
                    <Column Name="[k3]" ColumnId="6" />
                  </ColumnGroup>
                  <ColumnGroup Usage="INCLUDE">
                    <Column Name="[d1]" ColumnId="3" />
                    <Column Name="[k7]" ColumnId="4" />
                  </ColumnGroup>
                </MissingIndex>
              </MissingIndexGroup>
            </MissingIndexes>
            <RelOp AvgRowSize="75" EstimateCPU="0.0011111" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="11111" LogicalOp="Top

23 Answers

0 votes
answered Jan 8, 2008 by chris

You need to create an index which will reduce the number of possible rows returned as quickly as possible.

Therefore the simplest index to create would be on column k4, as that has the highest number of different values. It is only necessary to index the initial substring of k4 over which the expected values differ; this will reduce the size of the index and speed up access.

k7 should also be indexed, as this will greatly increase the speed of the ORDER BY clause.

You may also need to experiment (I know, I know, you said don't experiment, but this may help...) with creating a multiple-column index in this order: k4, k1, k2, k3. This, again, is to reduce the number of possible rows returned as quickly as possible.
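
In DDL, those suggestions might read as follows (a sketch; the index and table names are assumed):

 -- Single-column index on k4, the column with the most distinct values.
 CREATE NONCLUSTERED INDEX IX_Hands_k4 ON dbo.Hands (k4);

 -- Index on k7 to support the ORDER BY.
 CREATE NONCLUSTERED INDEX IX_Hands_k7 ON dbo.Hands (k7);

 -- The experimental multiple-column index, most selective column first.
 CREATE NONCLUSTERED INDEX IX_Hands_k4_k1_k2_k3 ON dbo.Hands (k4, k1, k2, k3);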

0 votes
answered Jan 8, 2008 by shawn

Here is an idea: what if you create a second table with all of the lookup values, and then, instead of filtering the big table directly, you join the two tables and put the WHERE clause on the new lookup table?

Also I think it could help if you posted a few rows of data and a sample query, if possible.
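
A sketch of what the join idea might look like (all names here are hypothetical, and it assumes adding a small surrogate key column to the big table):

 -- Hypothetical lookup table: one row per distinct filter combination.
 CREATE TABLE dbo.Lookup (
     lookup_id int IDENTITY(1,1) PRIMARY KEY,
     k1 varchar(3) NOT NULL,
     k3 int        NOT NULL,
     k4 money      NOT NULL,
     k5 bit        NOT NULL,
     k6 bit        NOT NULL
 );

 -- The query then filters the small table and joins back to the big one
 -- (assumes the big table gains a lookup_id column):
 SELECT TOP(1000000) h.d1
 FROM dbo.Hands h
 JOIN dbo.Lookup l ON l.lookup_id = h.lookup_id
 WHERE l.k1 = 'IPN' AND l.k3 = 2 AND l.k4 = 10 AND l.k5 = 1 AND l.k6 = 1
 ORDER BY h.k7;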

0 votes
answered Jan 8, 2008 by booji-boy

Add a single index with columns k1-k6 in it; that should be the best.

Also, if you can, run sp_updatestats before each query.
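
In DDL, that would be roughly (a sketch; names assumed):

 -- One composite index over all the filter columns.
 CREATE NONCLUSTERED INDEX IX_Hands_k1_to_k6
     ON dbo.Hands (k1, k2, k3, k4, k5, k6);

 -- Refresh statistics so the optimizer sees current distributions.
 EXEC sp_updatestats;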

0 votes
answered Jan 8, 2008 by hlgem

I would use the index tuning wizard to get a better answer.

However, if it were me, I would try an index on K3, K4 (in the order you most commonly query; you already have K1 and K2 indexed) and a separate index on K7. I don't believe the addition of the boolean fields would improve index performance.

Remember: the more indexes, the slower inserts will be. With the number of inserts you have, this is a real concern. So truly the only real answer is that you will have to experiment with your own data and hardware and find what works best for your personal situation. The fact that it wasn't what you wanted to hear doesn't make it any less true; indexing is very dependent on how your application actually works and on the structure of your data.

0 votes
answered Jan 8, 2008 by vvs

I would say that 8 GB is not enough RAM for a 400 GB table. The server has no chance of keeping the relevant data in memory if one index alone takes 5-8 GB, so there are lots and lots of hard-disk reads, which make the query slow.

In my opinion, increasing the amount of RAM and putting the database on a fast RAID (perhaps split across multiple RAID arrays?) would help the most.

EDIT: To be sure what your real bottleneck is, run Windows' Performance Monitor.

0 votes
answered Jan 8, 2008 by mr-flibble

Thanks all for your help.

I have made 3 edits to mistakes in the original post.

1) The WHEREs should have been ANDs.

2) k4 should have been MONEY not VARCHAR. Also, k1 is of length 3.

3) k2 should not be in the WHERE clause. As doofledorfer correctly points out, with the full primary key in the WHERE clause it makes no sense to filter on anything else.

Here are the answers to your questions:

Why have you clustered on the primary key?

I was under the impression that the PK was set as a clustered index by default. I did not change it.

Which columns can be NULL?

None.

What are the VARCHAR lengths?

I made a mistake with the column types. The only remaining VARCHAR is of length 3.

What does the query plan give you now?

Posted in the next post.

Help me understand more about the table. If your PK is k1,k2, you shouldn't have to select by any other column to get a completely unique record.

This was a mistake. The k2 part of the PK is not in the WHERE clause.

Knowing why you need around a million records returned might help me provide a better solution.

The database contains daily records (the d1 TEXT column) of data. People need access to large amounts of this data to run their own reports. They need to filter it by a number of values and have it delivered sorted by time.

It looks like you only want the earliest "g" records? Maybe only the most recent "g" records?

Yes, the latest. But I need a certain number of them, and I don't know the start date beforehand.

Do you have foreign keys on k3, k4?

No. This is the only table in the DB.

Comments:

Even if the clustered index is proper, the more selective field should come first.

The more selective field is not used in the WHERE clause (post edit!). So I take it it should not come first in that case?

You may want to move data over a certain age to a history table

Currently all the data is used so pruning is not an option.

You may want to defrag the index

Currently I have none. Will look into it if this thread proves fruitful.

Add a single index with columns k1-k6 in it; that should be the best.

Can anyone else comment on this suggestion? Liggett78 commented that this will double the size of the DB without helping much, because of the date-column sort. Note that the DATE column is not in the WHERE clause; it is only used for ordering the data.

Try turning k1, k2 into ints and making them foreign keys; it'll use a lot less storage, for one, I'd have thought, and I think it should be quicker (though I may be wrong there, I guess SQL Server caches these values).

k2 is a bigint (mistake in the orig post). So changing k1 to an int (from a VARCHAR(3)) is an option. Do we really think this is going to make much difference? And do people really think that splitting the table into k1,k2,d1 and k1,k2,k3,k4,k5,k7 and using foreign keys would improve things?
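
For reference, the split being discussed would look roughly like this (a sketch; all names are assumed):

 -- Narrow table holding the data column, keyed by (k1, k2)...
 CREATE TABLE dbo.HandData (
     k1 varchar(3) NOT NULL,
     k2 bigint     NOT NULL,
     d1 text       NOT NULL,
     CONSTRAINT PK_HandData PRIMARY KEY CLUSTERED (k1, k2)
 );

 -- ...and a table carrying the filter/sort columns, foreign-keyed back.
 CREATE TABLE dbo.HandKeys (
     k1 varchar(3) NOT NULL,
     k2 bigint     NOT NULL,
     k3 int        NOT NULL,
     k4 money      NOT NULL,
     k5 bit        NOT NULL,
     k6 bit        NOT NULL,
     k7 datetime   NOT NULL,
     CONSTRAINT PK_HandKeys PRIMARY KEY CLUSTERED (k1, k2),
     CONSTRAINT FK_HandKeys_HandData FOREIGN KEY (k1, k2)
         REFERENCES dbo.HandData (k1, k2)
 );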

One good tip to improve query speeds is to put in a sub-query that cuts down your recordset size to a more manageable one. There is likely to be some set of data that immediately cuts the recordset down from, say 10 million rows, to 10,000.

 e.g. SELECT TOP(g) d1 FROM (SELECT * FROM table WITH(NOLOCK) WHERE k1 = a) AS t WHERE k3 = c AND k4 = d AND k5 = e AND k6 = f ORDER BY k7

Very interesting. Would this really help? It seems like SQL Server would be very stupid if it did not cut down the data in a similar manner itself.

Perhaps it is the time taken by your UI to display the data, or perhaps it is the time taken by the network?

There is no UI. There certainly are network issues moving the data but I am only concerned with the time taken for the query to start returning results (I'm using an ADO.NET data reader) at the moment - one thing at a time :)

.. [to] see the most gains ... partition the table

Will a clustered index not have the same effect?

Leave your primary key alone, but create a clustered index on your date column, since this is what you use in ORDER BY. That way the database engine would begin to scan the clustered key, compare columns with your supplied values and output rows that satisfy the conditions.

Sounds like a sound plan! Any other backers?

To summarize the suggestions:

1) Create separate indexes on all keys: most people vote no on this?

2) Create separate indexes on the keys with most distinct values.

3) Create a multiple column index on some of the columns, with the columns with the most distinct values first.

4) Throw RAM at it.

0 votes
answered Jan 8, 2008 by liggett78

Your query plan basically shows the following:

  • The first op is a clustered index seek with comparisons on k1, handId?, d1, k3-k6
  • The second is an entire index scan on k1, handId? and k7
  • The third is, of course, the join to build the final result set
  • Then a sort for the ORDER BY
  • Finally, TOP n (Filter)

The plan suggests an index which should improve performance by 81%: k1, k4, k5, k6, k3, plus INCLUDE d1 & k7. I don't know how long it would take to build such an index and see the results but, as I've commented here, it will effectively double the size of your table, simply because almost every column is present in the index. Inserts will also be slower.

As many people have suggested, partitioning is the best strategy here, e.g. have one table hold k3 values from 1 to 3, another from 4 to 7, and a third from 8 to 10. With SQL Server 2005 Enterprise this is native table partitioning on that column (the older partitioned-view approach uses a CHECK constraint per table); either way, the query optimizer will determine which of the n partitions/tables to scan or seek depending on the parameter value for the column.
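
For reference, native partitioning on SQL Server 2005 Enterprise is declared with a partition function and scheme; a minimal sketch, with all names assumed:

 -- Split k3 into three ranges: <= 3, 4-7, and >= 8.
 CREATE PARTITION FUNCTION pf_k3 (int)
 AS RANGE LEFT FOR VALUES (3, 7);

 -- Map every partition to the primary filegroup (in a real deployment
 -- these would go to separate filegroups on separate spindles).
 CREATE PARTITION SCHEME ps_k3
 AS PARTITION pf_k3 ALL TO ([PRIMARY]);

 -- A table or index created ON ps_k3(k3) is then divided by k3, and the
 -- optimizer skips partitions that cannot match the WHERE clause.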

0 votes
answered Jan 9, 2008 by call-me-steve

That sounds like good fun.

A few questions:

  • Why did you choose these types? varchar, money, bigint, int, bool? Is there a reason, or just a wish to add some good fun?
  • By any chance could we have a look at the insert statement, or the TSQL, or the bulk insert?
  • Can you tell how often your insert happens (is it a bulk load, or random?)
  • Does the DateTime field contain the date of insert?
  • How did you come to this? (One man-day of thinking, or a team of 20 people working like crazy for the last three months?)

A few facts seem important to me:

  • You insert a million rows every day
  • You want only the last million rows of data

A few remarks came to me:

  • if you're interested only in the last data, deleting/archiving the useless data could make sense (start from scratch every morning)
  • if there is only one "inserter" and only one "reader", you may want to switch to a specialised type (hashmap/list/deque/stack) or something more elaborate, in a programming language.

0 votes
answered Jan 9, 2008 by mark-brackett

I think a clustered index on K7 is the only thing of any value. The rest of your where clause has such low selectivity that it's a waste of time.

Unless you can take advantage of some specific knowledge of your values (maybe k5 is only true if k4 < 0, or something), you're pretty much looking at a clustered index scan. Might as well make it the field that you're ordering by.

Looking at the low numbers of distinct values in k3 - k6, you'd probably only need to read < 1.5 million rows to get your top 1 million. That's probably the best you're going to do - especially since any other plan would need you to order by k7 anyway to evaluate your TOP clause.
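
In DDL, that approach might look like this (a sketch; the constraint names are assumed, and on a table this size both steps are very large offline operations):

 -- Rebuild the primary key as non-clustered to free the clustered slot.
 ALTER TABLE dbo.Hands DROP CONSTRAINT PK_Hands;
 ALTER TABLE dbo.Hands ADD CONSTRAINT PK_Hands
     PRIMARY KEY NONCLUSTERED (k1, k2);

 -- Cluster on the ORDER BY column so a scan returns rows pre-sorted.
 CREATE CLUSTERED INDEX CIX_Hands_k7 ON dbo.Hands (k7);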

0 votes
answered Jan 12, 2008 by sean-reilly

Have you considered creating a surrogate identity column (type bigint) and using that as the clustered index? Then create your primary key as a non-clustered unique index.

With a table of this size, it's quite possible that index and page fragmentation are a big performance problem. The surrogate clustered index will ensure that all inserts are at the end of the table, which can almost completely eliminate page fragmentation (unless rows get deleted). Less page fragmentation == more pages per IO, which is a very good thing.

This will also allow you to periodically defrag the unique index that you are querying on, which will make it much more effective. Do this often, or at least monitor index fragmentation on this table regularly.
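
On SQL Server 2005, fragmentation can be checked with the sys.dm_db_index_physical_stats DMV; a sketch (database and table names assumed):

 -- Report average fragmentation per index on the table.
 SELECT i.name, s.index_id, s.avg_fragmentation_in_percent, s.page_count
 FROM sys.dm_db_index_physical_stats(
          DB_ID('MYDB'), OBJECT_ID('dbo.Hands'), NULL, NULL, 'LIMITED') AS s
 JOIN sys.indexes AS i
   ON i.object_id = s.object_id AND i.index_id = s.index_id;

 -- Then defragment in place, or rebuild, as fragmentation dictates:
 -- ALTER INDEX ALL ON dbo.Hands REORGANIZE;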

These performance improvements can be quite dramatic -- if your current PK is highly fragmented, an index seek can involve a great deal more IO than it should.

Once you've implemented this, consider (aka, try it and measure ;-) adding a nonclustered index on column k7.
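
A sketch of the layout being suggested (names are hypothetical; in practice a 400GB table would be migrated to this shape by copy-and-switch rather than altered in place):

 CREATE TABLE dbo.Hands2 (
     id bigint IDENTITY(1,1) NOT NULL,
     k1 varchar(3) NOT NULL,
     k2 bigint     NOT NULL,
     k3 int        NOT NULL,
     k4 money      NOT NULL,
     k5 bit        NOT NULL,
     k6 bit        NOT NULL,
     k7 datetime   NOT NULL,
     d1 text       NOT NULL,
     -- Surrogate clustered key: all inserts land at the end of the table.
     CONSTRAINT PK_Hands2 PRIMARY KEY CLUSTERED (id),
     -- The natural key becomes a non-clustered unique index.
     CONSTRAINT UQ_Hands2_k1_k2 UNIQUE NONCLUSTERED (k1, k2)
 );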
