I have a single large table which I would like to optimize.
I'm using MS-SQL 2005 server. I'll try to describe how it is used and if anyone has any suggestions I would appreciate it very much.
The table is about 400GB, has 100 million rows and 1 million rows are inserted each day.
The table has 8 columns, 1 data col and 7 columns used for lookups/ordering.
k1 k2 k3 k4 k5 k6 k7 d1
k1: varchar(3), primary key - clustered index, 10 possible values
k2: bigint, primary key - clustered index, total rows/10 possible values
k3: int, 10 possible values
k4: money, 100 possible values
Only one select query is run which looks like this:
SELECT TOP(g) d1 FROM table WITH(NOLOCK)
WHERE k1 = a
AND k3 = c
AND k4 = d
AND k5 = e
AND k6 = f
ORDER BY k7
where g = circa 1 million
This query us ran about 10 times per day (often while inserts are happening) and takes about 5-30 minutes.
So I currently only have a clustered index on the two primary key columns. My question is: what indexes should I add to improve this query's performance?
Would separate indexes on every column be a good choice? I think a single index would take up about 5-8GB. The DB server has 8GB RAM total.
Please do not say that the best thing is to experiment. This is akin to 'I don't know, work it out your self' :)
Any tips much appreciated!
EDIT by doofledorfer--
You've caused an outbreak of premature optimization here, if not outright suggestions that "the best thing is to experiment". You need to clarify a number of issues if you want useful help.
EDIT: Comments on posts to date are now posted below along with query plan
- Mr. Flibble
You are probably I/O bound
Yes, it is not CPU bound. Disk access is high. All available RAM seems to be used. Whether it is used wisely or not remains to be seen.
You say you can't split the data because all the data is used: IMPOSSIBLE
I mean that all data is used at some point - not that all data is used by each user in each query.
I can certainly split the data but, so far, I don't understand why partitioning the table is any better than using a clustered index.
Why did you choose these types
VARCHAR probably should have been INT as it can only be a few values. The rest are sensible enough, Money represents a money value in real life and bigint is an ID, and the bools are onny, offy type things :)
By any chance we could get have a look the insert statement, or TSQL or the bulkinsert
TSQL. Its basically INSERT INTO table VALUES (k1,k2,k3,k4,k5,k6,d1). The only thing that is in any way interesting is that many duplicate inserts are attempted and the k1 & k2 PK constraint is used to prevent duplicate data entering the database. I believed at design time (and now) that this was as quick a way as any to finter out duplicate data.
Can you tell how often your insert happens
Every 10 minutes or so inserts run (ADO.NET) maybe 10K at a time and take a few minutes. I estimate currently a full day's inserts take 40% of the time in the day.
Does the DateTime field contains the date of insert
No. There is actually another DateTime column which does but it is not retrieved in any SELECT query so I didn't mention it for the sake of simplicity.
How did you came to this
More one man day thinking.
if you're interested only in the last data, deleting/archiving the useless data could make sense (start from scratch every morning)
I am not interested in recent data only. A query may select some of the very first data that was inserted into the table all the way up to data inserted minutes ago. But as the data is filtered this does not mean that all the data in the DB is requested in that query.
if there is only one "inserter" and only one "reader", you may want to switch to a specialised type (hashmap/list/deque/stack) or something more elaborated, in a programming language.
I will probably stick with MSSQL for the moment. It's not broke yet, just a little slow.
liggett78, do you suggest a clustered index on columns k1,k4,k5,k6,k3 or a non-clustered index on those columns?
My main question right now is should I extend the current clustered index to contain k4 also (this is the col with next most possible values) or should I just add a non-clustered index to k4.
Would adding all k1-k6 to a clustered index be an option? Then have a separate non-clustered index on the DateTime column for the ORDER BY? Am I correct in thinking that this would not cause any major increase in DB size but will only affect insert times. Can anyone guesstimate the effect this will have on inserts?
I think that if adding indexes to all the columns will double the DB size then it is not viable without large (ie. hardware) changes.
The following plan was run with an index (non clustered) on the DATE column.
Not sure if you can see the XML below so here is a link to it: http://conormccarthy.com/box/queryplan.sqlplan.txt
<?xml version="1.0" encoding="utf-16"?>
<ShowPlanXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" Version="1.0" Build="9.00.1399.06" xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
<StmtSimple StatementCompId="1" StatementEstRows="11111" StatementId="1" StatementOptmLevel="FULL" StatementSubTreeCost="625.754" StatementText="SELECT TOP(11111) d1 FROM hands WITH (NOLOCK)
WHERE k4 = '10'
AND k6 = 1
AND k5 = 1
AND k1 = 'IPN'
AND k3 BETWEEN 2 AND 10
ORDER BY k7 DESC
<StatementSetOptions ANSI_NULLS="false" ANSI_PADDING="false" ANSI_WARNINGS="false" ARITHABORT="true" CONCAT_NULL_YIELDS_NULL="false" NUMERIC_ROUNDABORT="false" QUOTED_IDENTIFIER="false" />
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="36">
<MissingIndex Database="[MYDB]" Schema="[dbo]" Table="[Hands]">
<Column Name="[k1]" ColumnId="1" />
<Column Name="[k4]" ColumnId="7" />
<Column Name="[k5]" ColumnId="9" />
<Column Name="[k6]" ColumnId="10" />
<Column Name="[k3]" ColumnId="6" />
<Column Name="[d1]" ColumnId="3" />
<Column Name="[k7]" ColumnId="4" />
<RelOp AvgRowSize="75" EstimateCPU="0.0011111" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="11111" LogicalOp="Top