关于MS SQLSERVER索引优化问题

发布时间：2020-12-12 15:55:55 所属栏目：MsSql教程来源：网络整理

导读：********************************************************** Author:黄山光明顶 mail:leimin@jxfw.com version:1.0.0 date:2004-1-30 *********************************************************** 有一网友问:关于MS SQLSERVER索引优化问题: ?有表Stress

**********************************************************

Author:黄山光明顶

mail:leimin@jxfw.com

version:1.0.0

date:2004-1-30

***********************************************************有一网友问:关于MS SQLSERVER索引优化问题:
?有表Stress_test(id? int,key char(2))
??????? id 上有普通索引;
????? ??key 上有簇索引;
????? ??id 有有限量的重复;
????? ??key 有无限量的重复;

现在我需要按逻辑与查询表中key='Az' AND key='Bw' AND key='Cv' 的id

求教高手最有效的查询语句

测试环境：
???? Hardware:P4 2.6+512M+80G
???? Software:windows server 2003(Enterprise Edition)+Sqlserver 2000 +sp3a

? 首先我们建立一个测试的数据，为使数据尽量的分布和随即，我们通过RAND()来随机产生2个随机数再组合成一个字符串，首先插入的数据是1,000,000条记录，然后在循环插入到58,000条记录。
?? 因为是随机产生的数据，所以如果你自己测试的数据集和我测试的会不一样，但对索引的优化和运行的效率是一样的。
?? 下面的“--//测试脚本”是产生测试数据的脚本，你可以根据需要修改 @maxgroup， @maxLoop的值，比如测试1百万的记录可以:

???? Select @maxgroup=1000
???? Select @maxLoop=1000

如果要测试5千万：

???? Select @maxgroup=5000
???? Select @maxLoop=10000

所以如果你的SERVER或PC比较慢，请耐心等待.....，
?（在我的PC上运行的速度是插入1百万条的时间是1.14m，插入5千八百万条的时间是19.41m，重新建立INDEX的时间是34.36m)

作为一般的开发人员很容易就想到的语句：

?? --语句1

??? select a.[id] from?
??? (select distinct [id] from stress_test where [key] = 'Az') a,
??? (select distinct [id] from stress_test where [key] = 'Bw') b,
??? (select distinct [id] from stress_test where [key] = 'Cv') c
??? where a.id = b.id and a.id = c.id

?? --语句2

???? select [id]?
?????from stress_test?
???? where [key]='Az' or [key]='Bw' or [key]='Cv'
???? group by id having(count(distinct [key])=3)??

?? --语句5

??? SELECT distinct a.[id] FROM stress_test AS a,stress_test AS b,stress_test AS c
??? WHERE a.[key]='Az' AND b.[key]='Bw' AND c.[key]='Cv'
????? AND a.[id]=b.[id] AND a.[id]=c.[id]

但作为T-SQL的所谓“高手”可能会认为这种写法很“土”，也显得没有水平，所以会选择一些子查询和外连接的写法，按常理子查询的效率是比较高的：

?? --语句3

??? select distinct [id] from stress_test A where
??? not exists (
??? select 1 from
???? (select 'Az' as k union all select 'Bw' union all select 'Cv') B
??? left join stress_test C on? C.id=A.id and B.[k]=C.[key]
??? where C.id is null)

?? --语句4

???? select distinct a.id from stress_test a
???? where not exists
???? ( select * from keytb c
????? where not exists
????? ( select * from stress_test b
?????? where
???????????? b.id = a.id
???????????? and
???????????? c.kf1 = b.[key]
????? ?)
?????)

我们先分析这几条语句（针对5千8百万条数据进行分析）：

请大家要特别留心Estimated row count的值。

语句1:从执行规划中我们可以看出，MSSQLSERVER选择的索引优化非常有规律，先通过CLUSTERED INDEX筛选出符合[KEY]='Az'条件的ID,然后进行HASH MATCH,在找出ID相等的；依次类推最终检索到符合所有条件的记录。中间的Estimated row count的值都不大。

语句2:从执行规划中我们可以看出，是先通过CLUSTERED INDEX筛选出符合 [key]='Az' or [key]='Bw' or [key]='Cv' 符合所有条件的ID，然后分组进行2次HASH MATCH 所有的ID。我们可以看出Estimated row count的值是越来越少，从最初的369,262到最后排序的只有402。

语句3:从执行规划中我们可以看是非常复杂的，是先通过3组通过CONSTANT SCAN和NON-CLUSTERED INDEX检索出符合 A.ID=C.ID AND [key]='**'?的记录3组，然后分组进行外键匹配，再将3组的数据合并，排序，然后再和一个NON-CLUSTERED INDEX检索出的记录集进行外键匹配，我们可以看出MSSQLSERVER会对所有的记录（5千万条）记录进行分组，Estimated row count的值是：58,720,所以这句T-SQL的瓶颈是对5千万条记录进行分组。

语句4:从执行规划中我们可以看和语句3有相似之处，都要对所有的记录（5千万条）记录进行分组，所以这是检索的瓶颈，而且使用的索引都是NON-CLUSTERED INDEX。

语句5：从执行规划中我们可以看出，先通过CLUSTERED INDEX检索出符合[Key]='Az'的记录集，然后进行HASH MATCH和SORTS,因为数量少所以是非常会的，在和通过NON-CLUSTERED INDEX检索[KEY]='Bw'的记录进行INNER JOIN,在和通过CLUSTERED INDEX检索[KEY]='Cv'的记录进行合并,最后是对4百万条数据进行分组检索，如果是6列，我们可以看出Estimated row count的值是递增，越来越大，最后的分组检索的Estimated row count的值是3.46E+15,这已经形成巨大的瓶颈。

我们测试百万条以上的记录：
?1.先对1百万条记录进行测试（选取3列）
?2.先对1百万条记录进行测试（选取6列）
?3.对5千万条数据测试（选取3列）
?4.对5千万条数据测试（选取6列）

统计表1：
?----------------------------------------------------------------------
?|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
?| 1百万（3列）??? 0.77%???? 0.41%??? 49.30%???? 48.99%???? 0.52%
?| 1百万（6列）???? 1.61%???? 0.81%??? 48.99%???? 47.44%???? 1.14%
?| 5千万（3列）???? 0.14%???? 0.18%??? 48.88%???? 48.86%???? 1.93%
?| 5千万（6列）???? 0.00%???? 0.00%???? 0.00%????? 0.00%?? 100.00%
统计表2：
?----------------------------------------------------------------------
?|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
?| 1百万（3列）???? 9ms?????? 22ms???? 723ms???? 753ms????? 4ms
?| 1百万（6列）????? 15ms????? 38ms???? 764ms???? 773ms???? 11ms
?| 5千万（3列）???? 575ms???? 262ms? 110117ms? 110601ms? 12533ms
?| 5千万（6列）??? 1070ms???? 576ms? 107988ms? 109704ms???? 10m以上

测试总结：（我们可以比较关注：语句 2和语句 5）
1.在1百万条记录的情况下，语句 5是最快的，但在5千万条记录下是最慢的。这说明INDEX的优化一定的情况下，数据量不同，检索的效率也是不同的。我们平时在写T-SQL时一般关注的时INDEX的使用，只要我们写的T-SQL是利用CLUSTERED INDEX，我们就认为是最优化了，其实这是一个误区，我们还要关注Estimated row count的值，大量的I/O操作是我们应该关注的，所以我们应该根据数据量的不同选择相应的T-SQL语句，不要认为在小数据量下是最高的在大数据量的状态下也许是最慢的:-(。

2.在执行规划中最快的，并不是运行最快的，我们可以看在1百万（6列）在这行中，语句 2和语句 5的比例是0.81％：1.14％，但实际的运行效率是，38ms:11ms。所以，我们在选择T-SQL是要考虑本地I/O的速度，所以在优化语句时不仅要看执行规划还要计算一下具体的效率。

在测试的语句上加入：

??? SET STATISTICS TIME? ON/OFF
????SET STATISTICS IO? ON/OFF
是一个很好的调试方法。

3.综合评价，语句 2的效率是最高的，执行效率没有随数据量变化而有很大的差别。

4.执行规划越简单的语句（语句1），综合效率越高，反之则越低（语句3，语句4）。

5.在平时写T-SQL语句时，一定要根据不同的数据量进行测试，虽然都是用CLUSTERED INDEX，但检索的效率却大相径庭。

--//测试脚本
USE Northwind
GO
if exists(select * from sysobjects where name=N'stress_test' and type='U')
Drop table stress_test
GO
--//定义测试的表stress_test,存放所有的测试数据
Create table stress_test([id] int,[key] char(2))

GO
--//插入测试的数据
Set nocount on
--//变量定义
Declare @id int ??--//Stress_test ID 值
Declare @key char(2)??--//Stress_test [key] 值
Declare @maxgroup int??--//组最大的循环数
Declare @maxLoop int??--//ID最大的循环数
Declare @tempGroup int??--//临时变量
Declare @tempLoop int??--//临时变量
Declare @tempint1 int??--//临时变量
Declare @tempint2 int??--//临时变量
Declare @rowcount int??--//记录事务提交的行数

--//初始化变量
Select @id=1
Select @maxgroup=1000
Select @maxLoop=1000
Select @tempGroup=1
Select @tempLoop=1
Select @key=''
Select @rowcount=0

while @tempLoop<=@maxLoop
begin
?while @tempGroup<=@maxGroup
?begin
??select @tempint1=65+convert(int,rand()*50)
??select @tempint2=65+convert(int,rand()*100)
??if (@tempint1>=122 or @tempint2>=122)
????begin
????select @tempint1=@tempint1-100
????select @tempint2=@tempint2-100
???
????if (@tempint1<=65 or @tempint2<=65)
?????begin
?????select @tempint1=@tempint1+57
?????select @tempint2=@tempint2+57
????end
???end
??select @key=char(@tempint1)+char(@tempint2)
??if @rowcount=0
??begin tran ins
???? insert into stress_test([id],[key])values(@id,@key)
?? ??? select @rowcount=@rowcount+1
??
?? if @rowcount>3000 --//判断当行数达到3000条时，开始提交事务
???begin
?????? commit tran ins
????? select @rowcount=0
???end
??
??select @tempGroup=@tempgroup+1
?end
?if @rowcount>0
?begin
??commit tran ins
??select @rowcount=0
?end

?select @tempGroup=1
?select @id=@id+1
?select @tempLoop=@tempLoop+1
end
GO
--//删除KEY值为NULL的记录
delete stress_test where [key]is null
GO
--//建立簇索引PK_STRESS
Create Clustered index pk_stress on stress_test([Key])
--//建立非簇索引NI_STRESS_ID
Create NonClustered index NI_stress_id on stress_test([id])
GO
--//定义测试的表keytb
if exists(select * from sysobjects where name=N'keytb' and type='U')
Drop table keytb
GO
create table keytb?? -----//存放你需要匹配的值的表
(
? kf1? varchar(20)
)

--//存放你需要匹配的值，暂定为三个
insert into keytb(kf1) values('Az');
insert into keytb(kf1) values('Bw');
insert into keytb(kf1) values('Cv');

--insert into keytb(kf1) values('Du');
--insert into keytb(kf1) values('Ex');
--insert into keytb(kf1) values('Fy');
GO

下面我们就开始测试几种T-SQL的INDEX优化问题：

--先对1百万条/1亿条记录进行测试（选取3列）的T-SQL:

PRINT '第一种语句：'
SET STATISTICS TIME? ON
SET STATISTICS IO? ON
select a.[id] from
(select distinct [id] from stress_test where [key] = 'Az') a,
(select distinct [id] from stress_test where [key] = 'Bw') b,
(select distinct [id] from stress_test where [key] = 'Cv') c
where a.id = b.id and a.id = c.id
GO
PRINT '第二种语句：'
select [id]
from stress_test?
where [key]='Az' or [key]='Bw' or [key]='Cv'
group by id having(count(distinct [key])=3)
GO
PRINT '第三种语句：'
select distinct [id] from stress_test A where
not exists (
select 1 from
(select 'Az' as k union all select 'Bw' union all select 'Cv') B
left join stress_test C on? C.id=A.id and B.[k]=C.[key]
where C.id is null)
GO
PRINT '第四种语句：'
select distinct a.id from stress_test a
?where not exists
?( select * from keytb c
?? where not exists
?? ( select * from stress_test b
???? where
?????? b.id = a.id
?????? and
?????? c.kf1 = b.[key]
?? )
?)
GO
PRINT '第五种语句：'
SELECT distinct a.[id] FROM stress_test AS a,stress_test AS c
WHERE a.[key]='Ac' AND b.[key]='Bb' AND c.[key]='Ca'
????? AND a.[id]=b.[id] AND a.[id]=c.[id]

GO
SET STATISTICS TIME? OFF
SET STATISTICS IO? OFF

--先对1百万条/1亿条记录进行测试（选取6列）的T-SQL:
PRINT '第一种语句：'
SET STATISTICS TIME? ON
SET STATISTICS IO? ON
select a.[id] from
(select distinct [id] from stress_test where [key] = 'Az') a,
(select distinct [id] from stress_test where [key] = 'Cv') c,
(select distinct [id] from stress_test where [key] = 'Du') d,
(select distinct [id] from stress_test where [key] = 'Ex') e,
(select distinct [id] from stress_test where [key] = 'Fy') f
where a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]
GO
PRINT '第二种语句：'
select [id]
from stress_test?
where [key]='Az' or [key]='Bw' or [key]='Cv' or [Key]='Du'or [Key]='Ex'or [Key]='Fy'
group by id having(count(distinct [key])=6)
GO
PRINT '第三种语句：'
select distinct [id] from stress_test A where
not exists (
select 1 from
(select 'Az' as k union all select 'Bw' union all select 'Cv'union all select 'Du'union all select 'Ex'union all select 'Fy') B
left join stress_test C on? C.id=A.id and B.[k]=C.[key]
where C.id is null)
GO
PRINT '第四种语句：'
select distinct a.id from stress_test a
?where not exists
?( select * from keytb c
?? where not exists
?? ( select * from stress_test b
???? where
?????? b.id = a.id
?????? and
?????? c.kf1 = b.[key]
?? )
?)
GO
PRINT '第五种语句：'
SELECT distinct a.[id] FROM stress_test AS a,stress_test AS c,stress_test AS d,stress_test AS e,stress_test AS f
WHERE a.[key]='Az' AND b.[key]='Bw' AND c.[key]='Cv' AND d.[key]='Du' AND e.[key]='Ex' AND f.[key]='Fy'
???? and a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]

GO
SET STATISTICS TIME? OFF
SET STATISTICS IO? OFF?

我们可以先测试一下小的数据量（50000条）；

大家可以下面测试脚本的：

?? Select @maxgroup=500
?? Select @maxLoop=100

----------------------------------------------------------------------
?|------------------语句 1----语句 2----语句 3----语句 4----语句 5----|
?| 5万（3列）??????? 5ms?????? 19ms???? 37ms???? 59ms????? 0ms
?| 5万（6列）??????? 1ms?????? 26ms???? 36ms???? 36ms???? 1ms
?

从测试的的数据来看，语句5的效率是最高的，几乎没有花费时间，而语句2的效率只能说是一般。如果测试到这里就结束了，我们可以毫不犹豫的选择语句 5 :-(,继续进行下面的测试.....

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!