前言
当多个循环嵌套的时候,里面有非常多的冗余代码。在.Net6只能对一个循环进行局部优化。然在.Net7或者8里面,可以对多个循环进行全面优化。这种优化技术7之前叫循环【提升或克隆】,7之后则叫循环深度【提升或克隆】。
具体怎么样呢?来看看.
概括
1.先看看循环提升或者循环深度提升
以下引用官方代码:
using BenchmarkDotNet.Attributes;using BenchmarkDotNet.Running;using System;using System.Runtime.CompilerServices;[DisassemblyDiagnoser]public partial class Program{static void Main(string[] args){BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);}[Benchmark]public void Compute(){for (int thousands = 0; thousands < 10; thousands++){for (int hundreds = 0; hundreds < 10; hundreds++){for (int tens = 0; tens < 10; tens++){for (int ones = 0; ones < 10; ones++){int n = ComputeNumber(thousands, hundreds, tens, ones);Process(n);}}}}}static int ComputeNumber(int thousands, int hundreds, int tens, int ones) =>(thousands * 1000) +(hundreds * 100) +(tens * 10) +ones;[MethodImpl(MethodImplOptions.NoInlining)]static void Process(int n) { }}运行:dotnet run -c Release -f net7.0 --filter *Program*
可以看到compute这个函数有四层循环,最里面的一层循环调用了ComputeNumber和Process函数。.Net6的Compute函数的ASM如下:
xor esi,esiM00_L00:xor edi,ediM00_L01:xor ebx,ebxM00_L02:xor ebp,ebpimul ecx,esi,3E8imul eax,edi,64add ecx,eaxlea eax,[rbx+rbx*4]lea r14d,[rcx+rax*2]M00_L03:lea ecx,[r14+rbp]call Program.Process(Int32)inc ebpcmp ebp,0Ajl short M00_L03inc ebxcmp ebx,0Ajl short M00_L02inc edicmp edi,0Ajl short M00_L01inc esicmp esi,0Ajl short M00_L00add rsp,20
Compute函数最里面的循环调用的ComputeNumber函数计算的代码:
(thousands * 1000) +(hundreds * 100) +(tens * 10) +ones;imul ecx,esi,3E8imul eax,edi,64
注意看这两个imul,它只是被提升到了有里到外,第二层循环的位置。这样就导致了这两段代码多运行了10*10次,完全是可以避免的。所以它在.Net6里面只能叫循环提升。
再看下.Net8里面它的ASM
xor esi,esiM00_L00:xor edi,ediimul ebx,esi,3E8M00_L01:xor ebp,ebpimul r14d,edi,64add r14d,ebxM00_L02:xor r15d,r15dlea ecx,[rbp+rbp*4]lea r12d,[r14+rcx*2]M00_L03:lea ecx,[r12+r15]call qword ptr [7FFEE5471168]; Program.Process(Int32)inc r15dcmp r15d,0Ajl short M00_L03inc ebpcmp ebp,0Ajl short M00_L02inc edicmp edi,0Ajl short M00_L01inc esicmp esi,0Ajl short M00_L00add rsp,20
这个 imul ecx,esi,3E8提升到第一层位置,imul eax,edi,64提升到第二层的位置。减少了冗余的运行次数,在.Net7及8里面进行了大幅度优化,所以叫做深度循环提升。
2.循环深度克隆
循环深度克隆,把这个循环优化成没有边界检查的快速循环,而如果需要检查的,则在慢速路径中进行边界检查。在.Net6里面只能从低值到高值(i<=这种)的优化,.Net7之后则可以高值到低值优化(也就是i>=这种)。双向优化皆可。所以叫做深度克隆。
先上代码:
private int[] _values = Enumerable.Range(0, 1000).ToArray();[Benchmark][Arguments(0, 0, 1000)]public int LastIndexOf(int arg, int offset, int count){int[] values = _values;for (int i = offset + count - 1; i >= offset; i--)if (values[i] == arg)return i;return 0;}
.Net6 ASM:
sub rsp,28mov rcx,[rcx+8]lea eax,[r8+r9-1]cmp eax,r8djl short M00_L01mov r9d,[rcx+8]nop word ptr [rax+rax]M00_L00:cmp eax,r9djae short M00_L03movsxd r10,eaxcmp [rcx+r10*4+10],edxje short M00_L02dec eaxcmp eax,r8djge short M00_L00M00_L01:xor eax,eaxadd rsp,28retM00_L02:add rsp,28retM00_L03:call CORINFO_HELP_RNGCHKFAILint 3
在判断values[i] == arg之前,进行了边界检查。但是这个边界检查并不是必要的,所以在.Net8优化如下:
sub rsp,28mov rax,[rcx+8]lea ecx,[r8+r9-1]cmp ecx,r8djl short M00_L02test rax,raxje short M00_L01mov r9d,ecxor r9d,r8djl short M00_L01cmp [rax+8],ecxjle short M00_L01M00_L00:mov r9d,ecxcmp [rax+r9*4+10],edxje short M00_L03dec ecxcmp ecx,r8djge short M00_L00jmp short M00_L02M00_L01:cmp ecx,[rax+8]jae short M00_L04mov r9d,ecxcmp [rax+r9*4+10],edxje short M00_L03dec ecxcmp ecx,r8djge short M00_L01M00_L02:xor eax,eaxadd rsp,28retM00_L03:mov eax,ecxadd rsp,28
优化之后的代码,先是判断:
if(i >= offset + count - 1 && offset!=0 &&value!=0 && value[0] > offset + count - 1 ){if(value[i]==arg) //不进行边界检查{return;}}else{if(value[i]==arg) //进行边界检查{return;}}