Topic: 2D-Linq and optimization of numeral filters - 2

We continue speculations the Author: Sinclair Date: 25.06 14:16. Upon, last time we received only first half of title - 2D-Linq. Now we talk about optimization. Lag in 2.3 - 2.7 times is caused, apparently, by presence of an indirect call. As it well-known, calls of delegates  are not subject. The motivation is for this purpose obsolete approximately about ten years ago, but restrictions which faces JIT, all the same. Therefore in our dense cycle there are constant indirect calls all same kernel, transferred to us outside. Unlike JIT and C#, we precisely know, how many time will be executed this same delegate; we can appear favourable dynamic to construct the code similar to ours PerfectFourNeighborAverage, and to execute it. No sooner said than done. Skilled  for certain would replace method signature Select so that it accepted Expression <Kernel <T>> instead of the delegate, and on the fly would collect the code of iteration procedure from cycles and expressions. Unfortunately, my qualification on it did not suffice, therefore to work we are up to standard MSIL. A decision jacket - method GenerateFilter (): public static Filter <T> GenerateFilter <T> (Kernel <T> kernel) {var km = KernelMeasure. Measure (kernel); DynamicMethod dm = new DynamicMethod ("Filter" +kernel. Method. Name, typeof (T []), new Type [] {typeof (object), typeof (T [])}); var ilg = dm. GetILGenerator (); var result_var = ilg. DeclareLocal (typeof (T [])); var i_var = ilg. DeclareLocal (typeof (int)); var j_var = ilg. DeclareLocal (typeof (int)); var h_var = ilg. DeclareLocal (typeof (int)); var w_var = ilg. DeclareLocal (typeof (int)); ilg. Emit (OpCodes. Ldarg_1);//source ilg. Emit (OpCodes. Ldc_I4_0);//0 ilg. Emit (OpCodes. Call, typeof (T []).GetMethod ("GetLength", new Type [] {typeof (int)})); ilg. Emit (OpCodes. Stloc, h_var); ilg. Emit (OpCodes. Ldarg_1);//source ilg. Emit (OpCodes. Ldc_I4_1);//1 ilg. Emit (OpCodes. Call, typeof (T []).GetMethod ("GetLength", new Type [] {typeof (int)})); ilg. Emit (OpCodes. Stloc, w_var); ilg. Emit (OpCodes. Ldloc, h_var); ilg. Emit (OpCodes. Ldloc, w_var); ilg. Emit (OpCodes. Newobj, typeof (T []).GetConstructor (new Type [] {typeof (int), typeof (int)})); ilg. Emit (OpCodes. Stloc, result_var); ilg. EmitIncrease (h_var,-km.ymax); ilg. EmitIncrease (w_var,-km.xmax); ilg. EmitFor (i_var, 0-km.xmin, h_var, () => {ilg. EmitFor (j_var, 0-km.ymin, w_var, () => {ilg. Emit (OpCodes. Ldloc, result_var); ilg. var ki = new KernelInliningILInstructionVisitor <T> (ilg, i_var, j_var);//the most interesting - here! new ILReaderkernel. Method).Accept (ki); ilg. Emit (OpCodes. Call, typeof (T []).GetMethod ("Set", new Type [] {typeof (int), typeof (int), typeof (T)}));});}); ilg. Emit (OpCodes. Ldloc, result_var); ilg. Emit (OpCodes. Ret); var inlined = (Func <object, T [], T []>) dm. CreateDelegate (typeof (Func <object, T [], T []>)); return (data) => inlined (kernel. Target, data);} As a whole - anything the military man: the method simply generates about the same MSIL-code which generates for PerfectFourNeighborAverage () the compiler C#: - we calculate the sizes of an initial array - we create a target array of the same size - it is reduced variable "width" and "heights" by the appropriate sizes of a kernel of the filter - we build two nested loops. For simplification of result of the code class ILWriter with pair helpers-methods like EmitFor and EmitIncrease which hide under a cowl routine operation to destination labels and to typical sequences of loading of a variable/loading of an item/addition/preservation is written. An interesting part - embedding of the code of a kernel in the code of our method: var ki = new KernelInliningILInstructionVisitor <T> (ilg, i_var, j_var); new ILReader (kernel. Method).Accept (ki); It is based on packet ILReader which allows to fulfill bypass of instructions of a MSIL-stream. Class KernelInliningILInstructionVisitor implements the following strategy of bypass: All instructions are copied in transferred to it ILGenerator. Each copied instruction is supplied with a label (ilg. MarkLabel ()), and its offset is remembered in the internal table. It is made accurately to copy various branch-instructions: at reading of a flow of labels is not present, there are only offsets. The instruction ret is thrown out at copying; the result of a method simply remains on a stack - in accuracy how would be after a real call one more trick Becomes: operation changeover. In an initiating variant of the code of method Select in kernel the link on IRelativeAccess2d was transferred. Basically, it would be possible it and to leave - to construct local variable, to update its fields on each iteration, and to replace all instructions ldard.1 on corresponding ldloc. Let further JIT understands. But from a congenital paranoia and to mistrust to JIT, I replace method calls IRelativeAccess2d <T>.get_Item with method calls T [].Get (). Thus we know that at the moment of a call at us in a stack values arg.1, dx, dy lie. Before their call it is necessary to replace on data, i+dx, j+dy. In the magic image data at us just lies in arg.1 a generated method, therefore it can to be touched. And to replace arguments on a stack, in a method  the following sequence of instructions: Target. Emit (OpCodes. Stloc, dy);//it is got one more variable for temporal storage dy. A stack after operation: [data, dx]. Target. Emit (OpCodes. Ldloc, i);//[data, dx, i] Target. Emit (OpCodes. Add);//[data, i+dx] Target. Emit (OpCodes. Ldloc, j);//[data, i+dx, j] Target. Emit (OpCodes. Ldloc, dy);//[data, i+dx, j, dy] Target. Emit (OpCodes. Add);//[data, i+dx, j+dy] Target. Emit (OpCodes. Call, _arrGetter);//[data [i+dx, j+dy]] the Last exercise - a binding to a context. The method in kernel can be any, including the instance method (in our case indeed, there is nothing to short). Its code easily can address to this, therefore it is impossible to build in simply it as is. Therefore at generated DynamicMethod not one parameter, and two - the first appears Target the delegate transferred in Kernel. Therefore from GenerateFilter the local function adding to data original Target is returned not itself DynamicMethod, and: return (data) => inlined (kernel. Target, data); Now method Select looks simple to impropriety: public static T [] Select <T> (this T [] source, Kernel <T> kernel) => GenerateFilter (kernel) (source); Well, and now - to a dessert://* Summary * BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1088 (1703/CreatorsUpdate/Redstone2) Intel Core i7-6600U CPU 2.60GHz (Skylake), 1 CPU, 4 logical and 2 physical cores Frequency=2742190 Hz, Resolution=364.6720 ns, Timer=TSC [Host]:.NET NET Framework 4.6.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2650.0 LegacyJitX86: a. net Framework 4.6.1 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2650.0 Job=LegacyJitX86 Jit=LegacyJit Platform=X86 Runtime=Clr Method | Mean | Error | StdDev | Scaled | ScaledSD | Gen 0 | Gen 1 | Gen 2 | Allocated |--------- |--------- neutral---------- neutral---------- neutral------- neutral--------- neutral---------- neutral--------- neutral--------- neutral---------- neutral Perfect | 10.42 ms | 0.2342 ms | 0.2505 ms | 1.00 | 0.00 | 156.2500 | 156.2500 | 156.2500 | 3.81 MB | SafeLinq | 10.50 ms | 0.2057 ms | 0.2368 ms | 1.01 | 0.03 | 156.2500 | 156.2500 | 156.2500 | 3.81 MB | Linq | 30.95 ms | 0.6184 ms | 1.0160 ms | 2.97 | 0.12 | 9687.5000 | 250.0000 | 250.0000 | 22.81 MB | Total, high-speed performance of the generated method - in limits . in relation to an "ideal" variant! Even to cache the generated dynamic method to us it is not necessary - at such sizes of arrays generation and JIT occupies  small time. But whether it is necessary to stop on the reached? In the following series we look, whether it is possible to fulfill from d in data select (d [-1, 0] + d [1, 0] + d [0, 1] + d [0,-1]) / 4 faster, than "the elementary double cycle with one operator inside" which "is written" on the automatic machine ", without thinking almost as similar any skilled programmer wrote many times."