#### Topic: Multiplying a 3x3 convolution filter kernel with an array

I have a 3x3 convolution kernel and an image stored as an array of integer pixel values. The kernel is defined like this:

    // composite convolution kernels:
    //
    // convolution kernel H =
    //         |  1, 0,  1 |
    // src x   |  0, 0,  0 |
    //         | -1, 0, -1 |
    //
    // convolution kernel V =
    //         |  1, 0, -1 |
    // src x   |  0, 0,  0 |
    //         |  1, 0, -1 |
    //
    // convolution kernel = kernel H + kernel V

I already have a C implementation, and now I am trying to port it to SSE:

    for (int inc = 0; inc < height - 2; inc++) {
        // load 16 pixels from each of the 3 rows
        str1_16pxs = _mm_loadu_si128((__m128i *)(src_all_str));
        str2_16pxs = _mm_loadu_si128((__m128i *)(src2_all_str));
        str3_16pxs = _mm_loadu_si128((__m128i *)(src3_all_str));

        // widen the first 8 pixels of each row from 8 to 16 bits
        str1_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str1_16pxs);
        str2_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str2_16pxs);
        str3_16pxs_pack1st_8to16 = _mm_cvtepu8_epi16(str3_16pxs);

        // --! the convolution for the first 8 pixels is done here
        // ... this is where the missing code should be inserted!
        // --

        // vertical sum of the first 8 widened pixels
        sum1_str12_vert_16pxs_pack1st_8to16 = _mm_add_epi16(str1_16pxs_pack1st_8to16, str2_16pxs_pack1st_8to16);
        sum1_str123_vert_16pxs_pack1st_8to16 = _mm_add_epi16(sum1_str12_vert_16pxs_pack1st_8to16, str3_16pxs_pack1st_8to16);

        for (int jnc = 0; jnc < (width >> 4); jnc++) {
            // shift the upper 8 pixels of each row into the low half
            str1_16pxs_plus_8pxs = _mm_srli_si128(str1_16pxs, 8);
            str2_16pxs_plus_8pxs = _mm_srli_si128(str2_16pxs, 8);
            str3_16pxs_plus_8pxs = _mm_srli_si128(str3_16pxs, 8);

            // widen the second 8 pixels (+8 px) from 8 to 16 bits
            str1_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str1_16pxs_plus_8pxs);
            str2_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str2_16pxs_plus_8pxs);
            str3_16pxs_pack2nd_8to16 = _mm_cvtepu8_epi16(str3_16pxs_plus_8pxs);

            // --! the convolution for the remaining 8 pixels, and so on to the end of the row
            // ... this is where the missing code should be inserted!
            // --

            // vertical sum of the second 8 widened pixels
            sum1_str12_vert_16pxs_pack2nd_8to16 = _mm_add_epi16(str1_16pxs_pack2nd_8to16, str2_16pxs_pack2nd_8to16);
            sum1_str123_vert_16pxs_pack2nd_8to16 = _mm_add_epi16(sum1_str12_vert_16pxs_pack2nd_8to16, str3_16pxs_pack2nd_8to16);

            // --! advance to load the next 16 pixels
            src_all_str += 16;
            src2_all_str += 16;
            src3_all_str += 16;

            // ...
            _mm_store_si128((__m128i *)(dst_all_str), res);
            dst_all_str += 8;
        } // for (jnc)
    } // for (inc)

I honestly do not know how to do the multiplication for a 3x3 kernel convolution with SSE along a row. I would be very grateful if someone could show me.
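One possible approach, offered only as a sketch: because every coefficient in H and V is 0, 1 or -1, no multiplication is needed. Each kernel response can be built from adds and subtracts on column-shifted loads of the top and bottom rows (the middle row has only zero coefficients). The code below is not taken from the original post. The function names `load8_u8_to_i16`, `conv3x3_hv_8px_sse2`, `conv3x3_hv_scalar` and `conv3x3_hv_image` are hypothetical, and it assumes the kernels are applied as written (no kernel flip), that "H + V" means a plain sum of the two responses, and that the output is 16-bit signed with `width - 2` values per row. It uses SSE2 only, with `_mm_unpacklo_epi8` against zero standing in for `_mm_cvtepu8_epi16`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Load 8 unsigned bytes and zero-extend them to eight 16-bit lanes
   (SSE2 counterpart of _mm_cvtepu8_epi16). */
static inline __m128i load8_u8_to_i16(const uint8_t *p)
{
    __m128i bytes = _mm_loadl_epi64((const __m128i *)p);
    return _mm_unpacklo_epi8(bytes, _mm_setzero_si128());
}

/* Hypothetical helper: 8 output pixels of (H response + V response).
   top and bot point at the left neighbour column of the first output pixel;
   each call reads 10 bytes from both rows. */
static void conv3x3_hv_8px_sse2(const uint8_t *top, const uint8_t *bot,
                                int16_t *dst)
{
    __m128i t0 = load8_u8_to_i16(top);      /* top row, left column     */
    __m128i t2 = load8_u8_to_i16(top + 2);  /* top row, right column    */
    __m128i b0 = load8_u8_to_i16(bot);      /* bottom row, left column  */
    __m128i b2 = load8_u8_to_i16(bot + 2);  /* bottom row, right column */

    /* H = | 1 0 1 ; 0 0 0 ; -1 0 -1 |  ->  (t0 + t2) - (b0 + b2) */
    __m128i h = _mm_sub_epi16(_mm_add_epi16(t0, t2), _mm_add_epi16(b0, b2));
    /* V = | 1 0 -1 ; 0 0 0 ; 1 0 -1 |  ->  (t0 - t2) + (b0 - b2) */
    __m128i v = _mm_add_epi16(_mm_sub_epi16(t0, t2), _mm_sub_epi16(b0, b2));

    /* Each response lies in [-510, 510], so the sum fits in 16 bits. */
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(h, v));
}

/* Scalar reference for one output pixel, also used for the row tail. */
static int conv3x3_hv_scalar(const uint8_t *top, const uint8_t *bot, int x)
{
    int h = (top[x] + top[x + 2]) - (bot[x] + bot[x + 2]);
    int v = (top[x] - top[x + 2]) + (bot[x] - bot[x + 2]);
    return h + v;
}

/* Whole image: dst holds (height - 2) rows of (width - 2) values. */
static void conv3x3_hv_image(const uint8_t *src, int16_t *dst,
                             int width, int height)
{
    assert(width >= 3 && height >= 3);
    for (int y = 0; y + 2 < height; ++y) {
        const uint8_t *top = src + (size_t)y * width;
        const uint8_t *bot = top + 2 * (size_t)width;
        int16_t *out = dst + (size_t)y * (width - 2);
        int x = 0;
        for (; x + 10 <= width; x += 8)     /* reads columns x .. x+9 */
            conv3x3_hv_8px_sse2(top + x, bot + x, out + x);
        for (; x + 2 < width; ++x)          /* scalar tail */
            out[x] = (int16_t)conv3x3_hv_scalar(top, bot, x);
    }
}
```

Note that if the two responses really are summed directly, the result reduces algebraically to `2 * (top_left - bottom_right)`. If the intended combination is instead something like the sum of absolute values, `h` and `v` would need to be made non-negative before the final `_mm_add_epi16`.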