1

Topic: Calculation on GPU

Greetings! I have zero experience programming video cards, and my knowledge in this area is close to zero as well, so if the question seems elementary or silly, please don't kick me too hard. I need to figure out whether it makes sense to use a GPU instead of the CPU for the following task. There is a fairly simple iterative algorithm. It performs on the order of a hundred iterations and operates on numbers of type double. It takes a few dozen numbers as input and produces about 10 numbers as output. On a good CPU the algorithm runs completely in roughly 400 microseconds for one set of input parameters. The algorithm has to be run for many different sets of input parameters; the number of sets ranges from 100 to 5000. All data sets are independent and available simultaneously (in RAM). The goal is to finish recomputing all the data sets as quickly as possible. Questions: Is it possible to run this calculation on a GPU? How many sets could be processed simultaneously? What speedup can I expect compared to a CPU that processes all the sets sequentially, one after another? Where should I expect bottlenecks and problems? Where can I find a code sample that does something similar? The task is purely mathematical and does not involve displaying anything on the screen. Thanks

2

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: DKM> The algorithm has to be run for many different sets of input parameters; the number of sets ranges from 100 to 5000. All data sets are independent and available simultaneously (in RAM). So all in all that's 2 seconds maximum? It seems to me that for a start you should try parallelizing the calculation across all the cores and vectorizing it with something like SSE4 on the CPU, and don't forget about caching.

3

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: DKM> Greetings! DKM> I have zero experience programming video cards, and my knowledge in this area is close to zero as well, so if the question seems elementary or silly, please don't kick me too hard. DKM> I need to figure out whether it makes sense to use a GPU instead of the CPU for the following task. https://www.youtube.com/watch?v=G-W0mVL … mqec_u7kcV DKM> There is a fairly simple iterative algorithm. It performs on the order of a hundred iterations and operates on numbers of type double. It takes a few dozen numbers as input and produces about 10 numbers as output. On a good CPU the algorithm runs completely in roughly 400 microseconds for one set of input parameters. DKM> The algorithm has to be run for many different sets of input parameters; the number of sets ranges from 100 to 5000. All data sets are independent and available simultaneously (in RAM). DKM> The goal is to finish recomputing all the data sets as quickly as possible. DKM> Questions: DKM> Is it possible to run this calculation on a GPU? DKM> How many sets could be processed simultaneously? DKM> What speedup can I expect compared to a CPU that processes all the sets sequentially, one after another? DKM> Where should I expect bottlenecks and problems? DKM> Where can I find a code sample that does something similar? https://www.physics.drexel.edu/~vallier … l_2010.pdf http://ecee.colorado.edu/~siewerts/extr … mples.html DKM> The task is purely mathematical and does not involve displaying anything on the screen. DKM> Thanks

4

Re: Calculation on GPU

Hello, Kernan, you wrote: K> So all in all that's 2 seconds maximum? About that. K> It seems to me that for a start you should try parallelizing the calculation across all the cores and vectorizing it with something like SSE4 on the CPU, and don't forget about caching. Those options are being considered too, but I have much more clarity about them. What I would like to understand is the prospects of doing this on a GPU.

5

Re: Calculation on GPU

Hello, kov_serg, you wrote: DKM>> I need to figure out whether it makes sense to use a GPU instead of the CPU for the following task. _> https://www.youtube.com/watch?v=G-W0mVL … mqec_u7kcV Thanks, I'll take a look. But before I start digging deeply into the problem myself, perhaps one of the locals can tell me whether it is worth bothering with at all.

6

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: DKM> It performs on the order of a hundred iterations and operates on numbers of type double. double performance is cut back extremely hard on GPUs; it is only more or less decent on older cards. Peak on a Titan Black is 1881 GFLOPS, while the new GTX 1080 gives only 277! An i7-6950X, for comparison: 240!!! And CPUs are much more effective on algorithms with complex branching. http://www.geeks3d.com/20140305/amd-rad … computing/ If the algorithm contains a lot of branching, it will not be accelerated on a GPU. The maximum performance is squeezed out of a GPU only when many operations are performed over the data, for example multiplying things in a loop. If there are a lot of conditionals, they cause stalls and everything can run extremely slowly.

7

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: DKM> Thanks, I'll take a look. But before I start digging deeply into the problem myself, perhaps one of the locals can tell me whether it is worth bothering with at all. It seems to me that with fairly moderate effort these two seconds can be squeezed down several times on a normal CPU (in fact, simply parallelizing the task across 8 threads turns 2 seconds into 1/4 of a second, provided that with hyper-threading the threads don't end up competing for the FPU). The question is what effort you are willing to invest to get a faster result, and how badly you actually need that speedup.
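The CPU-threading variant mentioned above is easy to try. A minimal sketch, assuming a hypothetical `compute_one_set` stand-in for the poster's actual 400 µs algorithm (the real one is not shown in the thread): the independent data sets are simply striped across N worker threads.

```cpp
#include <cmath>
#include <thread>
#include <vector>

// Hypothetical stand-in for the poster's iterative algorithm:
// takes a few dozen doubles, runs ~100 iterations, returns ~10 doubles.
static std::vector<double> compute_one_set(const std::vector<double>& in) {
    std::vector<double> out(10, 0.0);
    for (int iter = 0; iter < 100; ++iter)
        for (std::size_t i = 0; i < in.size(); ++i)
            out[i % 10] += std::sin(in[i]) * iter;
    return out;
}

// Stripe the independent data sets across n_threads workers.
static std::vector<std::vector<double>>
run_all(const std::vector<std::vector<double>>& sets, unsigned n_threads) {
    std::vector<std::vector<double>> results(sets.size());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&, t] {
            // Each worker writes only its own slots, so no locking is needed.
            for (std::size_t i = t; i < sets.size(); i += n_threads)
                results[i] = compute_one_set(sets[i]);
        });
    for (auto& w : workers) w.join();
    return results;
}
```

Since the sets are independent and the results array is pre-sized, the workers never touch the same element and the whole thing needs no synchronization beyond the final joins.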

8

Re: Calculation on GPU

Hello, Pzz, you wrote: Pzz> The question is what effort you are willing to invest to get a faster result, and how badly you actually need that speedup. Let's assume I am willing to spend any amount of effort to get the fastest possible result. What can I expect at best?

9

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: Pzz>> The question is what effort you are willing to invest to get a faster result, and how badly you actually need that speedup. DKM> Let's assume I am willing to spend any amount of effort to get the fastest possible result. What can I expect at best? I don't know. But I would think, taking into account that the video card has to be initialized, the code loaded into it, and so on, tens to hundreds of milliseconds. I.e., comparable to what you already have.

10

Re: Calculation on GPU

Hello, Pzz, you wrote: Pzz> I don't know. But I would think, taking into account that the video card has to be initialized, the code loaded into it, and so on, tens to hundreds of milliseconds. I.e., comparable to what you already have. I didn't phrase the question well. We need to run this calculation not just once, but constantly throughout the day (on different data). That is, after finishing one batch of independent data sets we will need to compute the next batch, and so on. If the very first batch is computed slowly because of initialization, that's fine. What speedup can I expect on subsequent runs of the same algorithm on other data sets?

11

Re: Calculation on GPU

Hello, DKM_MSFT, you wrote: DKM> What speedup can I expect on subsequent runs of the same algorithm on other data sets? Well, how are we supposed to know what speedup is possible in your case? Every program is different. Still, since the fp64 unit is cut down on most cards, you can simply divide the number of operations on doubles in your calculation by the fp64 GFLOPS figure taken from the card's specification. That will be the best-case estimate. It will agree with the achievable result to within an order of magnitude, provided the calculation contains no pitfalls, for example the already-mentioned chaotic branching (actually, not all branching is expensive on a GPU, only branches after which the program takes different paths), and provided you stream the calculations (aka streaming); if you don't stream, you also have to account for the time to transfer data over the bus from CPU to GPU and back (simply divide the volume of input and output parameters by the bus speed). But really, at this point I would just measure the performance. It's faster than waiting for answers here, and, more importantly, the result will be far more reliable. For example, you could try implementing it in CUDA because, first, that is simple enough, and, second, CUDA code runs quite fast since it is close to the hardware. Take, for example, the simplest CUDA program from the documentation. Substitute the code of your iterative algorithm for VecAdd and measure the runtime on the 5000 parameter sets. Since all your calculations are independent, you are unlikely to get a large slowdown in the GPU code right away. The only advice, probably, is to make sure that all calculations are done on local variables (i.e. in registers), and that the global arrays (i.e. A, B and C from the example) are touched as rarely as possible (ideally once each way: at the beginning to read the input parameters, and at the end to write the answer).
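A minimal sketch of what that looks like, under stated assumptions: the `VecAdd` skeleton from the CUDA documentation with the body replaced by one thread per independent parameter set, and the placeholder loop standing in for the poster's real algorithm (the set sizes `N_IN`/`N_OUT` and the kernel name are illustrative). Global memory is read once at the start and written once at the end, as the post advises.

```cuda
#include <cuda_runtime.h>

#define N_IN   32   // a few dozen input numbers per set (assumed)
#define N_OUT  10   // about 10 output numbers per set
#define N_ITER 100  // ~100 iterations of the algorithm

// One thread handles one independent parameter set.
__global__ void ProcessSets(const double* in, double* out, int n_sets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_sets) return;

    // Read the inputs from global memory once, into local variables.
    double x[N_IN];
    double y[N_OUT] = {0};
    for (int k = 0; k < N_IN; ++k)
        x[k] = in[i * N_IN + k];

    // Placeholder for the real iterative algorithm.
    for (int it = 0; it < N_ITER; ++it)
        for (int k = 0; k < N_IN; ++k)
            y[k % N_OUT] += x[k] * it;

    // Write the results back to global memory once, at the end.
    for (int k = 0; k < N_OUT; ++k)
        out[i * N_OUT + k] = y[k];
}

int main()
{
    const int n_sets = 5000;
    double *d_in, *d_out;
    cudaMalloc(&d_in,  n_sets * N_IN  * sizeof(double));
    cudaMalloc(&d_out, n_sets * N_OUT * sizeof(double));
    // ... cudaMemcpy the input sets into d_in here ...

    int threads = 256;
    int blocks  = (n_sets + threads - 1) / threads;
    ProcessSets<<<blocks, threads>>>(d_in, d_out, n_sets);
    cudaDeviceSynchronize();

    // ... cudaMemcpy the results back; time the launch with cudaEvent ...
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With 5000 sets and 256 threads per block the whole workload is only 20 blocks, which is modest for a GPU, so timing it with cudaEvent (rather than guessing) is exactly the right move.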

12

Re: Calculation on GPU

Hello, watchmaker, you wrote: W> Take, for example, the simplest CUDA program from the documentation. Substitute the code of your iterative algorithm for VecAdd and measure the runtime on the 5000 parameter sets. Understood, thanks.