The hash value for any given GPTL region is invariant across threads, so per-thread copies of the handle are not needed. These functions can also be freely mixed with their GPTLstart() and GPTLstop() counterparts, as shown in the example below.
Though not done in handle.F90 below, GPTLinit_handle() can be called to obtain the handle before any call to GPTLstart_handle() or GPTLstop_handle(). This approach has the advantage that the overhead of generating the handle is removed even from the first call to GPTLstart_handle().
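As a minimal sketch of that alternative, the handle could be computed once before the threaded loop. The call signature gptlinit_handle('loop', handle1) is taken from the comment in handle.F90; the program name handle_init and the PRIVATE (ret) clause are illustrative additions and do not appear in the original example.

```fortran
program handle_init
  use gptl

  implicit none

  integer :: handle1   ! Hash index for region 'loop'
  integer :: n
  integer :: ret

  ret = gptlinitialize ()

! Compute the handle once, outside the threaded loop, so that no thread
! has to generate it on its first call to gptlstart_handle
  ret = gptlinit_handle ('loop', handle1)

! handle1 is shared because the hash value is the same for every thread.
! ret is made PRIVATE here (an addition, not in handle.F90) so that each
! thread writes to its own return-code variable.
!$OMP PARALLEL DO SHARED (handle1) PRIVATE (ret)
  do n=1,1000000
    ret = gptlstart_handle ('loop', handle1)
    ret = gptlstop_handle ('loop', handle1)
  end do
!$OMP END PARALLEL DO

  ret = gptlpr (0)
end program handle_init
```

With this arrangement there is no need to set the handle to zero first, since the handle already holds a valid value when the loop starts.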
handle.F90:
    program handle
      use gptl

      implicit none

      integer :: handle1   ! Hash index
      integer :: n
      integer :: ret

      ret = gptlinitialize ()
      ret = gptlstart ('total')  ! Time the entire code

    ! IMPORTANT: Start with a zero handle value so GPTLstart_handle knows to initialize
    ! Instead of setting handle1=0 here we could also do:
    ! ret=gptlinit_handle('loop', handle1)
    ! This latter approach is actually preferable to avoid one-time multiple threads
    ! computing the handle value inside the threaded loop.
      handle1 = 0
    !$OMP PARALLEL DO SHARED (handle1)
      do n=1,1000000
    ! First call the "_handle" versions of start and stop for the region
        ret = gptlstart_handle ('loop', handle1)
        ret = gptlstop_handle ('loop', handle1)
    ! Now call the standard start and stop functions for the same region
        ret = gptlstart ('loop')
        ret = gptlstop ('loop')
      end do

      ret = gptlstop ('total')  ! Time the entire code
      ret = gptlpr (0)
      stop
    end program handle

Now compile and run:

    % gfortran -fopenmp -o handle handle.F90 -I${GPTL}/include -L${GPTL}/lib -lgptlf -lgptl
    % ./handle

Here's the important output from the timing.0 file that got created by the call to gptlpr(0):

    Total overhead of 1 GPTL start or GPTLstop call=1.08e-07 seconds
    Components are as follows:
    Fortran layer:             2.0e-09 =   1.9% of total
    Get thread number:         2.0e-08 =  18.5% of total
    Generate hash index:       3.1e-08 =  28.7% of total
    Find hashtable entry:      2.2e-08 =  20.4% of total
    Underlying timing routine: 3.3e-08 =  30.6% of total
    ...
    Stats for thread 0:
               Called  Recurse  Wallclock       max       min   self_OH  parent_OH
      total         1     -         0.159     0.159     0.159     0.000      0.000
        loop   500000     -         0.045  1.40e-05  0.00e+00     0.018      0.091
    Overhead sum =     0.108 wallclock seconds
    Total calls  = 500001
    ...
    Same stats sorted by timer for threaded regions:
    Thd        Called  Recurse  Wallclock       max       min   self_OH  parent_OH
    000 loop   500000     -         0.045  1.40e-05  0.00e+00     0.018      0.091
    001 loop   500000     -         0.046  2.50e-05  0.00e+00     0.018      0.091
    002 loop   500000     -         0.048  8.60e-05  0.00e+00     0.018      0.091
    003 loop   500000     -         0.049  3.30e-03  0.00e+00     0.018      0.091
    SUM loop  2.0e+06     -         0.189  3.30e-03  0.00e+00     0.070      0.362
It is worth noting that the reported overhead assumes that only GPTLstart() and GPTLstop() were called. This estimate can be refined for this example. Generating the hash index accounts for 28.7% of the per-call overhead, and half of the start/stop calls used the "_handle" GPTL routines, which don't need to generate hash indices. Multiplying 28.7% by 0.5 and subtracting that fraction from the reported 0.108 seconds of overhead gives a new overhead estimate of roughly 0.092 seconds.
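The arithmetic behind this refinement can be written out as a short stand-alone program. This is only an illustrative sketch: the program and variable names are made up here, and the three input values are the rounded figures copied from the timing.0 output above. With those rounded inputs the result is about 0.0925 seconds, in line with the estimate quoted above.

```fortran
program refine_overhead
  implicit none

  ! Rounded values copied from the timing.0 output
  real, parameter :: reported_overhead = 0.108  ! "Overhead sum" for thread 0, in seconds
  real, parameter :: hash_fraction     = 0.287  ! Share of overhead spent generating the hash index
  real, parameter :: handle_fraction   = 0.5    ! Share of start/stop calls that used the "_handle" routines

  real :: refined_overhead

  ! Remove the hash-generation cost for the calls that used a handle
  refined_overhead = reported_overhead * (1.0 - handle_fraction * hash_fraction)

  print '(a,f7.4,a)', 'Refined overhead estimate: ', refined_overhead, ' seconds'
end program refine_overhead
```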
Note that the reported overhead is very high relative to the cost of the work being timed. This is understandable, since no real work is done between the GPTL "start" and "stop" calls.