simpleomp.f90:
program simpleomp
  use gptl

  implicit none
  integer :: ret, iter
  integer, parameter :: nompiter = 2          ! Number of OMP threads

  ret = gptlsetoption (gptlabort_on_error, 1) ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)       ! Turn off overhead estimate
  ret = gptlinitialize ()                     ! Initialize GPTL
  ret = gptlstart ('total')                   ! Start a timer

!$OMP PARALLEL DO PRIVATE (iter)              ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')                     ! Start a timer
    ret = gptlstart ('B')                     ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)                         ! Sleep for "iter" seconds
    ret = gptlstop ('C')                      ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('B')
    ret = gptlstop ('A')
  end do
  ret = gptlstop ('total')
  ret = gptlpr (0)                            ! Print stats
  ret = gptlfinalize ()                       ! Clean up
end program simpleomp

Compile and link, then run.

% gfortran -fopenmp simpleomp.f90 -I${GPTL}/include -L${GPTL}/lib -lgptlf -lgptl
% env OMP_NUM_THREADS=2 ./a.out

The call to gptlpr(0) wrote a file named timing.0, which looks like this:
Stats for thread 0:
            Called Recurse     Wall      max      min
  total          1    -       2.000    2.000    2.000
    A            1    -       1.000    1.000    1.000
      B          1    -       1.000    1.000    1.000
        C        1    -       1.000    1.000    1.000
        CC       1    -    0.00e+00 0.00e+00 0.00e+00
Overhead sum = 1.8e-06 wallclock seconds
Total calls = 5

Stats for thread 1:
            Called Recurse     Wall      max      min
  A              1    -       2.000    2.000    2.000
    B            1    -       2.000    2.000    2.000
      C          1    -       2.000    2.000    2.000
      CC         1    -    0.00e+00 0.00e+00 0.00e+00
Overhead sum = 1.44e-06 wallclock seconds
Total calls = 4

Same stats sorted by timer for threaded regions:
Thd         Called Recurse     Wall      max      min
000 A            1    -       1.000    1.000    1.000
001 A            1    -       2.000    2.000    2.000
SUM A            2    -       3.000    2.000    1.000

000 B            1    -       1.000    1.000    1.000
001 B            1    -       2.000    2.000    2.000
SUM B            2    -       3.000    2.000    1.000

000 C            1    -       1.000    1.000    1.000
001 C            1    -       2.000    2.000    2.000
SUM C            2    -       3.000    2.000    1.000

000 CC           1    -    0.00e+00 0.00e+00 0.00e+00
001 CC           1    -    0.00e+00 0.00e+00 0.00e+00
SUM CC           2    -    0.00e+00 0.00e+00 0.00e+00
Reading across the output from left to right, the first column is the region name, and the next column is labelled "Called". This is the number of times the region was invoked. If any regions were called recursively, that information is printed next, under "Recurse". In this case there were no recursive calls, so just a "-" is printed. Total wallclock time for each region is printed next, under "Wall", followed by the max and min values for any single invocation. In this simple example each region was called only once, so "Wall", "max", and "min" are all the same.
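To make the column meanings concrete, here is a minimal sketch that derives the four values from a list of per-invocation times for a single region. It does not use GPTL and is not how GPTL computes its statistics internally; the program name and the three sample times are hypothetical, chosen so that the total, max, and min differ.

```fortran
program show_columns
  implicit none
  ! Hypothetical wallclock times (seconds) for three invocations of one region.
  real, parameter :: times(3) = [0.5, 1.5, 1.0]

  print '(a,i0)',   'Called = ', size(times)    ! number of invocations
  print '(a,f6.3)', 'Wall   = ', sum(times)     ! total time over all invocations
  print '(a,f6.3)', 'max    = ', maxval(times)  ! longest single invocation
  print '(a,f6.3)', 'min    = ', minval(times)  ! shortest single invocation
end program show_columns
```

With these sample times the region would show 3 calls, a total of 3.0 seconds, a max of 1.5, and a min of 0.5. With a single invocation, as in simpleomp.f90, all three time values coincide.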
Because this threaded code was run with OMP_NUM_THREADS=2, statistics for the second thread are also printed, starting at "Stats for thread 1:". The output shows that thread 1 participated in the computations for regions "A", "B", "C", and "CC", but not "total". This matches the code itself, since only the master thread was active when the start and stop calls were made for region "total".
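The reason "total" shows up only under thread 0 can be seen without GPTL. The following sketch, with a hypothetical program name, prints which OpenMP thread executes the serial part and which executes each loop iteration. It assumes compilation with -fopenmp, as in the gfortran command above.

```fortran
program who_runs_what
  use omp_lib
  implicit none
  integer :: iter

  ! Outside a parallel region only the initial (master) thread runs,
  ! so a timer started here would be recorded for thread 0 alone.
  print '(a,i0)', 'serial part runs on thread ', omp_get_thread_num()

  !$omp parallel do
  do iter = 1, 2
    ! Inside the loop the iterations are divided among the team.
    ! Which thread gets which iteration depends on the schedule in effect.
    print '(a,i0,a,i0)', 'iteration ', iter, ' runs on thread ', omp_get_thread_num()
  end do
  !$omp end parallel do
end program who_runs_what
```

The serial line always reports thread 0. The thread numbers reported inside the loop, and the order of those lines, can vary between runs and OpenMP implementations.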
If more than one thread was active, the same information is repeated after the per-thread statistics section, sorted by region name. This section is delimited by the string "Same stats sorted by timer for threaded regions:". Grouping the entries by region makes it easier to inspect load balance across threads. The leftmost column is the thread number, and the region names are not indented. A sum across threads for each region is also printed, and labelled "SUM".
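As an illustration of how a "SUM" row relates to the per-thread rows, the sketch below recomputes the row for region "A" from the two thread entries in the output above. The combination rule (add the call counts and wallclock times, take the max and min over threads) is inferred from the SUM rows shown, not taken from GPTL source. The max/mean ratio at the end is a hypothetical load-balance metric that GPTL does not print.

```fortran
program sum_across_threads
  implicit none
  ! Per-thread values for region "A", copied from the timing output
  ! (index 1 is thread 0, index 2 is thread 1).
  integer, parameter :: called(2) = [1, 1]
  real, parameter :: wall(2) = [1.0, 2.0]
  real, parameter :: tmax(2) = [1.0, 2.0]
  real, parameter :: tmin(2) = [1.0, 2.0]
  real :: imbalance

  ! SUM row: counts and wallclock are added, max and min are taken over threads.
  print '(a,i0,3f9.3)', 'SUM A  ', sum(called), sum(wall), maxval(tmax), minval(tmin)

  ! Hypothetical metric: slowest thread divided by the mean over threads.
  ! A value of 1.0 would mean perfectly balanced work.
  imbalance = maxval(wall) / (sum(wall) / size(wall))
  print '(a,f6.3)', 'max/mean = ', imbalance
end program sum_across_threads
```

The first line reproduces the values of the "SUM A" row: 2 calls, 3.000 total, 2.000 max, 1.000 min. The ratio comes out at about 1.333, reflecting that thread 1 slept twice as long as thread 0.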