GPTL usage example 1

Example 1: Manual instrumentation of a simple OpenMP Fortran code

This is an OpenMP Fortran code manually instrumented with GPTL calls. The output produced by the embedded call to gptlpr() is shown and explained.

simpleomp.f90:


program simpleomp
  use gptl
  implicit none
  integer :: ret, iter
  integer, parameter :: nompiter = 2 ! Number of loop iterations (one per OMP thread)

  ret = gptlsetoption (gptlabort_on_error, 1) ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)       ! Turn off overhead estimate
  ret = gptlinitialize ()                     ! Initialize GPTL
  ret = gptlstart ('total')                   ! Start a timer

!$OMP PARALLEL DO PRIVATE (iter)   ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')          ! Start a timer
    ret = gptlstart ('B')          ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)              ! Sleep for "iter" seconds
    ret = gptlstop ('C')           ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('B')         
    ret = gptlstop ('A')
  end do
  ret = gptlstop ('total')
  ret = gptlpr (0)                 ! Print stats
  ret = gptlfinalize ()            ! Clean up
end program simpleomp

Compile and link, then run.

% gfortran -fopenmp simpleomp.f90 -I${GPTL}/include -L${GPTL}/lib -lgptlf -lgptl
% env OMP_NUM_THREADS=2 ./a.out

The call to gptlpr(0) wrote a file named timing.0, which looks like this:


Stats for thread 0:
             Called  Recurse     Wall      max      min
  total           1     -       2.000    2.000    2.000
    A             1     -       1.000    1.000    1.000
      B           1     -       1.000    1.000    1.000
        C         1     -       1.000    1.000    1.000
        CC        1     -    0.00e+00 0.00e+00 0.00e+00
Overhead sum =   1.8e-06 wallclock seconds
Total calls  = 5

Stats for thread 1:
            Called  Recurse     Wall      max      min
  A              1     -       2.000    2.000    2.000
    B            1     -       2.000    2.000    2.000
      C          1     -       2.000    2.000    2.000
      CC         1     -    0.00e+00 0.00e+00 0.00e+00
Overhead sum =  1.44e-06 wallclock seconds
Total calls  = 4

Same stats sorted by timer for threaded regions:
Thd      Called  Recurse     Wall      max      min
000 A         1     -       1.000    1.000    1.000
001 A         1     -       2.000    2.000    2.000
SUM A         2     -       3.000    2.000    1.000

000 B         1     -       1.000    1.000    1.000
001 B         1     -       2.000    2.000    2.000
SUM B         2     -       3.000    2.000    1.000

000 C         1     -       1.000    1.000    1.000
001 C         1     -       2.000    2.000    2.000
SUM C         2     -       3.000    2.000    1.000

000 CC        1     -    0.00e+00 0.00e+00 0.00e+00
001 CC        1     -    0.00e+00 0.00e+00 0.00e+00
SUM CC        2     -    0.00e+00 0.00e+00 0.00e+00

Explanation of the above output

The output file shown here omits preamble information such as how the GPTL library was built and the name of the underlying timing routine. The statistics themselves begin with the line "Stats for thread 0:". Region names are listed on the far left. A "region" is defined in the application by calling gptlstart() and then gptlstop() with the same character string argument. Indentation of the names preserves parent-child relationships between the regions. In the example, region "A" is contained in "total", "B" is contained in "A", and regions "C" and "CC" are both contained in "B".

Reading across the output from left to right, the next column is labelled "Called". This is the number of times the region was invoked. The "Recurse" column reports recursive calls, if any; in this case there were none, so just a "-" is printed. Total wallclock time for each region is printed next, followed by the max and min values for any single invocation. In this simple example each region was called only once, so "Wall", "max", and "min" are all the same.
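To see the wallclock, max, and min columns differ, a region has to be invoked more than once. The following is a minimal sketch that reuses only the GPTL calls from simpleomp.f90; the program name and the region name 'work' are hypothetical, and it assumes the same build and link setup as the example above.

```fortran
program repeated
  use gptl
  implicit none
  integer :: ret, n

  ret = gptlinitialize ()          ! Initialize GPTL
  do n = 1, 2
    ret = gptlstart ('work')       ! Hypothetical region name
    call sleep (n)                 ! Sleep 1 second, then 2 seconds
    ret = gptlstop ('work')
  end do
  ret = gptlpr (0)                 ! Write stats to timing.0
  ret = gptlfinalize ()            ! Clean up
end program repeated
```

Under these assumptions, "Called" for 'work' should be 2, with a total wallclock time of roughly 3 seconds, a max of roughly 2 seconds, and a min of roughly 1 second.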

Since this was a threaded code run with OMP_NUM_THREADS=2, statistics for the second thread are also printed, starting at "Stats for thread 1:". The output shows that thread 1 participated in the computations for regions "A", "B", "C", and "CC", but not "total". This matches the code, since only the master thread was active when the start and stop calls were made for region "total".

If more than one thread was active, the same information is repeated after the per-thread statistics, sorted by region name. This section begins with the string "Same stats sorted by timer for threaded regions:". This presentation order makes it easier to inspect load balance across threads. The leftmost column is the thread number, and the region names are not indented. A sum across threads for each region is also printed, labeled "SUM".
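As a worked check of the "SUM" rows, the following self-contained sketch recomputes the sum, max, and min for region "A" from the per-thread wall times in the output above (1.000 and 2.000 seconds). It does not call GPTL. The program name and the max/mean ratio are illustrative additions, not something GPTL prints; a ratio of 1.0 would indicate perfectly balanced threads.

```fortran
program loadbalance
  implicit none
  ! Wall times (seconds) for region "A" on threads 0 and 1, copied from timing.0
  real, parameter :: wall(2) = [1.000, 2.000]
  real :: total, mean

  total = sum (wall)                 ! Corresponds to the Wall column of "SUM A"
  mean  = total / size (wall)        ! Average time per thread

  print '(a,f6.3)', 'sum      = ', total
  print '(a,f6.3)', 'max      = ', maxval (wall)
  print '(a,f6.3)', 'min      = ', minval (wall)
  print '(a,f6.3)', 'max/mean = ', maxval (wall) / mean  ! Assumed imbalance measure
end program loadbalance
```

For region "A" this gives a sum of 3.000, a max of 2.000, and a min of 1.000, matching the "SUM A" row, and a max/mean ratio of about 1.333.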
