Example 2 Back to GPTL home page

Example 1: Manual instrumentation of a simple OpenMP Fortran code

This is an OpenMP Fortran code manually instrumented with GPTL calls. The output produced by the embedded call to gptlpr() is shown and explained.

simpleomp.f90:

program simpleomp
  use gptl

  implicit none
  integer :: ret, iter
  integer, parameter :: nompiter = 2          ! Number of OMP threads

  ret = gptlsetoption (gptlabort_on_error, 1) ! Abort on GPTL error
  ret = gptlsetoption (gptloverhead, 0)       ! Turn off overhead estimate
  ret = gptlinitialize ()                     ! Initialize GPTL
  ret = gptlstart ('total')                   ! Start a timer

  ! ret is made private so the threads do not race on the return code
!$OMP PARALLEL DO PRIVATE (iter, ret)         ! Threaded loop
  do iter=1,nompiter
    ret = gptlstart ('A')                     ! Start a timer
    ret = gptlstart ('B')                     ! Start another timer
    ret = gptlstart ('C')
    call sleep (iter)                         ! Sleep for "iter" seconds
    ret = gptlstop ('C')                      ! Stop a timer
    ret = gptlstart ('CC')
    ret = gptlstop ('CC')
    ret = gptlstop ('B')
    ret = gptlstop ('A')
  end do
!$OMP END PARALLEL DO

  ret = gptlstop ('total')
  ret = gptlpr (0)                            ! Print stats
  ret = gptlfinalize ()                       ! Clean up
end program simpleomp

Compile and link, then run.
% gfortran -fopenmp simpleomp.f90 -I${GPTL}/include -L${GPTL}/lib -lgptlf -lgptl
% env OMP_NUM_THREADS=2 ./a.out
The call to gptlpr(0) wrote a file named timing.0, which looks like this:

Stats for thread 0:
              Called  Recurse      Wall       max       min
  total            1     -        2.000     2.000     2.000
    A              1     -        1.000     1.000     1.000
      B            1     -        1.000     1.000     1.000
        C          1     -        1.000     1.000     1.000
        CC         1     -     0.00e+00  0.00e+00  0.00e+00
Overhead sum = 1.8e-06 wallclock seconds
Total calls = 5

Stats for thread 1:
              Called  Recurse      Wall       max       min
  A                1     -        2.000     2.000     2.000
    B              1     -        2.000     2.000     2.000
      C            1     -        2.000     2.000     2.000
      CC           1     -     0.00e+00  0.00e+00  0.00e+00
Overhead sum = 1.44e-06 wallclock seconds
Total calls = 4

Same stats sorted by timer for threaded regions:
Thd           Called  Recurse      Wall       max       min
000 A              1     -        1.000     1.000     1.000
001 A              1     -        2.000     2.000     2.000
SUM A              2     -        3.000     2.000     1.000

000 B              1     -        1.000     1.000     1.000
001 B              1     -        2.000     2.000     2.000
SUM B              2     -        3.000     2.000     1.000

000 C              1     -        1.000     1.000     1.000
001 C              1     -        2.000     2.000     2.000
SUM C              2     -        3.000     2.000     1.000

000 CC             1     -     0.00e+00  0.00e+00  0.00e+00
001 CC             1     -     0.00e+00  0.00e+00  0.00e+00
SUM CC             2     -     0.00e+00  0.00e+00  0.00e+00

Explanation of the above output

The output file displayed here leaves out preamble information such as how the GPTL library was built and the name of the underlying timing routine. The statistics themselves begin with the line that reads "Stats for thread 0:". The region names are listed on the far left. A "region" is defined in the application by calling GPTLstart() and then GPTLstop() with the same character string argument. Indentation of the names preserves the parent-child relationships between regions. In the example, region "A" is contained in "total", "B" is contained in "A", and regions "C" and "CC" are both contained in "B".

Reading across the output from left to right, the next column is labelled "Called". This is the number of times the region was invoked. The "Recurse" column reports recursive calls to a region. In this case there were none, so just a "-" is printed. The total wallclock time for each region comes next under "Wall", followed by the max and min values for any single invocation. In this simple example each region was called only once, so "Wall", "max", and "min" are all the same.
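
To see "Called", "Wall", "max", and "min" diverge, a region has to be invoked more than once. The following is a minimal sketch, not part of the GPTL distribution: the program name multicall and the region names 'outer' and 'work' are made up for illustration, and it uses only the GPTL calls that already appear in simpleomp.f90. It assumes the same compile and link line as above, and it relies on sleep, which is a GNU Fortran extension.

```fortran
program multicall
  use gptl
  implicit none
  integer :: ret, n

  ret = gptlinitialize ()      ! Initialize GPTL
  ret = gptlstart ('outer')    ! Parent region (hypothetical name)
  do n = 1, 3
    ret = gptlstart ('work')   ! Child region (hypothetical name), invoked three times
    call sleep (n)             ! Sleep 1, then 2, then 3 seconds (GNU extension)
    ret = gptlstop ('work')
  end do
  ret = gptlstop ('outer')
  ret = gptlpr (0)             ! Write timing.0
  ret = gptlfinalize ()        ! Clean up
end program multicall
```

In the resulting timing.0, one would expect "work" to be indented under "outer" with "Called" equal to 3, a "Wall" value near 6 seconds, a "max" near 3 seconds, and a "min" near 1 second. Exact figures depend on the system.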

Since this was a threaded code run with OMP_NUM_THREADS=2, statistics for the second thread are also printed, starting at "Stats for thread 1:". The output shows that thread 1 participated in the computations for regions "A", "B", "C", and "CC", but not "total". This matches the code: only the master thread was active when the start and stop calls were made for region "total", because those calls lie outside the threaded loop.
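
The split between master-only and threaded code can be checked without GPTL. The sketch below is an illustration only (the program name whichthread is hypothetical); it uses the standard OpenMP routine omp_get_thread_num() to report which thread executes each part of a loop shaped like the one in simpleomp.f90.

```fortran
program whichthread
  use omp_lib
  implicit none
  integer :: iter

  ! Outside any parallel region only the master thread (thread 0) runs.
  ! This is where 'total' is started and stopped in simpleomp.f90.
  print *, 'outside parallel region: thread', omp_get_thread_num ()

!$OMP PARALLEL DO PRIVATE (iter)
  do iter = 1, 2
    ! Inside the loop, each iteration reports the thread that executes it.
    print *, 'iteration', iter, 'on thread', omp_get_thread_num ()
  end do
!$OMP END PARALLEL DO
end program whichthread
```

Built with gfortran -fopenmp and run with OMP_NUM_THREADS=2, the first line should always report thread 0. How iterations map to threads depends on the loop schedule, which is implementation defined. The timings above (1 second on thread 0, 2 seconds on thread 1) are consistent with iteration 1 running on thread 0 and iteration 2 on thread 1.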

If more than one thread was active, the same information is repeated after the per-thread statistics, this time sorted by region name. This section begins with the string "Same stats sorted by timer for threaded regions:". Grouping the threads under each region makes it easier to check load balance across threads. The leftmost column is the thread number, and the region names are not indented. A sum across threads for each region is also printed, labeled "SUM".
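
As a worked check of the "SUM" row, the short program below reproduces the numbers for region "A" from the per-thread wallclock values shown in the output. It is a sketch for illustration and does not call GPTL. The program name sumrow and the max/min ratio are assumptions added here; the ratio is one simple way to gauge imbalance, not a quantity GPTL prints.

```fortran
program sumrow
  implicit none
  ! Wallclock seconds for region "A" on threads 0 and 1, copied from the output
  real, parameter :: wall(0:1) = [1.0, 2.0]

  ! The SUM row adds the per-thread totals. With one call per thread,
  ! max and min over threads match the max and min single invocations.
  print '(a,f6.3)', 'sum     = ', sum (wall)
  print '(a,f6.3)', 'max     = ', maxval (wall)
  print '(a,f6.3)', 'min     = ', minval (wall)

  ! A ratio well above 1 hints at load imbalance (illustrative metric only)
  print '(a,f6.3)', 'max/min = ', maxval (wall) / minval (wall)
end program sumrow
```

The printed sum, max, and min are 3.000, 2.000, and 1.000, matching the "SUM A" row. The ratio of 2.000 reflects that thread 1 spent twice as long in the region as thread 0.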

