Automatic instrumentation of a number of MPI routines is also possible, utilizing the PMPI profiling layer provided by most MPI distributions. In this case no special compiler flags are necessary, and profiles are obtained with zero changes to application source files. See Example 6 for further details.
Here is a portion of GPTL printout after running the HPCC benchmark with compiler-based automatic instrumentation enabled:
  Stats for thread 0:
                                Called  Recurse Wallclock   max       min     FP_OPS    e6_/_sec   CI
    total                            1     -      64.021    64.021    64.021  3.50e+08      5.47  7.20e-02
      HPCC_Init                     11    10       0.157     0.157     0.000     95799      0.61  8.90e-02
  *     HPL_pdinfo                 120   118       0.019     0.018     0.000     96996      4.99  8.56e-02
  *       HPL_all_reduce             7     -       0.043     0.036     0.000       448      0.01  1.03e-02
  *         HPL_broadcast           21     -       0.041     0.036     0.000       126      0.00  6.72e-03
            HPL_pdlamch              2     -       0.004     0.004     0.000     94248     21.21  1.13e-01
  *       HPL_fprintf              240   120       0.001     0.000     0.000      1200      0.93  6.67e-03
          HPCC_InputFileInit        41    40       0.001     0.001     0.000       194      0.27  8.45e-03
            ReadInts                 2     -       0.000     0.000     0.000        12      3.00  1.61e-02
      PTRANS                        21    20      22.667    22.667     0.000  4.19e+07      1.85  3.19e-02
        MaxMem                       5     4       0.000     0.000     0.000       796      2.70  1.79e-02
  *     iceil_                     132     -       0.000     0.000     0.000       792      2.88  1.75e-02
  *     ilcm_                       14     -       0.000     0.000     0.000        84      2.71  1.71e-02
        param_dump                  18    12       0.000     0.000     0.000        84      0.82  7.05e-03
        Cblacs_get                   5     -       0.000     0.000     0.000        30      1.43  1.67e-02
        Cblacs_gridmap              35    30       0.005     0.001     0.000       225      0.05  1.79e-03
  *       Cblacs_pinfo               7     1       0.000     0.000     0.000        40      3.08  1.54e-02
  *     Cblacs_gridinfo             60    50       0.000     0.000     0.000       260      2.28  2.10e-02
        Cigsum2d                     5     -       0.088     0.047     0.000       165      0.00  6.37e-03
        pdmatgen                    20     -      21.497     1.213     0.942  4.00e+07      1.86  3.08e-02
  *       numroc_                   96     -       0.000     0.000     0.000       576      2.87  1.69e-02
  *       setran_                   25     -       0.000     0.000     0.000       150      2.94  1.72e-02
  *       pdrand               3.7e+06 2e+06      15.509     0.041     0.000  1.72e+07      1.11  2.24e-02
            xjumpm_              57506 57326       0.219     0.030     0.000    230384      1.05  2.66e-02
            jumpit_              60180 40120       0.214     0.021     0.000    280840      1.32  2.18e-02
        slboot_                      5     -       0.000     0.000     0.000        30      1.30  1.01e-02
        Cblacs_barrier              10     5       0.481     0.167     0.000        50      0.00  3.26e-03
        sltimer_                    10     -       0.000     0.000     0.000       614      3.05  1.90e-02
  *       dwalltime00               15     -       0.000     0.000     0.000       150      2.54  2.57e-02
  *       dcputime00                15     -       0.000     0.000     0.000       373      3.06  1.91e-02
  *         HPL_ptimer_cputime      17     -       0.000     0.000     0.000       170      2.66  2.29e-02
        pdtrans                     14     9       0.124     0.045     0.000    573505      4.61  1.36e-01
          Cblacs_dSendrecv          12     8       0.115     0.042     0.000        56      0.00  2.24e-03
        pdmatcmp                     5     -       0.448     0.295     0.003  1.29e+06      2.87  2.94e-01
  *       HPL_daxpy               2596     -       0.008     0.000     0.000  1.34e+06    177.06  4.40e-01
  *       HPL_idamax              2966     -       0.007     0.000     0.000    767291    104.75  4.15e-01
  ...

Function names on the left of the output are indented to indicate their parent and depth in the call tree. An asterisk next to an entry means it has more than one parent (see Example 2 for further details). Other entries in this output show the number of invocations, number of recursive invocations, wallclock timing statistics, and PAPI-based information. In this example, HPL_daxpy produced 1.34e6 floating-point operations, ran at 177.06 MFlops/sec, and had a computational intensity (floating-point operations per memory reference) of 0.44.
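The derived columns follow directly from the raw counters: e6_/_sec is FP_OPS divided by the wallclock time, expressed in millions, and CI is FP_OPS divided by the number of memory references (loads plus stores). A minimal sketch of that arithmetic (the helper names are hypothetical, not part of the GPTL API):

```c
#include <assert.h>
#include <math.h>

/* e6_/_sec column: floating-point operations per second, in millions */
double mflops_per_sec(double fp_ops, double wallclock_sec)
{
    return fp_ops / wallclock_sec / 1.0e6;
}

/* CI column: floating-point operations per memory reference */
double computational_intensity(double fp_ops, double loads, double stores)
{
    return fp_ops / (loads + stores);
}
```

For instance, the "total" row above does 3.50e8 floating-point operations in 64.021 seconds, which works out to about 5.47 MFlops/sec, matching the printed value.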
If the PAPI library is installed on the target platform, GPTL can be used to access all available PAPI events. To count single-precision floating point operations for example, one need only add a call that looks like:
ret = GPTLsetoption (PAPI_SP_OPS, 1);

The second argument "1" in the above call means "enable". Any non-zero integer means "enable", and zero means "disable". Multiple GPTL or PAPI options can be specified with additional calls to GPTLsetoption(). The man pages provided with the distribution describe the full API specification. The interface is identical for Fortran and C/C++ codes, except that the Fortran names are case-insensitive.
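A typical call sequence might look like the following sketch. It assumes GPTL was built with PAPI support and the program is linked against both libraries; the region name "compute" is purely illustrative:

    #include <papi.h>   /* defines PAPI_SP_OPS */
    #include <gptl.h>

    int main ()
    {
      GPTLsetoption (PAPI_SP_OPS, 1);  /* enable counting of SP FP ops */
      GPTLinitialize ();               /* set options before initializing */

      GPTLstart ("compute");
      /* ... work to be profiled ... */
      GPTLstop ("compute");

      GPTLpr (0);                      /* write the timing output file */
      GPTLfinalize ();
      return 0;
    }

Note that calls to GPTLsetoption() must precede the call to GPTLinitialize().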
Calls to GPTLstart() and GPTLstop() can be nested to an arbitrary depth. As shown above, GPTL handles nested regions by presenting output in an indented fashion. The example also shows how auto-instrumentation can be used to easily produce a dynamic call tree of the application being profiled, where region names correspond to function entry and exit points.
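The indentation logic can be pictured with a toy sketch. This is only an illustration of the idea, not GPTL's actual implementation: each start call records the region name indented by the current nesting depth, and each stop call pops one level, so the recorded trace reads as a call tree.

```c
#include <string.h>
#include <assert.h>

/* Toy nested-region tracker (hypothetical, for illustration only) */
static char trace[1024];  /* accumulated, indented region names */
static int  depth = 0;    /* current nesting level */

static void region_start(const char *name)
{
    for (int i = 0; i < depth; i++)
        strcat(trace, "  ");          /* two spaces per nesting level */
    strcat(trace, name);
    strcat(trace, "\n");
    depth++;
}

static void region_stop(const char *name)
{
    (void) name;                      /* name unused in this toy version */
    depth--;
}
```

Starting "total", then "init" and "solve" inside it, yields the indented listing "total", "  init", "  solve", mirroring the parent/child layout in the printout above.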
Example 1 is a manually-instrumented threaded Fortran code.
Example 2 is a C code compiled with gcc's auto-instrumentation hooks to print a dynamic call tree.
Example 3 demonstrates the use of GPTLpr_summary() to obtain a statistical summary of timings across OpenMP threads and MPI tasks.
Example 4 is an auto-instrumented C++ code. Issues related to inline constructors are illustrated.
Example 5 is a Fortran code which uses gptlprocess_namelist() and an associated namelist file to set GPTL options.
Example 6 is a Fortran code which utilizes the ENABLE_PMPI option to automatically time various MPI calls and print the average number of bytes transferred.
Example 7 is a Fortran code which utilizes the functions GPTLstart_handle() and GPTLstop_handle(), which avoid much of the table lookup overhead of their siblings GPTLstart() and GPTLstop().
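The handle idea can be pictured with a toy sketch (hypothetical code, not GPTL's implementation): the first call resolves the region name to a table slot and caches the index in the caller-supplied handle, so subsequent calls index the table directly and skip the string search.

```c
#include <string.h>
#include <assert.h>

#define MAXTIMERS 16
static const char *names[MAXTIMERS];  /* registered region names */
static long counts[MAXTIMERS];        /* per-region invocation counts */
static int  ntimers = 0;

/* Initialize *handle to -1 before the first call for a given region. */
static int start_handle(const char *name, int *handle)
{
    if (*handle < 0) {                /* first call: search by name */
        int i;
        for (i = 0; i < ntimers; i++)
            if (strcmp(names[i], name) == 0)
                break;
        if (i == ntimers)             /* not found: register new region */
            names[ntimers++] = name;
        *handle = i;                  /* cache the index for later calls */
    }
    counts[*handle]++;                /* fast path: direct indexing */
    return 0;
}
```

After the first invocation the name lookup never runs again, which is what makes the handle variants cheaper in tight loops.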
Example 8 is a C code which employs GPTL's ability to report memory usage of the code being profiled. Memory usage is checked on both manually instrumented and auto-instrumented calls to the start and stop routines, so the name of the routine responsible for memory growth is included in the printout.