APP Profiler Settings
This page allows you to configure some general profiler settings.
Setting | Description |
Always delete session files | The profiler will automatically delete session files when a solution is closed |
Never delete session files | The profiler will not delete session files when a solution is closed |
Ask user every time | The profiler will display a prompt when a solution is closed, asking the user if session files should be deleted |
If the option is set to Always delete session files or Ask user every time, you can also enable Show details of deletion. When enabled, the profiler displays a dialog listing all the files and directories that were deleted, as well as any errors that occurred while deleting them.
This page allows you to select the counters to capture for the next profile session.
Below is a list of the available counters with a brief description of each. The exact counters shown depend on the type of GPU installed on the system.
The supported counters on AMD Radeon™ HD 6000 series graphics cards or older:
Name | Description |
Wavefronts | The total number of wavefronts |
ALUInsts | The average number of ALU instructions executed per work-item (affected by flow control). |
FetchInsts | The average number of Fetch instructions from the video memory executed per work-item (affected by flow control). |
WriteInsts | The average number of Write instructions to the video memory executed per work-item (affected by flow control). |
ALUBusy | The percentage of GPUTime during which ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
ALUFetchRatio | The ratio of ALU to Fetch instructions. If the number of Fetch instructions is zero, then one will be used instead. |
ALUPacking | The ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor. |
FetchSize | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
CacheHit | The percentage of fetches that hit the data cache. Value range: 0% (no hit) to 100% (optimal). |
FetchUnitBusy | The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
FetchUnitStalled | The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
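The zero-fetch rule in the ALUFetchRatio description can be illustrated with a short sketch. The function name `alu_fetch_ratio` and the sample values are hypothetical; this only mirrors the rule as described in the table and is not the profiler's own code.

```python
def alu_fetch_ratio(alu_insts: float, fetch_insts: float) -> float:
    """Ratio of ALU to Fetch instructions.

    Assumption: follows the table's wording, so a Fetch count of zero
    is replaced by one to avoid dividing by zero.
    """
    divisor = fetch_insts if fetch_insts != 0 else 1
    return alu_insts / divisor


# Hypothetical per-work-item averages, for illustration only.
print(alu_fetch_ratio(120.0, 8.0))  # 15.0
print(alu_fetch_ratio(120.0, 0.0))  # 120.0
```

With no fetches at all, the ratio simply equals the ALU instruction count.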
Additional performance counters for AMD Radeon™ HD 5000 or for AMD Radeon™ HD 6000 series graphics cards:
Name | Description |
FastPath | The total kilobytes written to the video memory through the FastPath which only supports basic operations: no atomics or sub-32 bit types. This is an optimized path in the hardware. |
CompletePath | The total kilobytes written to the video memory through the CompletePath which supports atomics and sub-32 bit types (byte, short). This number includes bytes for load, store and atomics operations on the buffer. This number may indicate a big performance impact (higher number equals lower performance). If possible, remove the usage of this Path by moving atomics to the local memory or partition the kernel. |
PathUtilization | The percentage of bytes written through the FastPath or CompletePath compared to the total number of bytes transferred over the bus. To increase the path utilization, use the FastPath. Value range: 0% (bad) to 100% (optimal). |
LDSFetchInsts | The average number of Fetch instructions from the LDS executed per work-item (affected by flow control). This counter is a subset of the ALUInsts counter. |
LDSWriteInsts | The average number of Write instructions to the LDS executed per work-item (affected by flow control). This counter is a subset of the ALUInsts counter. |
LDSBankConflict | The percentage of GPUTime the LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
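As a rough reading of the PathUtilization description, the percentage could be computed as sketched below. The function name, the `bus_kb` parameter, and the zero-traffic behavior are assumptions made for illustration; the profiler's exact formula is not given on this page.

```python
def path_utilization(fast_path_kb: float, complete_path_kb: float, bus_kb: float) -> float:
    """Percentage of bus traffic accounted for by FastPath and CompletePath writes.

    Assumptions: all three inputs are in kilobytes, and bus_kb (total
    kilobytes transferred over the bus) is a hypothetical input that is
    not listed as a separate counter on this page.
    """
    if bus_kb <= 0:
        return 0.0  # assumed convention when there is no bus traffic
    return min(100.0, 100.0 * (fast_path_kb + complete_path_kb) / bus_kb)


# Hypothetical values: 300 KB of useful writes out of 400 KB on the bus.
print(path_utilization(300.0, 0.0, 400.0))  # 75.0
```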
The full set of counters for AMD Radeon™ HD 7000 series GPU devices (based on Graphics Core Next Architecture/Southern Islands) or newer:
Name | Description |
Wavefronts | Total wavefronts. |
VALUInsts | The average number of vector ALU instructions executed per work-item (affected by flow control). |
SALUInsts | The average number of scalar ALU instructions executed per work-item (affected by flow control). |
VFetchInsts | The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control). |
SFetchInsts | The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control). |
VWriteInsts | The average number of vector write instructions to the video memory executed per work-item (affected by flow control). |
LDSInsts | The average number of instructions to/from the LDS executed per work-item (affected by flow control). |
VALUUtilization | The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence). |
VALUBusy | The percentage of GPUTime during which vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
SALUBusy | The percentage of GPUTime during which scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal). |
FetchSize | The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
CacheHit | The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal). |
MemUnitBusy | The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound). |
MemUnitStalled | The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad). |
WriteUnitStalled | The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad). |
LDSBankConflict | The percentage of GPUTime the LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad). |
WriteSize | The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account. |
GDSInsts | The average number of instructions to/from the GDS executed per work-item (affected by flow control). This counter is a subset of the VALUInsts counter. |
You can also hover over the counter names to get the descriptions.
To load and save the counter selections to a file, click on the Load Selection and Save Selection buttons.
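The VALUUtilization description notes that a work-group size that is not a multiple of 64 lowers the value. A minimal sketch of that effect follows, assuming a wavefront size of 64 and ignoring thread divergence, so the result is only an upper bound. The function name is hypothetical.

```python
import math

WAVEFRONT_SIZE = 64  # work-items per wavefront, per the "multiple of 64" note


def max_valu_utilization(work_group_size: int) -> float:
    """Upper bound on VALUUtilization (percent) for one work-group.

    Assumption: only lanes left idle by a partially filled last
    wavefront are counted; divergence inside the kernel is ignored.
    """
    if work_group_size <= 0:
        raise ValueError("work_group_size must be positive")
    wavefronts = math.ceil(work_group_size / WAVEFRONT_SIZE)
    return 100.0 * work_group_size / (wavefronts * WAVEFRONT_SIZE)


print(max_valu_utilization(64))   # 100.0
print(max_valu_utilization(96))   # 75.0
print(max_valu_utilization(256))  # 100.0
```

A work-group of 96 work-items occupies two wavefronts, and the second is only half full, so at most 75% of the vector lanes can be active.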
This page contains two subpages that allow you to configure the behavior of the profiler when it performs an application trace.
Rule | Description |
Detect resource leaks | Tracks the reference count for all OpenCL™ objects, and reports any objects which are never released. |
Detect deprecated API calls | Detects calls to OpenCL™ API functions that have been deprecated in recent versions of OpenCL™. |
Detect unnecessary blocking writes | Detects unnecessary blocking write operations. |
Detect non-optimized work size | Detects clEnqueueNDRangeKernel calls that specify a global or local work-group size that is non-optimal for AMD hardware. |
Detect non-optimized data transfer | 1. Detects direct access to Device-Visible Host Memory on a Non-Fusion APU. 2. Detects Host-Visible Device Memory being read back directly to the CPU. |
Detect redundant synchronization | Detects redundant synchronization that results in low host and device utilization. |
Detect failed API calls | Detects OpenCL™ API calls that do not return CL_SUCCESS. Some of the return codes may not be detected unless the Always show API error codes option is checked. |
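The idea behind the Detect resource leaks rule, tracking reference counts and reporting objects that are never released, can be modeled with a toy example. The trace format, the handle names, and the `find_leaks` function below are hypothetical; this is a simplified sketch of reference counting, not the profiler's implementation.

```python
from collections import defaultdict

# A small subset of OpenCL creation calls, enough for the illustration.
CREATE_CALLS = {"clCreateBuffer", "clCreateKernel", "clCreateCommandQueue"}


def find_leaks(trace):
    """Return handles whose reference count is still above zero at the end.

    Assumption: trace is a list of (api_call_name, handle) pairs in call
    order. Create and clRetain* calls add one reference; clRelease* calls
    remove one.
    """
    ref_counts = defaultdict(int)
    for call, handle in trace:
        if call in CREATE_CALLS or call.startswith("clRetain"):
            ref_counts[handle] += 1
        elif call.startswith("clRelease"):
            ref_counts[handle] -= 1
    return sorted(h for h, count in ref_counts.items() if count > 0)


# Hypothetical trace: buf1 is retained once but released only once.
trace = [
    ("clCreateBuffer", "buf0"),
    ("clCreateBuffer", "buf1"),
    ("clRetainMemObject", "buf1"),
    ("clReleaseMemObject", "buf0"),
    ("clReleaseMemObject", "buf1"),
]
print(find_leaks(trace))  # ['buf1']
```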
This page allows you to configure whether the APP Profiler will automatically check for updates, as well as how often it will check for updates.
It also allows you to check for updates manually.
Frequency | Description |
Every startup | The APP Profiler will check for an update each time Visual Studio is started. |
Every day | The APP Profiler will check for an update once each day when Visual Studio is started. |
Every 7 days | The APP Profiler will check for an update once every 7 days when Visual Studio is started. |
Every 30 days | The APP Profiler will check for an update once every 30 days when Visual Studio is started. |
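One way to picture how these frequency settings could translate into a decision at startup is sketched below. The `should_check` function and the idea of a stored last-check timestamp are assumptions for illustration; this page does not describe how the APP Profiler schedules its checks internally.

```python
from datetime import datetime, timedelta

# Minimum time between checks for each setting in the table above.
INTERVALS = {
    "Every startup": timedelta(0),
    "Every day": timedelta(days=1),
    "Every 7 days": timedelta(days=7),
    "Every 30 days": timedelta(days=30),
}


def should_check(frequency, last_check, now):
    """Decide at startup whether an update check is due.

    Assumption: last_check is the time of the previous check, or None
    if no check has been made yet.
    """
    if last_check is None:
        return True
    return now - last_check >= INTERVALS[frequency]


now = datetime(2013, 6, 10, 9, 0)
three_days_ago = now - timedelta(days=3)
print(should_check("Every day", three_days_ago, now))     # True
print(should_check("Every 7 days", three_days_ago, now))  # False
```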
©2013 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.