There are many ways to optimize Metal graphics code for maximum performance. Here’s how to start getting your code in better shape for the Metal platform.
Apple GPU architecture
Apple GPUs are tile-based deferred renderers, which means they use two main stages: tiling and rendering. The general rendering pipeline is shown below.
You can think of the first phase as the one where the geometry is processed and sorted into screen tiles, and the second as the one where the pixels in each tile are rendered.
In most modern Apple GPU software, geometry is calculated and divided into meshes and polygons, and then converted into a pixel image, one image per frame.
Modern Apple GPUs have specific subsections in each core that handle shaders, textures, pixel processing, and dedicated tile memory. Each core uses these four regions during rendering.
Each frame is rendered using multiple passes, running on multiple GPU cores, with each core handling multiple tasks. In general, the more cores, the better the performance.
GPU Counters
GPU counters are used to measure this performance.
GPU counters track how heavily the GPU is being used and show whether it has too much or too little work. They also help you find performance bottlenecks.
Finally, GPU counters help you identify the commands that take the longest, so you can optimize them to improve performance.
There are over one hundred and fifty types of Apple GPU performance counters, and covering them all is beyond the scope of this article.
Making sense of all that performance counter data can be difficult. To help, use Metal System Trace in Instruments and the Metal Debugger built into Xcode.
Four groups of Metal GPU counters are especially useful for optimizing Metal in your applications and games. They are:
Performance limiters. Memory bandwidth. Occupancy. Hidden surface removal.
Performance Limiters
Performance limiters, or limiter counters, measure the activity of multiple GPU subsystems, showing both the work being done and the stalls that may be blocking or slowing parallel execution.
Modern GPUs perform arithmetic, memory access, and rasterization in parallel. Performance limiters help you identify which of these is the bottleneck slowing down your code.
You can view performance limiters in Apple’s Instruments app and use them to guide your optimization. Instruments has half a dozen different performance limiters.
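As a rough illustration of how limiter readings point to a bottleneck, the sketch below ranks a set of limiter percentages and reports the highest one. The limiter names and values are made up for the example and are not actual Instruments output; in practice you would read the real figures from Instruments.

```python
def top_limiters(limiters, count=3):
    """Return the `count` highest limiter readings as (name, percent) pairs."""
    ranked = sorted(limiters.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]


# Hypothetical readings for one render pass, in percent of peak per subsystem.
sample = {
    "ALU": 42.0,
    "Texture Sample": 91.5,
    "Buffer Read": 18.3,
    "Fragment Input Interpolation": 7.9,
}

name, percent = top_limiters(sample, count=1)[0]
print(f"Most likely bottleneck: {name} at {percent:.1f}%")
```

A single high limiter suggests where to look first. Several limiters that are all low may instead point to the GPU waiting on something else, such as too little work being submitted.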
Memory Bandwidth Counters
GPU memory bandwidth counters measure data transfer between the GPU and system memory. The GPU accesses system memory whenever buffers or textures are accessed.
Keep in mind that system memory is also cached by a system-level cache, so you may sometimes notice short bursts of memory bandwidth that are higher than the actual DRAM transfer speed. This is normal.
If the memory bandwidth counter shows a high value, memory transfers are probably slowing down rendering. To relieve these bottlenecks, you can do several things.
One way to reduce memory bandwidth pressure is to shrink your working data sets. This speeds things up because less data is transferred to and from system memory.
Another way is to load only the data needed by the current render pass, and store only the data needed by future render passes. This also reduces the amount of data transferred.
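To get a feel for how much traffic unnecessary loads and stores can add, here is a back-of-the-envelope sketch. In Metal these choices are made through the load and store actions on each render pass attachment; the code below does not call Metal at all, it only does the arithmetic. The resolution, pixel sizes, and function names are assumptions for illustration, and the estimate ignores caches and compression.

```python
def attachment_bytes(width, height, bytes_per_pixel):
    """Size in bytes of one full-screen render target."""
    return width * height * bytes_per_pixel


def pass_traffic(attachments):
    """Estimate system-memory traffic for one render pass, in bytes.

    `attachments` is a list of (size_bytes, load, store) tuples, where
    `load` means the old contents are read in at the start of the pass
    and `store` means the results are written back out at the end.
    """
    return sum(size * (int(load) + int(store)) for size, load, store in attachments)


color = attachment_bytes(1920, 1080, 4)  # assumed 8-bit RGBA color target
depth = attachment_bytes(1920, 1080, 4)  # assumed 32-bit depth target

# Loading and storing everything, whether or not it is needed.
naive = pass_traffic([(color, True, True), (depth, True, True)])

# Color is cleared rather than loaded; depth is cleared and never stored,
# because in this example no later pass reads it.
trimmed = pass_traffic([(color, False, True), (depth, False, False)])

print(f"naive: {naive / 1e6:.1f} MB, trimmed: {trimmed / 1e6:.1f} MB per pass")
```

Under these assumptions the trimmed pass moves a quarter of the data of the naive one, and that saving repeats every frame.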
You can also use block-based texture compression (ASTC) to reduce the size of texture resources and lossless compression for textures generated at runtime.
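The effect of ASTC on texture size can be estimated with simple arithmetic, because every ASTC block occupies 128 bits (16 bytes) regardless of how many texels it covers. The sketch below compares one mip level stored as uncompressed RGBA8 against a few ASTC block footprints. The texture size and helper names are chosen for illustration only.

```python
import math

ASTC_BLOCK_BYTES = 16  # every ASTC block is 128 bits, whatever its footprint


def uncompressed_bytes(width, height, bytes_per_pixel=4):
    """Size of one mip level stored uncompressed (RGBA8 by default)."""
    return width * height * bytes_per_pixel


def astc_bytes(width, height, block_w, block_h):
    """Size of one mip level stored as ASTC with the given block footprint."""
    blocks_x = math.ceil(width / block_w)  # partial blocks still cost a full block
    blocks_y = math.ceil(height / block_h)
    return blocks_x * blocks_y * ASTC_BLOCK_BYTES


width, height = 2048, 2048
raw = uncompressed_bytes(width, height)
for block_w, block_h in [(4, 4), (6, 6), (8, 8)]:
    packed = astc_bytes(width, height, block_w, block_h)
    print(f"ASTC {block_w}x{block_h}: {packed / 2**20:.2f} MiB "
          f"vs {raw / 2**20:.2f} MiB uncompressed")
```

Larger block footprints save more memory and bandwidth at the cost of image quality, so the right choice depends on the texture's content.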
Occupancy
Occupancy measures how many threads are currently running out of the total thread capacity. 100% occupancy means that a given GPU is currently maxed out in terms of the number of threads it can run at once.
The occupancy counter measures the percentage of total thread capacity used by the GPU. This value is the sum of compute, vertex, and fragment occupancy.
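The relationship between these numbers is simple enough to sketch. The thread counts and percentages below are hypothetical, as are the function names; the real values come from the counters shown in Xcode and Instruments.

```python
def occupancy_percent(running_threads, thread_capacity):
    """Share of the GPU's thread capacity currently in use, in percent."""
    if thread_capacity <= 0:
        raise ValueError("thread_capacity must be positive")
    return 100.0 * running_threads / thread_capacity


def total_occupancy(compute, vertex, fragment):
    """Overall occupancy as the sum of the three per-stage percentages."""
    return compute + vertex + fragment


# Hypothetical GPU that can keep 24576 threads in flight, with 6144 running.
print(occupancy_percent(6144, 24576))

# Hypothetical per-stage readings, in percent.
print(total_occupancy(compute=10, vertex=15, fragment=50))
```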
Hidden Surface Removal
Hidden surface removal (HSR) typically occurs in the middle of each render pass, before fragment processing and shortly after the tiled vertex buffer is sent to the GPU for rasterization.
Depth buffers and hidden surface removal are used to eliminate any surfaces that are not visible to the camera in the current scene. This improves performance because these surfaces don’t need to be shaded.
For example, surfaces on the back of opaque 3D objects don’t need to be drawn because the camera (and viewer) never sees them.
Surfaces that are hidden by other 3D objects in front of them relative to the camera are also removed.
GPU counters can be used during hidden surface removal to determine the total number of pixels rasterized, the number of fragment shader invocations, and the number of pixels stored.
GPU counters can also help you minimize blending, which also costs performance.
To get the most out of hidden surface removal, draw objects grouped by visibility state: opaque meshes first, then translucent meshes. Avoid interleaving opaque and translucent meshes.
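One way to read these three counters together is as ratios. The sketch below computes an overdraw figure (fragment shader invocations per pixel stored) and the share of rasterized pixels that went on to be shaded. Treat both formulas as working assumptions for illustration rather than official definitions; the counter values are invented.

```python
def overdraw(fragment_invocations, pixels_stored):
    """Fragment shader invocations per pixel actually stored."""
    if pixels_stored <= 0:
        raise ValueError("pixels_stored must be positive")
    return fragment_invocations / pixels_stored


def shaded_fraction(pixels_rasterized, fragment_invocations):
    """Share of rasterized pixels that went on to run a fragment shader."""
    if pixels_rasterized <= 0:
        raise ValueError("pixels_rasterized must be positive")
    return fragment_invocations / pixels_rasterized


# Hypothetical counter readings for one render pass.
rasterized, invocations, stored = 8_000_000, 3_000_000, 2_000_000

print(f"overdraw: {overdraw(invocations, stored):.2f}")
print(f"shaded:   {shaded_fraction(rasterized, invocations):.1%}")
```

An overdraw value close to 1 means almost every shaded fragment ended up on screen. Higher values suggest that some pixels are being shaded more than once, for example because of blending.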
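The ordering advice can be sketched as a simple stable sort that moves opaque meshes ahead of translucent ones without disturbing the order inside each group. The `Mesh` type and its fields are hypothetical stand-ins for whatever your renderer uses to describe a draw call.

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    name: str
    translucent: bool


def order_for_hsr(meshes):
    """Opaque meshes first, translucent after.

    Python's sort is stable, so the submitted order is preserved within
    each group (useful if translucent meshes were already depth-sorted).
    """
    return sorted(meshes, key=lambda mesh: mesh.translucent)


def state_switches(meshes):
    """Count how often consecutive meshes flip between opaque and translucent."""
    return sum(1 for a, b in zip(meshes, meshes[1:]) if a.translucent != b.translucent)


scene = [
    Mesh("glass", True),
    Mesh("wall", False),
    Mesh("smoke", True),
    Mesh("floor", False),
]

ordered = order_for_hsr(scene)
print([mesh.name for mesh in ordered])
print(state_switches(scene), "switches before,", state_switches(ordered), "after")
```

With two visibility groups, the sorted list can never switch between opaque and translucent more than once, which is the point of the advice above.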
Resources
To get started with Metal optimization, be sure to watch the WWDC videos “Optimize Metal apps and games with GPU counters” from WWDC20, “Harness Apple GPUs with Metal” also from WWDC20, and “Delivering Optimized Metal Apps and Games” from WWDC19.
Then read “Capturing a Metal Workload in Xcode” and “Metal Debugging Types” on the Metal Debugger pages on the Apple Developer Documentation website.
The Metal Debugger documentation also has a section on “Analyzing a Metal Workload.”
You’ll definitely want to spend a lot of time in the Xcode Metal Debugger and Trace documentation to learn in detail how the various GPU counters and performance graphs work. Without them, you won’t be able to get a detailed view of what’s actually happening in your Metal code.
Regarding compressed textures, it’s also worth reading about Adaptive Scalable Texture Compression (ASTC) and how it works in modern rendering pipelines.
Optimizing Metal performance is a broad and complex topic. We have only scratched the surface here and will continue to explore it in future articles.