Chapters

Hide chapters

Metal by Tutorials

Third Edition · macOS 12 · iOS 15 · Swift 5.5 · Xcode 13

Section I: Beginning Metal

Section 1: 10 chapters
Show chapters Hide chapters

Section II: Intermediate Metal

Section 2: 8 chapters
Show chapters Hide chapters

Section III: Advanced Metal

Section 3: 8 chapters
Show chapters Hide chapters

32. Best Practices
Written by Marius Horga & Caroline Begbie

Heads up... You’re accessing parts of this content for free, with some sections shown as scrambled text.

Heads up... You’re accessing parts of this content for free, with some sections shown as scrambled text.

Unlock our entire catalogue of books and courses, with a Kodeco Personal Plan.

Unlock now

When you want to squeeze the very last ounce of performance from your app, you should always remember to follow a golden set of best practices, which are categorized into three major parts: general performance, memory bandwidth and memory footprint. This chapter will guide you through all three.

General Performance Best Practices

The next five best practices are general and apply to the entire pipeline.

Choose the Right Resolution

The game or app UI should be at native or close to native resolution so that the UI will always look crisp no matter the display size. Also, it is recommended (albeit not mandatory) that all resources have the same resolution. You can check the resolutions in the GPU Debugger on the dependency graph. Below is a partial view of the dependency graph from the multi-pass render in Chapter 14, “Deferred Rendering”:

The dependency graph
Nfi fidowkuyzt hlinb

Minimize Non-Opaque Overdraw

Ideally, you’ll want to only draw each pixel once, which means you’ll want only one fragment shader process per pixel.

Submit GPU Work Early

You can reduce latency and improve the responsiveness of your renderer by making sure all of the off-screen GPU work is done early and is not waiting for the on-screen part to start. You can do that by using two or more command buffers per frame:

create off-screen command buffer
encode work for the GPU
commit off-screen command buffer
...
get the drawable
create on-screen command buffer
encode work for the GPU
present the drawable
commit on-screen command buffer

Stream Resources Efficiently

All resources should be allocated at launch time — if they’re available — because that will take time and prevent render stalls later. If you need to allocate resources at runtime because the renderer streams them, you should make sure you do that from a dedicated thread.

Design for Sustained Performance

You should test your renderer under a serious thermal state. This can improve the overall thermals of the device, as well as the stability and responsiveness of your renderer.

Memory Bandwidth Best Practices

Since memory transfers for render targets and textures are costly, the next six best practices are targeted to memory bandwidth and how to use shared and tiled memory more efficiently.

Compress Texture Assets

Compressing textures is very important because sampling large textures may be inefficient. For that reason, you should generate mipmaps for textures that can be minified. You should also compress large textures to accommodate the memory bandwidth needs. There are various compression formats available. For example, for older devices, you could use PVRTC, and for newer devices, you could use ASTC. Review Chapter 8, “Textures”, for how to create mipmaps and change texture formats in the asset catalog.

Optimize for Faster GPU Access

You should configure your textures correctly to use the appropriate storage mode depending on the use case. Use the private storage mode so only the GPU has access to the texture data, allowing optimization of the contents:

textureDescriptor.storageMode = .private
textureDescriptor.usage = [ .shaderRead, .renderTarget ]
let texture = device.makeTexture(descriptor: textureDescriptor)
textureDescriptor.storageMode = .shared
textureDescriptor.usage = .shaderRead
let texture = device.makeTexture(descriptor: textureDescriptor)
// update texture data
texture.replace(
  region: region,
  mipmapLevel: 0,
  withBytes: bytes,
  bytesPerRow: bytesPerRow)
let blitCommandEncoder = commandBuffer.makeBlitCommandEncoder()
blitCommandEncoder
  .optimizeContentsForGPUAccess(texture: texture)
blitCommandEncoder.endEncoding()

Choose the Right Pixel Format

Choosing the correct pixel format is crucial. Not only will larger pixel formats use more bandwidth, but the sampling rate also depends on the pixel format. You should try to avoid using pixel formats with unnecessary channels and also try to lower precision whenever possible. You’ve generally been using the RGBA8Unorm pixel format in this book. However, when you needed greater accuracy for the G-Buffer in Chapter 14, “Deferred Rendering”, you used a 16-bit pixel format. Again, you can use the Metal Memory Viewer to see the pixel formats for textures.

Optimize Load and Store Actions

Load and store actions for render targets can also affect bandwidth. If you have a suboptimal configuration of your pipelines caused by unnecessary load/store actions, you might create false dependencies. An example of optimized configuration would be as follows:

renderPassDescriptor.colorAttachments[0].loadAction = .clear
renderPassDescriptor.colorAttachments[0].storeAction = .dontCare
Redundant store action
Cenechelx pcibi oggeeb

Optimize Multi-Sampled Textures

iOS devices have very fast multi-sampled render targets (MSAA) because they resolve from Tile Memory so it is best practice to consider MSAA over native resolution. Also, make sure not to load or store the MSAA texture and set its storage mode to memoryless:

textureDescriptor.textureType = .type2DMultisample
textureDescriptor.sampleCount = 4
textureDescriptor.storageMode = .memoryless
let msaaTexture = device.makeTexture(descriptor: textureDescriptor)
renderPassDesc.colorAttachments[0].texture = msaaTexture
renderPassDesc.colorAttachments[0].loadAction = .clear
renderPassDesc.colorAttachments[0].storeAction = .multisampleResolve

Leverage Tile Memory

Metal provides access to Tile Memory for several features such as programmable blending, image blocks and tile shaders. Deferred shading requires storing the G-Buffer in a first pass and then sampling from its textures in the second lighting pass where the final color accumulates into a render target. This is very bandwidth-heavy.

Memory Footprint Best Practices

Use Memoryless Render Targets

As mentioned previously, you should be using memoryless storage mode for all transient render targets that do not need a memory allocation, that is, are not loaded from or stored to memory:

textureDescriptor.storageMode = .memoryless
textureDescriptor.usage = [.shaderRead, .renderTarget]
// for each G-Buffer texture
textureDescriptor.pixelFormat = gBufferPixelFormats[i]
gBufferTextures[i] =
  device.makeTexture(descriptor: textureDescriptor)
renderPassDescriptor.colorAttachments[i].texture =
  gBufferTextures[i]
renderPassDescriptor.colorAttachments[i].loadAction = .clear
renderPassDescriptor.colorAttachments[i].storeAction = .dontCare

Avoid Loading Unused Assets

Loading all the assets into memory will increase the memory footprint, so you should consider the memory and performance trade-off and only load all the assets that you know will be used. The GPU frame capture Memory Viewer will show you any unused resources.

Use Smaller Assets

You should only make the assets as large as necessary and consider the image quality and memory trade-off of your asset sizes. Make sure that both textures and meshes are compressed. You may want to only load the smaller mipmap levels of your textures or use lower level of detail meshes for distant objects.

Simplify memory-intensive effects

Some effects may require large off-screen buffers, such as Shadow Maps and Screen Space Ambient Occlusion, so you should consider the image quality and memory trade-off of all of those effects, potentially lower the resolution of all these large off-screen buffers and even disable the memory-intensive effects altogether when you are memory constrained.

Use Metal Resource Heaps

Rendering a frame may require a lot of intermediate memory, especially if your game becomes more complex in the post-process pipeline, so it is very important to use Metal Resource Heaps for those effects and alias as much of that memory as possible. For example, you may want to reutilize the memory for resources that have no dependencies, such as those for Depth of Field or Screen-Space Ambient Occlusion.

Mark Resources as Volatile

Temporary resources may become a large part of the memory footprint and Metal will allow you to set the purgeable state of all the resources explicitly. You will want to focus on your caches that hold mostly idle memory and carefully manage their purgeable state, like in this example:

// for each texture in the cache
texturePool[i].setPurgeableState(.volatile)
// later on...
if (texturePool[i].setPurgeableState(.nonVolatile) == .empty) {
  // regenerate texture
}

Manage the Metal PSOs

Pipeline State Objects (PSOs) encapsulate most of the Metal render state. You create them using a descriptor that contains vertex and fragment functions as well as other state descriptors. All of these will get compiled into the final Metal PSO.

Where to Go From Here?

Getting the last ounce of performance out of your app is paramount. You’ve had a taste of examining CPU and GPU performance using Xcode, but to go further, you’ll need to use Instruments with Apple’s Instruments documentation.

Have a technical question? Want to report a bug? You can ask questions and report bugs to the book authors in our official book forum here.
© 2025 Kodeco Inc.

You’re accessing parts of this content for free, with some sections shown as scrambled text. Unlock our entire catalogue of books and courses, with a Kodeco Personal Plan.

Unlock now