General Purpose GPU (GPGPU) programming uses the many-core GPU architecture to speed up parallel computation. Data-parallel compute processing is useful when you have large chunks of data and need to perform the same operation on each chunk. Examples include machine learning, scientific simulations, ray tracing and image/video processing.
In this chapter, you’ll perform some simple GPU programming and explore how to use the GPU in ways other than vertex rendering.
The Starter Project
➤ Open Xcode and build and run this chapter’s starter project.
The scene contains a lonely garden gnome. The renderer is a simplified forward renderer with no shadows.
From this render, you might think that the gnome is holding the lamp in his left hand. Depending on how you render him, he can be ambidextrous.
➤ Press 1 on your keyboard.
The view changes to the front view. However, the gnome faces towards positive z instead of toward the camera.
The way the gnome renders is due to both math and file formats. In Chapter 6, “Coordinate Spaces”, you learned that this book uses a left-handed coordinate system. This USD file expects a right-handed coordinate system.
If you want a right-handed gnome, there are a few ways to solve this issue:
1. Rewrite all of your coordinate positioning.
2. In vertex_main, invert position.z when rendering the model.
3. On loading the model, invert position.z.
If all of your models are reversed, option #1 or #2 might be good. However, if you only need some models reversed, option #3 is the way to go. All you need is a fast parallel operation. Thankfully, one is available to you using the GPU.
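All three options come down to negating z for every vertex. A minimal CPU sketch of option #3, using a hypothetical stand-in Position type rather than the project's real vertex layout:

```swift
// Hypothetical stand-in for a loaded vertex position — not the
// project's real layout.
struct Position: Equatable {
  var x, y, z: Float
}

// Option #3 as a single CPU pass: negate z on every vertex to flip
// between right- and left-handed coordinates.
func flipHandedness(_ positions: [Position]) -> [Position] {
  positions.map { Position(x: $0.x, y: $0.y, z: -$0.z) }
}
```

Because each vertex is handled independently, the same operation is a natural fit for a parallel GPU kernel later in the chapter.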
Note: Ideally, you would convert the model as part of your model pipeline rather than in your final app. After flipping the vertices, you can write the model out to a new file.
Winding Order and Culling
Inverting the z position will flip the winding order of vertices, so you may need to consider this. When Model I/O reads in the model, the vertices are in clockwise winding order.
Here, you tell the GPU to expect vertices in counterclockwise order. The default is clockwise. You also tell the GPU to cull any faces that face away from the camera. As a general rule, you should only cull back faces since they're usually hidden, and rendering them isn't necessary.
➤ Build and run the app.
Because the winding order of the mesh is currently clockwise, the GPU is culling the wrong faces, and the model appears to be inside-out. Rotate the model to see this more clearly. Inverting the z coordinates will reverse the winding order.
Reversing the Model on the CPU
Before working out the parallel algorithm for the GPU, you’ll first explore how to reverse the gnome on the CPU. You’ll compare the performance with the GPU result. In the process, you’ll learn how to access and change Swift data buffer contents with pointers.
➤ In the Geometry group, open VertexDescriptor.swift. Take a moment to refresh your memory about the layout in which Model I/O loads the model buffers in defaultLayout.
Some buffers are optional, but you're only interested in the first one, VertexBuffer. It consists of a float3 for Position and a float3 for Normal. You don't need to consider UVs because they're in the next buffer.
Look for operations you could possibly do in parallel and process with a GPU kernel. Within the for loop, you perform the same operation on every vertex independently, so it's a good candidate for GPU compute. Independently is the operative word, as GPU threads perform operations independently from each other.
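The pointer-walking idea can be sketched without Metal. Here, a plain Swift array stands in for the Metal buffer's contents, and the six-floats-per-vertex stride (position, then normal) is an assumption for illustration:

```swift
// A plain Float array stands in for a Metal buffer's contents.
// Assumed layout for illustration: six floats per vertex —
// position (x, y, z) followed by normal (x, y, z).
func invertZ(in buffer: inout [Float], vertexCount: Int) {
  let floatsPerVertex = 6
  buffer.withUnsafeMutableBufferPointer { pointer in
    for vertex in 0..<vertexCount {
      let zIndex = vertex * floatsPerVertex + 2  // position's z
      pointer[zIndex] = -pointer[zIndex]
    }
  }
}
```

The real CPU conversion works the same way, except the pointer comes from the Metal buffer's contents.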
Compute Processing
In many ways, compute processing is similar to the render pipeline. You set up a command queue and a command buffer. In place of the render command encoder, compute uses a compute command encoder. Instead of using vertex or fragment functions in a compute pass, you use a kernel function. Threads are the input to the kernel function, and the kernel function operates on each thread.
Threads and Threadgroups
To determine how many times you want the kernel function to run, you need to know the size of the array, texture or volume you want to process. This size is the grid and consists of threads organized into threadgroups.
The grid is defined in three dimensions: width, height and depth. But often, especially when you're processing images, you'll only work with a 1D or 2D grid. Every point in the grid runs one instance of the kernel function, each in a separate thread.
➤ Look at the following example image:
The image is 512×384 pixels. You need to tell the GPU the number of threads per grid and the number of threads per threadgroup.
Threads per grid: In this example, the grid is two-dimensional, and the number of threads per grid is the image size of 512 by 384.
Threads per threadgroup: Specific to the device, the pipeline state's threadExecutionWidth suggests the best width for performance, and maxTotalThreadsPerThreadgroup specifies the maximum number of threads in a threadgroup. On a device with 512 as the maximum number of threads, and a thread execution width of 32, the optimal 2D threadgroup size would have a width of 32 and a height of 512 / 32 = 16. So the threads per threadgroup will be 32 by 16.
let threadsPerGrid = MTLSize(width: 512, height: 384, depth: 1)
let width = pipelineState.threadExecutionWidth
let threadsPerThreadgroup = MTLSize(
  width: width,
  height: pipelineState.maxTotalThreadsPerThreadgroup / width,
  depth: 1)
computeEncoder.dispatchThreads(
  threadsPerGrid,
  threadsPerThreadgroup: threadsPerThreadgroup)
You specify the threads per grid and let the pipeline state work out the optimal threads per threadgroup.
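The height calculation is plain integer division of the device's thread budget by its preferred width. Isolated from Metal, with 32 and 512 as example device values:

```swift
// Derive a 2D threadgroup size from a device's threadExecutionWidth
// and maxTotalThreadsPerThreadgroup.
func threadgroupSize(
  executionWidth: Int,
  maxTotalThreads: Int
) -> (width: Int, height: Int) {
  (width: executionWidth, height: maxTotalThreads / executionWidth)
}
```

A device reporting a width of 32 and a maximum of 512 threads yields a 32 by 16 threadgroup, matching the earlier discussion.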
Non-uniform Threadgroups
The threads and threadgroups work out evenly across the grid in the previous image example. However, if the grid size isn’t a multiple of the threadgroup size, Metal provides non-uniform threadgroups.
Non-uniform threadgroups are only a feature of the Apple GPU family 4 and onwards. The feature set was introduced with A11 devices running iOS 11. A11 chips first appeared in iPhone 8.
Threadgroups per Grid
You can choose how you split up the grid. Threadgroups have the advantage of executing a group of threads together and also sharing a small chunk of memory. It’s common to organize threads into threadgroups to work on smaller parts of the problem independently from other threadgroups.
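Threadgroups working independently on smaller parts of the problem is essentially a tiling scheme. A CPU sketch of the partitioning, with arbitrary example sizes and a sum as the per-chunk work:

```swift
// Split a 1D workload into independent chunks — one per
// "threadgroup" — and process each chunk separately (here, a sum).
func chunkSums(_ data: [Int], chunkSize: Int) -> [Int] {
  stride(from: 0, to: data.count, by: chunkSize).map { start in
    data[start..<min(start + chunkSize, data.count)].reduce(0, +)
  }
}
```

Each chunk's result depends only on its own slice of the data, which is what lets real threadgroups run without coordinating with each other.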
In the following image, a two-dimensional grid is split into threadgroups, and each threadgroup is split into threads.
In the kernel function, you can locate each pixel on the grid. The red pixel in this grid has its own unique grid position.
You can also uniquely identify each thread within the threadgroup. The two blue threadgroups each have their own position in the grid of threadgroups, and the red pixels are threads located at the same position within their own threadgroups.
You have control over the number of threadgroups. However, you may need to add an extra threadgroup to the size of the grid to make sure at least one threadgroup executes.
Using the earlier image example, you would choose to set up the threadgroups in the compute commands like this:
let width = 32
let height = 16
let threadsPerThreadgroup = MTLSize(
  width: width, height: height, depth: 1)
let gridWidth = 512
let gridHeight = 384
let threadGroupCount = MTLSize(
  width: (gridWidth + width - 1) / width,
  height: (gridHeight + height - 1) / height,
  depth: 1)
computeEncoder.dispatchThreadgroups(
  threadGroupCount,
  threadsPerThreadgroup: threadsPerThreadgroup)
If the size of your data doesn't match the size of the grid, you may have to perform boundary checks in the kernel function.
In the following example, with a threadgroup size of 32 by 16 threads, the number of threadgroups necessary to process the image works out to 16 by 24. You'd have to check that the threadgroup isn't using threads that are off the edge of the image.
The threads that are off the edge are underutilized. That is, they're threads that you dispatched, but there was no work for them to do.
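The two ideas in this section — rounding the threadgroup count up, and guarding against the overhang — can be sketched independently of Metal. The 512×384 image and 32×16 threadgroup are the chapter's example sizes:

```swift
// Threadgroups needed to cover one grid dimension, rounding up so
// the whole grid is covered.
func threadgroups(covering gridSize: Int, groupSize: Int) -> Int {
  (gridSize + groupSize - 1) / groupSize
}

// The boundary check a kernel performs so off-grid threads do nothing.
func isInsideGrid(x: Int, y: Int, width: Int, height: Int) -> Bool {
  x < width && y < height
}
```

For a 500-pixel-wide image with 32-wide groups, you still dispatch 16 groups; the last 12 threads of the final group fail the boundary check and do no work.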
Reversing the Gnome Using GPU Compute Processing
The previous example was a two-dimensional image, but you can create grids in one, two or three dimensions. The gnome problem acts on an array in a buffer and will require a one-dimensional grid.
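For a 1D grid over a vertex array, the sizing collapses to widths only. A sketch of the arithmetic, where the vertex count and execution width are made-up example values:

```swift
// 1D dispatch sizing: one thread per vertex, with the device's
// execution width as the threadgroup width. The threadgroup count
// (for a dispatchThreadgroups-style call) rounds up to cover
// every vertex.
func dispatchSizes(
  vertexCount: Int,
  executionWidth: Int
) -> (threadsPerGrid: Int, threadsPerThreadgroup: Int, threadgroups: Int) {
  (threadsPerGrid: vertexCount,
   threadsPerThreadgroup: executionWidth,
   threadgroups: (vertexCount + executionWidth - 1) / executionWidth)
}
```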
➤ In the Geometry group, open Model.swift, and add a new method to Model:
func convertMesh() {
  // 1
  guard let commandBuffer =
    Renderer.commandQueue.makeCommandBuffer(),
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()
  else { return }
  // 2
  let startTime = CFAbsoluteTimeGetCurrent()
  // 3
  let pipelineState: MTLComputePipelineState
  do {
    // 4
    guard let kernelFunction =
      Renderer.library.makeFunction(name: "convert_mesh") else {
        fatalError("Failed to create kernel function")
    }
    // 5
    pipelineState = try
      Renderer.device.makeComputePipelineState(
        function: kernelFunction)
  } catch {
    fatalError(error.localizedDescription)
  }
  computeEncoder.setComputePipelineState(pipelineState)
}
You set up the grid and threadgroups the same way as in the earlier image example. Since your model's vertices are a one-dimensional array, you only set up width. Then, you extract the device-dependent thread execution width from the pipeline state to set the number of threads in a threadgroup. The grid size is the number of vertices in the model.
The dispatch call is the compute command encoder's equivalent of the render command encoder's draw call. The GPU will execute the kernel function specified in the pipeline state when you commit the command buffer.
Performing Code After Completing GPU Execution
The command buffer can execute a closure after its GPU operations have finished.
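The pattern is the familiar completion-handler one: capture a start time, do the work, and hand the elapsed time to a closure once the work finishes. A Metal-free sketch of that shape, using Date instead of CFAbsoluteTimeGetCurrent so it runs anywhere:

```swift
import Foundation

// Run `work`, then hand the elapsed seconds to `completion` — the
// same shape as a command buffer's completed handler.
func timed(work: () -> Void, completion: (Double) -> Void) {
  let start = Date()
  work()
  completion(Date().timeIntervalSince(start))
}
```

With a real command buffer, the work is asynchronous on the GPU, so the closure runs later, on a background thread.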
➤ Outside the for loop, add this code at the end of convertMesh():
You create a closure that calculates the amount of time the procedure takes and prints it out. You then commit the command buffer to the GPU.
The Kernel Function
That completes the Swift setup. You simply specify the kernel function to the pipeline state and create an encoder using that pipeline state. With that, it’s only necessary to give the thread information to the encoder. The rest of the action takes place inside the kernel function.
➤ In the Shaders group, create a new Metal file named ConvertMesh.metal, and add:
A kernel function can't have a return value. You pass in the vertex buffer and identify the thread ID using the thread_position_in_grid attribute. You then invert the vertex's z position.
This function will execute for every vertex in the model.
➤ Open GameScene.swift. In init(), replace convertMesh(gnome) with:
gnome.convertMesh()
➤ Build and run the app. Press the 1 key for the front view of the model.
The console prints out the time of the GPU processing. You've now had your first experience with data-parallel processing, and the gnome is now right-handed and faces toward the camera.
Compare the time with the CPU conversion. Always check the comparative times: setting up a GPU pipeline is a time cost, and for small operations, it may take much more time to perform the operation on the GPU.
Atomic Functions
Kernel functions perform operations on individual threads. However, you may want to perform an operation that requires information from other threads. For example, you might want to find out the total number of vertices your kernel worked on.
Your kernel function operates on each thread independently, and those threads update each vertex position simultaneously. If you give the kernel function a variable to store the total in a buffer, the function can increment the total, but other threads will be doing the same thing simultaneously. Therefore, you won't get the correct total.
An atomic operation works in shared memory and is visible to other threads.
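You can see why unsynchronized increments lose updates by emulating the situation on the CPU. This sketch uses a lock to play the role of a Metal atomic add — a simulation of the concept, not the GPU API:

```swift
import Dispatch
import Foundation

// Emulates an atomic counter: many concurrent increments, each made
// indivisible by a lock — the same guarantee an atomic operation
// gives a GPU kernel.
final class Counter {
  private var value = 0
  private let lock = NSLock()
  func increment() {
    lock.lock()
    value += 1
    lock.unlock()
  }
  var total: Int {
    lock.lock()
    defer { lock.unlock() }
    return value
  }
}

func lockedCount(iterations: Int) -> Int {
  let counter = Counter()
  DispatchQueue.concurrentPerform(iterations: iterations) { _ in
    counter.increment()
  }
  return counter.total
}
```

Without the lock, concurrent read-modify-write cycles would overlap and the final total would come up short — the exact problem atomic functions solve in a kernel.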
➤ Open Model.swift. In convertMesh(), add the following code before the call to commit:
Here, you create a buffer to hold the total number of vertices. You bind the buffer to a pointer and set the contents to zero. You then send the buffer to the GPU.
➤ Still in convertMesh(), add this code to the command buffer's completion handler:
Key Points
GPU compute, or general purpose GPU programming, helps you perform data operations in parallel without using the more specialized rendering pipeline.
You can move any task that operates on multiple items independently to the GPU. Later, you’ll see that you can even move the repetitive task of rendering a scene to a compute shader.
GPU memory is good at simple parallel operations, and with Apple silicon, you can keep chained operations in tile memory instead of moving them back to system memory.
Compute processing uses a compute pipeline with a kernel function.
The kernel function operates on a grid of threads organized into threadgroups. This grid can be 1D, 2D or 3D.