Vision Tutorial for iOS: Detect Body and Hand Pose

Learn how to detect the number of fingers shown to the camera with help from the Vision framework. By Saeed Taheri.

Machine learning is everywhere, so it came as no surprise when Apple announced the Core ML framework in 2017. Core ML comes with many tools, including Vision, an image analysis framework. Vision analyzes still images to detect faces, read barcodes, track objects and more. Over the years, Apple has added many cool features to this framework, including the Hand and Body Detection APIs introduced in 2020.

In this tutorial, you’ll use these Hand and Body Detection APIs from the Vision framework to bring a touch of magic to a game called StarCount. You’ll count the number of stars falling from the sky using your hands and fingers.

StarCount needs a device with a front-facing camera to function, so you can’t follow along with a simulator.

Finally, it would help if you could prop up your device somewhere, because you’ll need both hands to match those high numbers!

Note: This Vision tutorial assumes a working knowledge of SwiftUI, UIKit and Combine. For more information about SwiftUI, see SwiftUI: Getting Started.

Getting Started

Download the starter project using the Download Materials button at the top or bottom of this page. Then, open the starter project in Xcode.

Build and run. Tap Rain in the top left corner and enjoy the scene. Don’t forget to wish on those stars!

Vision Tutorial Starting page with rain button

The magic of raining stars is in StarAnimatorView.swift. It uses UIKit Dynamics APIs. Feel free to take a look if you’re interested.

The app looks nice, but imagine how much better it would look if it showed live video of you in the background! Besides, Vision can’t count your fingers if the phone can’t see them.

Getting Ready for Detection

Vision uses still images for detection. Believe it or not, what you see in the camera viewfinder is essentially a stream of still images. Before you can detect anything, you need to integrate a camera session into the game.
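
Here’s a minimal, hypothetical sketch of the Vision side, just to show the shape of the API you’re working toward: a hand pose request run against a single image. Nothing here goes into the project yet, and pixelBuffer is a stand-in for a real camera frame:

import Vision

// A sketch only: run a hand pose request on one still image.
// `pixelBuffer` stands in for a frame you'd later get from the camera.
func countHands(in pixelBuffer: CVPixelBuffer) throws -> Int {
  let request = VNDetectHumanHandPoseRequest()
  request.maximumHandCount = 2

  // The handler performs Vision requests against a single image.
  let handler = VNImageRequestHandler(
    cvPixelBuffer: pixelBuffer,
    orientation: .up,
    options: [:]
  )
  try handler.perform([request])

  // Each observation represents one detected hand.
  let observations = request.results as? [VNHumanHandPoseObservation] ?? []
  return observations.count
}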

Creating the Camera Session

To show a camera preview in an app, you use AVCaptureVideoPreviewLayer, a subclass of CALayer. You use this preview layer in conjunction with a capture session.

CALayer and its subclasses don’t have a SwiftUI counterpart, so you need to wrap a UIKit view to use them in SwiftUI. Fortunately, Apple provides an easy way to do this with UIViewRepresentable and UIViewControllerRepresentable.

As a matter of fact, StarAnimator is a UIViewRepresentable, which is what lets you use StarAnimatorView, a subclass of UIView, in SwiftUI.
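
In case you haven’t seen this pattern before, the wrapping looks roughly like the sketch below. This is only an illustration of the pattern; the starter project’s real StarAnimator has more to it:

import SwiftUI

// A sketch of the UIViewRepresentable pattern only; the starter
// project's actual StarAnimator does more than this.
struct StarAnimator: UIViewRepresentable {
  func makeUIView(context: Context) -> StarAnimatorView {
    // Create the UIKit view that SwiftUI will host.
    StarAnimatorView()
  }

  func updateUIView(_ uiView: StarAnimatorView, context: Context) {
    // Push SwiftUI state changes into the UIKit view here, if needed.
  }
}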

Note: You can learn more about integrating UIKit with SwiftUI in this great video course: Integrating UIKit & SwiftUI.

You’ll create three files in the following section: CameraPreview.swift, CameraViewController.swift and CameraView.swift. Start with CameraPreview.swift.

Create a new file named CameraPreview.swift in the StarCount group and add:

// 1
import UIKit
import AVFoundation

final class CameraPreview: UIView {
  // 2
  override class var layerClass: AnyClass {
    AVCaptureVideoPreviewLayer.self
  }
  
  // 3
  var previewLayer: AVCaptureVideoPreviewLayer {
    layer as! AVCaptureVideoPreviewLayer 
  }
}

Here, you:

  1. Import UIKit since CameraPreview is a subclass of UIView. You also import AVFoundation since AVCaptureVideoPreviewLayer is part of this module.
  2. Next, you override the static layerClass property. This makes the view’s root layer an instance of AVCaptureVideoPreviewLayer.
  3. Then, you create a computed property called previewLayer that force casts the view’s root layer to the type you specified in step two. You’ll use this property to access the layer directly when you work with it later.

Next, you’ll create a view controller to manage your CameraPreview.

The camera capture code from AVFoundation is designed to work with UIKit, so to get it working nicely in your SwiftUI app you need to make a view controller and wrap it in UIViewControllerRepresentable.

Create CameraViewController.swift in the StarCount group and add:

import UIKit

final class CameraViewController: UIViewController {
  // 1
  override func loadView() {
    view = CameraPreview()
  }
  
  // 2
  private var cameraView: CameraPreview { view as! CameraPreview }
}

Here you:

  1. Override loadView to make the view controller use CameraPreview as its root view.
  2. Create a computed property called cameraView to access the root view as a CameraPreview. The force cast is safe here because you assigned an instance of CameraPreview to view in step one.

Now, you’ll make a SwiftUI view to wrap your new view controller, so you can use it in StarCount.

Create CameraView.swift in the StarCount group and add:

import SwiftUI

// 1
struct CameraView: UIViewControllerRepresentable {
  // 2
  func makeUIViewController(context: Context) -> CameraViewController {
    let cvc = CameraViewController()
    return cvc
  }

  // 3
  func updateUIViewController(
    _ uiViewController: CameraViewController, 
    context: Context
  ) {
  }
}

This is what’s happening in the code above:

  1. You create a struct called CameraView which conforms to UIViewControllerRepresentable. This is a protocol for making SwiftUI View types that wrap UIKit view controllers.
  2. You implement the first protocol method, makeUIViewController(context:). Here, you initialize an instance of CameraViewController and perform any one-time setup.
  3. updateUIViewController(_:context:) is the other required method of this protocol. It’s where you’d update the view controller in response to changes in the SwiftUI data or hierarchy. For this app, you don’t need to do anything here.

After all this work, it’s time to use CameraView in ContentView.

Open ContentView.swift. Insert CameraView at the beginning of the ZStack in body:

CameraView()
  .edgesIgnoringSafeArea(.all)
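
If you’re not sure where that goes, body ends up shaped roughly like this. Only the CameraView() lines are new; everything else in the ZStack comes from the starter project, so treat the comment below as a placeholder rather than literal code:

var body: some View {
  ZStack {
    // New: the camera preview sits at the back of the ZStack.
    CameraView()
      .edgesIgnoringSafeArea(.all)

    // The starter project's existing views, such as the star
    // animation and the Rain button, stay above the preview.
  }
}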

Phew! That was a long section. Build and run to see your camera preview.

Starting page after integrating first version of CameraView

Huh! All that work and nothing changed! Why? There’s another piece of the puzzle to add before the camera preview works: an AVCaptureSession. You’ll add that next.

Connecting to the Camera Session

The changes you’ll make here look long, but don’t be afraid: they’re mostly boilerplate code.

Open CameraViewController.swift. Add the following after import UIKit:

import AVFoundation 

Then, add an instance property of type AVCaptureSession inside the class:

private var cameraFeedSession: AVCaptureSession?

It’s good practice to run the capture session when this view controller appears on screen and stop the session when the view is no longer visible, so add the following:

override func viewDidAppear(_ animated: Bool) {
  super.viewDidAppear(animated)
  
  do {
    // 1
    if cameraFeedSession == nil {
      // 2
      try setupAVSession()
      // 3
      cameraView.previewLayer.session = cameraFeedSession
      cameraView.previewLayer.videoGravity = .resizeAspectFill
    }
    
    // 4
    cameraFeedSession?.startRunning()
  } catch {
    print(error.localizedDescription)
  }
}

// 5
override func viewWillDisappear(_ animated: Bool) {
  cameraFeedSession?.stopRunning()
  super.viewWillDisappear(animated)
}

func setupAVSession() throws {
}

Here’s a code breakdown:

  1. In viewDidAppear(_:), you check to see if you’ve already initialized cameraFeedSession.
  2. You call setupAVSession(), which is empty for now, but you’ll implement it shortly.
  3. Then, you assign the session to the previewLayer of cameraView and set the layer’s videoGravity so the video fills the preview area.
  4. Next, you start running the session. This makes the camera feed visible.
  5. In viewWillDisappear(_:), turn off the camera feed to preserve battery life and be a good citizen.

Now, you’ll add the missing code to prepare the camera.

Add a new property for the dispatch queue on which Vision will process the camera samples:

private let videoDataOutputQueue = DispatchQueue(
  label: "CameraFeedOutput", 
  qos: .userInteractive
)

Add an extension to make the view controller conform to AVCaptureVideoDataOutputSampleBufferDelegate:

extension 
CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
}
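
The extension body is empty for now. For context, AVFoundation delivers each captured frame to the delegate method sketched below, calling it on videoDataOutputQueue; this is where the Vision work will eventually hook in, so you don’t need to add it yet:

// A sketch of the delegate method that receives each captured frame.
// AVFoundation calls it on videoDataOutputQueue for every new sample.
func captureOutput(
  _ output: AVCaptureOutput,
  didOutput sampleBuffer: CMSampleBuffer,
  from connection: AVCaptureConnection
) {
  // Later, a Vision request handler can analyze sampleBuffer here.
}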

With those two things in place, you can now replace the empty setupAVSession():

func setupAVSession() throws {
  // 1
  guard let videoDevice = AVCaptureDevice.default(
    .builtInWideAngleCamera, 
    for: .video, 
    position: .front) 
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not find a front facing camera."
    )
  }

  // 2
  guard 
    let deviceInput = try? AVCaptureDeviceInput(device: videoDevice)
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not create video device input."
    )
  }

  // 3
  let session = AVCaptureSession()
  session.beginConfiguration()
  session.sessionPreset = AVCaptureSession.Preset.high

  // 4
  guard session.canAddInput(deviceInput) else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video device input to the session"
    )
  }
  session.addInput(deviceInput)

  // 5
  let dataOutput = AVCaptureVideoDataOutput()
  if session.canAddOutput(dataOutput) {
    session.addOutput(dataOutput)
    dataOutput.alwaysDiscardsLateVideoFrames = true
    dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
  } else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video data output to the session"
    )
  }
  
  // 6
  session.commitConfiguration()
  cameraFeedSession = session
}

In the code above you:

  1. Check if the device has a front-facing camera. If it doesn’t, you throw an error.
  2. Next, check if you can use the camera to create a capture device input.
  3. Create a capture session and start configuring it using the high quality preset.
  4. Then check if the session can integrate the capture device input. If yes, add the input you created in step two to the session. You need an input and an output for your session to work.
  5. Next, create a video data output and add it to the session. The data output takes image sample buffers from the camera feed and delivers them to its sample buffer delegate on the dispatch queue you defined earlier.
  6. Finally, finish configuring the session and assign it to the property you created before.
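
By the way, AppError.captureSessionSetup(reason:) is defined in the starter project, so there’s nothing extra to write. For reference, an error type serving that purpose could be as simple as the sketch below; the starter’s actual definition may differ:

// A sketch of an error type like the starter project's AppError.
enum AppError: Error {
  case captureSessionSetup(reason: String)
}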

Build and run. Now you can see yourself behind the raining stars.

Starting page after setting up the camera viewfinder

Note: You need user permission to access the camera on a device. When you start a camera session for the first time, iOS prompts the user to grant access to the camera. You have to give the user a reason why you want the camera permission.

A key-value pair in Info.plist, NSCameraUsageDescription, stores the reason. It’s already there in the starter project.

With that in place, it’s time to move on to Vision.