Saliency Analysis in iOS using Vision

In this tutorial, you’ll learn how to use the Vision framework in iOS to perform saliency analysis and use it to create an effect on a live video feed. By Yono Mittlefehldt.

Do you know what the creepy thing about robots is? You can never tell where they’re looking. They have no pupils to give them away. It’s like a person wearing sunglasses. Are they staring at you or something else?

Finally, Apple has said enough is enough and given us the technology to see what an iPhone thinks is interesting to look at. It’s called saliency analysis, and you too can harness its power!

In this tutorial, you’ll build an app that uses saliency analysis to filter the camera feed from an iOS device to create a spotlight effect around interesting objects.

Along the way, you’ll learn how to use the Vision framework to:

  • Create requests to perform saliency analysis.
  • Use the observations returned to generate a heat map.
  • Filter a video stream using the heat maps as an input.
Note: As this tutorial uses the camera and APIs introduced in iOS 13, you’ll need a minimum of Xcode 11 and a device running iOS 13.0 or later. You can’t use the simulator for this because you need a live feed of video from a physical camera.

Get ready to see the world through your iPhone’s eyes!

Getting Started

Click the Download Materials button at the top or bottom of this tutorial. Open the starter project and explore the code to get a feel for how it works.

The starter project sets up the camera and displays its output to the screen, unmodified. Additionally, there’s a label at the top of the screen that describes the screen output. Initially, Original is displayed, as the camera feed is unaltered.

Tapping on the screen changes the label to Heat Map. But nothing in the camera feed changes.
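
Under the hood, the starter project tracks the current output with a mode property, which you’ll use later in this tutorial. A hypothetical sketch of that kind of toggle looks like the following; the starter project already defines its own version, so the names here are only guesses and you don’t need to add anything:

// A rough sketch of the kind of toggle CameraViewController already contains.
// The real names may differ; don't add this yourself.
enum ViewMode {
  case original
  case heatMap
}

var mode: ViewMode = .original

@objc func handleTap() {
  // Flip between the unaltered feed and the heat map view.
  mode = (mode == .original) ? .heatMap : .original
}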

You’ll fix that shortly. First though, what is saliency analysis?

Saliency Analysis

Saliency analysis uses algorithms to determine what’s interesting or important to humans in an image. Essentially, it works out what it is about an image that catches someone’s eye.

Once you’ve picked out the important areas in a photo, you could use that information to automate cropping or to apply filter effects that highlight them.

If you perform saliency analysis in real time on a video feed, you could also use the information to help focus on the key areas.
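
As a concrete example of the cropping idea, the saliency observations you’ll get back from Vision (you’ll meet them properly below) expose normalized bounding boxes for the salient regions through a salientObjects property. The function below is only a sketch, assuming Vision and Core Image are imported and that you already have a CIImage and a saliency observation to work with; it isn’t part of the app you’ll build in this tutorial:

// A rough sketch of saliency-driven cropping; not part of this tutorial's app.
func cropToSalientRegion(
  of image: CIImage,
  using observation: VNSaliencyImageObservation
) -> CIImage {
  // salientObjects holds normalized bounding boxes for the salient regions.
  guard let salientRect = observation.salientObjects?.first?.boundingBox else {
    return image
  }

  // Convert Vision's normalized rectangle (0 to 1) into pixel coordinates.
  let cropRect = VNImageRectForNormalizedRect(
    salientRect,
    Int(image.extent.width),
    Int(image.extent.height))

  return image.cropped(to: cropRect)
}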

The Vision framework provided by Apple has two different types of saliency analysis: attention-based and object-based.

Attention-based saliency tries to determine what areas a person might look at. Object-based saliency, on the other hand, seeks to highlight entire objects of interest. Although related, the two are quite different.
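
In code, the two flavors are simply two different request classes, and both hand you back the same kind of observation. Here’s a quick sketch; handleSaliency is a placeholder for the completion handler you’ll write later in this tutorial:

// Attention-based: where would a person's eyes land first?
let attentionRequest = VNGenerateAttentionBasedSaliencyImageRequest(
  completionHandler: handleSaliency)

// Object-based: which whole objects stand out from their surroundings?
let objectnessRequest = VNGenerateObjectnessBasedSaliencyImageRequest(
  completionHandler: handleSaliency)

Because both requests produce VNSaliencyImageObservation results, switching from one analysis to the other later usually comes down to changing a single class name.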

Roll up your sleeves and crack your knuckles. It’s time to code. :]

Attention-Based Heat Maps

Both Vision APIs used for saliency analysis return a heat map. There are a variety of ways to visualize heat maps; those returned by Vision requests are grayscale. Additionally, the heat map is defined on a much coarser grid than the photo or video feed. According to Apple’s documentation, you’ll get back either a 64 x 64 or a 68 x 68 pixel heat map, depending on whether or not you make the API calls in real time.

Note: Although the documentation says to expect a 64 x 64 pixel heat map when calling the APIs in real time, the code used in this tutorial to perform the Vision requests on a video feed still resulted in an 80 x 68 pixel heat map. The functions that return the width and height of a CVPixelBuffer reported 68 x 68, but the data contained in the CVPixelBuffer was actually 80 x 68. This may be a bug and could change in the future.
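
If you’d like to check those dimensions on your own device, a few lines like these, dropped into the completion handler you’ll write shortly (observation stands for any VNSaliencyImageObservation), will print what the buffer reports:

// Print the heat map's reported dimensions. In this tutorial's testing, the
// width and height came back as 68 x 68 even though the buffer actually
// contained 80 x 68 worth of data.
let heatMap = observation.pixelBuffer
let width = CVPixelBufferGetWidth(heatMap)
let height = CVPixelBufferGetHeight(heatMap)
print("Heat map reported size: \(width) x \(height)")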

If you’ve never used the Vision framework, check out our Face Detection Tutorial Using the Vision Framework for iOS for some information about how Vision requests work.

In CameraViewController.swift, add the following code to the end of captureOutput(_:didOutput:from:):

// 1
let req = VNGenerateAttentionBasedSaliencyImageRequest(
  completionHandler: handleSaliency)
    
do {
  // 2
  try sequenceHandler.perform(
    [req],
    on: imageBuffer,
    orientation: .up)
    
} catch {
  // 3
  print(error.localizedDescription)
}

With this code, you:

  1. Generate an attention-based saliency Vision request.
  2. Use the VNSequenceRequestHandler to perform the request on the CVImageBuffer created at the beginning of the method.
  3. Catch and print the error, if there was one.

There! Your first step toward understanding your robotic iPhone!
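
By the way, sequenceHandler isn’t something you need to create; it’s already a property of CameraViewController in the starter project. A minimal equivalent declaration, shown here purely for context and as a guess at the starter code, would be:

// Roughly what the starter project already declares; don't add this yourself.
let sequenceHandler = VNSequenceRequestHandler()

A VNSequenceRequestHandler is built for performing requests on a sequence of related images, such as the frames of a video feed, which is exactly what captureOutput(_:didOutput:from:) hands you.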

You’ll notice that Xcode is unhappy and doesn’t seem to know what handleSaliency is. Even though Apple has made great strides in computer vision, it still hasn’t found a way to make Xcode write your code for you.

You’ll need to write handleSaliency, which will take a completed Vision request and do something useful with the result.

At the end of the same file, add a new extension to house your Vision-related methods:

extension CameraViewController {
}

Then, in this extension, add the handleSaliency completion handler you passed to the Vision request:

func handleSaliency(request: VNRequest, error: Error?) {
  // 1
  guard
    let results = request.results as? [VNSaliencyImageObservation],
    let result = results.first
    else { return }

  // 2
  guard let targetExtent = currentFrame?.extent else {
    return
  }
  
  // 3
  var ciImage = CIImage(cvImageBuffer: result.pixelBuffer)

  // 4
  let heatmapExtent = ciImage.extent
  let scaleX = targetExtent.width / heatmapExtent.width
  let scaleY = targetExtent.height / heatmapExtent.height

  // 5
  ciImage = ciImage
    .transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))
      
  // 6
  showHeatMap(with: ciImage)
}

Here, you:

  1. Ensure that the results are VNSaliencyImageObservation objects and extract the first result from the array of observations returned by the Vision request.
  2. Grab the image extent from the current frame. This is essentially the size of the current frame.
  3. Create a CIImage from the CVPixelBuffer that represents the heat map.
  4. Calculate the scale factors between the current frame and the heat map. Remember, the heat map is only about 80 x 68 pixels, far smaller than the frame.
  5. Scale the heat map to the current frame’s size.
  6. Display the heat map using showHeatMap.

Now, just above handleSaliency(request:error:), add the handy helper method that will display the heat map:

func showHeatMap(with heatMap: CIImage) {
  // 1
  guard let frame = currentFrame else {
    return
  }
  
  let yellowHeatMap = heatMap
    // 2
    .applyingFilter("CIColorMatrix", parameters:
      ["inputBVector": CIVector(x: 0, y: 0, z: 0, w: 0),
       "inputAVector": CIVector(x: 0, y: 0, z: 0, w: 0.7)])
    // 3
    .composited(over: frame)

  // 4
  display(frame: yellowHeatMap)
}

In this method, you:

  1. Unwrap the currentFrame optional.
  2. Apply the CIColorMatrix Core Image filter to the heat map. This zeroes out the blue component of every pixel and multiplies its alpha component by 0.7. Because the heat map is grayscale, red and green are equal in every pixel, so dropping blue leaves a yellow heat map that’s partly transparent. The sketch after this list spells out exactly what the filter is doing.
  3. Add the yellow heat map on top of the original frame.
  4. Use the provided helper method to display the resulting image.
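
To make the CIColorMatrix call a little less magical, here’s the same filter with every input vector written out. The red and green rows are simply the filter’s defaults, so this sketch behaves identically to the code above and isn’t something you need to add:

// Core Image computes: output = r * RVector + g * GVector + b * BVector + a * AVector
// (plus a bias vector, which defaults to zero).
let yellowTint = heatMap.applyingFilter("CIColorMatrix", parameters: [
  "inputRVector": CIVector(x: 1, y: 0, z: 0, w: 0),  // red passes through (default)
  "inputGVector": CIVector(x: 0, y: 1, z: 0, w: 0),  // green passes through (default)
  "inputBVector": CIVector(x: 0, y: 0, z: 0, w: 0),  // blue is zeroed out
  "inputAVector": CIVector(x: 0, y: 0, z: 0, w: 0.7) // alpha is scaled by 0.7
])

In other words, each pixel (r, g, b, a) comes out as (r, g, 0, 0.7a).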

Before you build and run the app, you’ll need to make one final change.

Go back to captureOutput(_:didOutput:from:) and replace the following line of code:

display(frame: currentFrame)

with:

if mode == .original {
  display(frame: currentFrame)
  return
}

This code ensures that you only show the unfiltered frame when you’re in the Original mode. It also returns from the method, so you don’t waste precious computing cycles (and battery!) doing any Vision requests. :]

All right, it’s time! Build and run the app and then tap to put it into Heat Map mode.