Photo Stacking in iOS with Vision and Metal

In this tutorial, you’ll use Metal and the Vision framework to remove moving objects from pictures in iOS. You’ll learn how to stack, align and process multiple images so that any moving object disappears. By Yono Mittlefehldt.


Using Vision to Align Images

The Vision framework has two different APIs for aligning images: VNTranslationalImageRegistrationRequest and VNHomographicImageRegistrationRequest. The former is easier to use and, if you assume that the user of the app will hold the iPhone relatively still, it should be good enough.

Note: If you’ve never worked with the Vision framework, check out Face Detection Tutorial Using the Vision Framework for iOS for some information about how Vision requests work.
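To get a feel for the difference between the two requests, here's a minimal sketch. It's illustrative only: the function name and the floating/reference parameters aren't part of the project.

import CoreImage
import Vision

// Sketch: align `floating` to `reference` and return the shift Vision found.
func translationAlignment(of floating: CIImage,
                          to reference: CIImage) throws -> CGAffineTransform? {
  // The translational request's observation carries a simple
  // CGAffineTransform (an x/y shift).
  let request = VNTranslationalImageRegistrationRequest(targetedCIImage: floating)
  try VNSequenceRequestHandler().perform([request], on: reference)
  let observation = request.results?.first
    as? VNImageTranslationAlignmentObservation
  return observation?.alignmentTransform
}

// The homographic variant is created the same way, via
// VNHomographicImageRegistrationRequest(targetedCIImage:), but its
// observation (VNImageHomographicAlignmentObservation) carries a full
// 3x3 warpTransform that can also model rotation and perspective tilt.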

To make your code more readable, you’ll create a new class to handle the alignment and eventual combining of the captured images.

Create a new, empty Swift File and name it ImageProcessor.swift.

Remove any provided import statements and add the following code:

import CoreImage
import Vision

class ImageProcessor {
  var frameBuffer: [CIImage] = []
  var alignedFrameBuffer: [CIImage] = []
  var completion: ((CIImage) -> Void)?
  var isProcessingFrames = false

  var frameCount: Int {
    return frameBuffer.count
  }
}

Here, you import the Core Image and Vision frameworks and define the ImageProcessor class along with some necessary properties:

  • frameBuffer will store the original captured images.
  • alignedFrameBuffer will contain the images after they have been aligned.
  • completion is a handler that will be called after the images have been aligned and combined.
  • isProcessingFrames will indicate whether images are currently being aligned and combined.
  • frameCount is the number of images captured.

Next, add the following method to the ImageProcessor class:

func add(_ frame: CIImage) {
  if isProcessingFrames {
    return
  }
  frameBuffer.append(frame)
}

This method adds a captured frame to the frame buffer, but only if you're not currently processing the frames already in the buffer.

Still within the class, add the processing method:

func processFrames(completion: ((CIImage) -> Void)?) {
  // 1
  isProcessingFrames = true  
  self.completion = completion
  // 2
  let firstFrame = frameBuffer.removeFirst()
  alignedFrameBuffer.append(firstFrame)
  // 3
  for frame in frameBuffer {
    // 4
    let request = VNTranslationalImageRegistrationRequest(targetedCIImage: frame)

    do {
      // 5      
      let sequenceHandler = VNSequenceRequestHandler()
      // 6
      try sequenceHandler.perform([request], on: firstFrame)
    } catch {
      print(error.localizedDescription)
    }
    // 7
    alignImages(request: request, frame: frame)
  }
  // 8
  cleanup()
}

It seems like a lot of steps, but this method is relatively straightforward. You'll call it after you've added all the captured frames. It processes each frame and aligns it to the first using the Vision framework. Specifically, in this code, you:

  1. Set the isProcessingFrames Boolean variable to prevent adding more frames. You also save the completion handler for later.
  2. Remove the first frame from the frame buffer and add it to the frame buffer for aligned images. All other frames will be aligned to this one.
  3. Loop through each frame in the frame buffer.
  4. Use the frame to create a new Vision request to determine a simple translational alignment.
  5. Create the sequence request handler, which will handle your alignment requests.
  6. Perform the Vision request to align the frame to the first frame and catch any errors.
  7. Call alignImages(request:frame:) with the request and the current frame. This method doesn’t exist yet and you’ll fix that soon.
  8. Clean up. This method also still needs to be written.

Ready to tackle alignImages(request:frame:)?

Add the following code just below processFrames(completion:):

func alignImages(request: VNRequest, frame: CIImage) {
  // 1
  guard
    let results = request.results as? [VNImageTranslationAlignmentObservation],
    let result = results.first
  else {
    return
  }
  // 2
  let alignedFrame = frame.transformed(by: result.alignmentTransform)
  // 3
  alignedFrameBuffer.append(alignedFrame)
}

Here you:

  1. Unwrap the first result from the alignment request you made within the for loop in processFrames(completion:).
  2. Transform the frame using the affine transformation matrix calculated by the Vision framework.
  3. Append this translated frame to the aligned frame buffer.

These last two methods are the meat of the Vision code your app needs. You perform the requests and then use the results to modify the images. Now all that’s left is to clean up after yourself.

Add the following method to the end of the ImageProcessor class:

func cleanup() {
  frameBuffer = []
  alignedFrameBuffer = []
  isProcessingFrames = false
  completion = nil
}

In cleanup(), you simply clear out the two frame buffers, reset the flag to indicate that you're no longer processing frames, and set the completion handler to nil.

Before you can build and run your app, you need to use the ImageProcessor in your CameraViewController.

Open CameraViewController.swift. At the top of the class, define the following property:

let imageProcessor = ImageProcessor()

Next, find captureOutput(_:didOutput:from:). You’ll make two small changes to this method.

Add the following line just below the let image = ... line:

imageProcessor.add(image)

And below the call to stopRecording(), still within the if statement, add:

imageProcessor.processFrames(completion: displayCombinedImage)
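For orientation, here's roughly how captureOutput(_:didOutput:from:) reads with both additions in place. This is a sketch, not your exact method: stopRecording() and displayCombinedImage come from earlier in the tutorial, and shouldStopRecording stands in for whatever stop condition your project already uses.

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
  guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
    return
  }
  let image = CIImage(cvPixelBuffer: imageBuffer)

  // First addition: buffer every captured frame for later alignment.
  imageProcessor.add(image)

  if shouldStopRecording {  // placeholder for your existing condition
    stopRecording()
    // Second addition: align and combine the buffered frames,
    // then hand the result to displayCombinedImage.
    imageProcessor.processFrames(completion: displayCombinedImage)
  }
}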

Build and run your app and… nothing happens. No worries, Mr. Potter. You still need to combine all of these images into a single masterpiece. To see how to do that, you’ll have to read on!

Note: If you want to see how your aligned images compare to the original captures, you could instantiate an ImageSaver in your ImageProcessor. This would allow you to save the aligned images to the Documents folder and see them in the Files app.
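If you try that experiment, a bare-bones saving helper could look like the sketch below. It's illustrative, not necessarily the tutorial's ImageSaver API; it just renders a CIImage to a JPEG in the app's Documents folder.

import CoreImage

// Illustrative sketch, not the tutorial's ImageSaver API.
func saveToDocuments(_ image: CIImage, named name: String) {
  let url = FileManager.default
    .urls(for: .documentDirectory, in: .userDomainMask)[0]
    .appendingPathComponent("\(name).jpg")
  guard let colorSpace = CGColorSpace(name: CGColorSpace.sRGB) else {
    return
  }
  do {
    // CIContext can write a JPEG representation directly to disk.
    try CIContext().writeJPEGRepresentation(of: image, to: url,
                                            colorSpace: colorSpace)
  } catch {
    print(error.localizedDescription)
  }
}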

How Photo Stacking Works

There are several ways to combine, or stack, images. By far the simplest is to average the pixel values at each location across all the images.

For instance, if you have 20 images to stack, you would average the values of the pixel at coordinate (13, 37) across all 20 images to get the mean pixel value for your stacked image at (13, 37).

Pixel stacking

If you do this for every pixel coordinate, your final image will be the average of all the images. The more images you have, the closer the average will be to the background pixel values. If something moves in front of the camera, it will appear at any given spot in only a few images, so it won't contribute much to the overall average. That's why moving objects disappear.
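To make that concrete, here's the idea on plain numbers. Suppose these are the red-channel values of the pixel at (13, 37) across five frames, and one frame caught a passerby:

// Red channel at (13, 37) across five frames. Four frames see the
// background (around 200); one frame catches a moving object (40).
let redValues: [Double] = [200, 198, 40, 201, 199]
let average = redValues.reduce(0, +) / Double(redValues.count)
// average == 167.6

One outlier still tugs at the mean, but its influence shrinks as the stack grows: with 20 frames instead of five, the same passerby would shift the average by only about 8 instead of 32.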

This is how you’ll implement your stacking logic.

Stacking Images

Now comes the really fun part! You're going to combine all of these images into a single fantastic image by writing your own Core Image kernel using the Metal Shading Language (MSL).

Your simple kernel will calculate a weighted average of the pixel values for two images. When you average many images together, any moving objects should simply disappear: the background pixels appear more often and dominate the average pixel value.
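Why a weighted average of just two images at a time? Because you can fold the frames into a running average one by one: blending frame k + 1 in with weight 1 / (k + 1) produces the true mean of all frames by the time the last one is folded in. Here's that recurrence on the same toy numbers; the real kernel applies it per pixel, and this sketch just verifies the math:

let frames: [Double] = [200, 198, 40, 201, 199]  // one pixel across frames

var runningAverage = frames[0]
for (k, frame) in frames.enumerated().dropFirst() {
  // Weighted average of two values: the running average so far and
  // the incoming frame, which gets weight 1 / (k + 1).
  let weight = 1.0 / Double(k + 1)
  runningAverage = runningAverage * (1 - weight) + frame * weight
}
// runningAverage == 167.6, the same as averaging all frames at once.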