Photo Stacking in iOS with Vision and Metal
In this tutorial, you’ll use Metal and the Vision framework to remove moving objects from pictures in iOS. You’ll learn how to stack, align and process multiple images so that any moving object disappears. By Yono Mittlefehldt.
Using Vision to Align Images
The Vision framework has two different APIs for aligning images: VNTranslationalImageRegistrationRequest and VNHomographicImageRegistrationRequest. The former is easier to use and, if you assume that the user of the app will hold the iPhone relatively still, it should be good enough.
To make your code more readable, you’ll create a new class to handle the alignment and eventual combining of the captured images.
Create a new, empty Swift File and name it ImageProcessor.swift.
Remove any provided import statements and add the following code:
import CoreImage
import Vision

class ImageProcessor {
  var frameBuffer: [CIImage] = []
  var alignedFrameBuffer: [CIImage] = []
  var completion: ((CIImage) -> Void)?
  var isProcessingFrames = false
  var frameCount: Int {
    return frameBuffer.count
  }
}
Here, you import the Core Image and Vision frameworks and define the ImageProcessor class along with some necessary properties:
- frameBuffer will store the original captured images.
- alignedFrameBuffer will contain the images after they have been aligned.
- completion is a handler that will be called after the images have been aligned and combined.
- isProcessingFrames will indicate whether images are currently being aligned and combined.
- frameCount is the number of images captured.
Next, add the following method to the ImageProcessor class:
func add(_ frame: CIImage) {
  if isProcessingFrames {
    return
  }
  frameBuffer.append(frame)
}
This method adds a captured frame to the frame buffer, but only if you’re currently not processing the frames in the frame buffer.
Still within the class, add the processing method:
func processFrames(completion: ((CIImage) -> Void)?) {
  // 1
  isProcessingFrames = true
  self.completion = completion
  // 2
  let firstFrame = frameBuffer.removeFirst()
  alignedFrameBuffer.append(firstFrame)
  // 3
  for frame in frameBuffer {
    // 4
    let request = VNTranslationalImageRegistrationRequest(targetedCIImage: frame)
    do {
      // 5
      let sequenceHandler = VNSequenceRequestHandler()
      // 6
      try sequenceHandler.perform([request], on: firstFrame)
    } catch {
      print(error.localizedDescription)
    }
    // 7
    alignImages(request: request, frame: frame)
  }
  // 8
  cleanup()
}
It seems like a lot of steps but this method is relatively straightforward. You will call this method after you’ve added all the captured frames. It will process each frame and align them using the Vision framework. Specifically, in this code, you:
1. Set the isProcessingFrames Boolean variable to prevent adding more frames. You also save the completion handler for later.
2. Remove the first frame from the frame buffer and add it to the frame buffer for aligned images. All other frames will be aligned to this one.
3. Loop through each frame in the frame buffer.
4. Use the frame to create a new Vision request to determine a simple translational alignment.
5. Create the sequence request handler, which will handle your alignment requests.
6. Perform the Vision request to align the frame to the first frame and catch any errors.
7. Call alignImages(request:frame:) with the request and the current frame. This method doesn't exist yet and you'll fix that soon.
8. Clean up. This method also still needs to be written.
Ready to tackle alignImages(request:frame:)?

Add the following code just below processFrames(completion:):
func alignImages(request: VNRequest, frame: CIImage) {
  // 1
  guard
    let results = request.results as? [VNImageTranslationAlignmentObservation],
    let result = results.first
  else {
    return
  }
  // 2
  let alignedFrame = frame.transformed(by: result.alignmentTransform)
  // 3
  alignedFrameBuffer.append(alignedFrame)
}
Here you:
1. Unwrap the first result from the alignment request you made within the for loop in processFrames(completion:).
2. Transform the frame using the affine transformation matrix calculated by the Vision framework.
3. Append this translated frame to the aligned frame buffer.
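If you want a feel for what the transform step does in isolation, here's a minimal sketch (the color and sizes are arbitrary values chosen for illustration) showing that transformed(by:) simply shifts a CIImage's extent by the transform's translation, which is exactly what happens to each aligned frame:

```swift
import CoreImage

// Arbitrary 100x100 solid-color image standing in for a captured frame.
let image = CIImage(color: CIColor.red)
  .cropped(to: CGRect(x: 0, y: 0, width: 100, height: 100))

// A translation like the one Vision computes as alignmentTransform.
let shift = CGAffineTransform(translationX: 5, y: -3)
let moved = image.transformed(by: shift)

print(moved.extent) // origin shifted by (5, -3), size unchanged
```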
These last two methods are the meat of the Vision code your app needs. You perform the requests and then use the results to modify the images. Now all that’s left is to clean up after yourself.
Add the following method to the end of the ImageProcessor class:
func cleanup() {
  frameBuffer = []
  alignedFrameBuffer = []
  isProcessingFrames = false
  completion = nil
}
In cleanup(), you simply clear out the two frame buffers, reset the flag to indicate that you're no longer processing frames and set the completion handler to nil.
Before you can build and run your app, you need to use the ImageProcessor in your CameraViewController.
Open CameraViewController.swift. At the top of the class, define the following property:
let imageProcessor = ImageProcessor()
Next, find captureOutput(_:didOutput:from:). You'll make two small changes to this method.
Add the following line just below the let image = ... line:
imageProcessor.add(image)
And below the call to stopRecording(), still within the if statement, add:
imageProcessor.processFrames(completion: displayCombinedImage)
Build and run your app and… nothing happens. No worries, Mr. Potter. You still need to combine all of these images into a single masterpiece. To see how to do that, you’ll have to read on!
Note: Try using ImageSaver in your ImageProcessor. This would allow you to save the aligned images to the Documents folder and see them in the Files app.

How Photo Stacking Works
There are several different ways to combine or stack images. By far the simplest method is to average the pixels at each location across all the images.
For instance, if you have 20 images to stack, you would average together the pixel at coordinate (13, 37) across all 20 images to get the mean pixel value for your stacked image at (13, 37).
If you do this for every pixel coordinate, your final image will be the average of all the images. The more images you have, the closer the average will be to the background pixel values. If something moves in front of the camera, it will only appear in the same spot in a couple of images, so it won't contribute much to the overall average. That's why moving objects disappear.
This is how you’ll implement your stacking logic.
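To make the idea concrete, here's a tiny sketch with made-up brightness values for a single pixel coordinate across five frames. The outlier is a bright moving object that passed through that pixel in just one frame:

```swift
// Hypothetical brightness values at one pixel coordinate, say (13, 37),
// across five captured frames. 0.95 is a moving object caught in one
// frame; ~0.20 is the static background.
let pixelValues: [Double] = [0.20, 0.21, 0.19, 0.95, 0.20]

// The stacked image's value at (13, 37) is just the mean.
let mean = pixelValues.reduce(0, +) / Double(pixelValues.count)
print(mean) // ≈ 0.35 — much closer to the ~0.20 background than to 0.95
```

With 20 frames instead of 5, the single 0.95 sample would barely register, which is why longer bursts erase moving objects more cleanly.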
Stacking Images
Now comes the really fun part! You’re going to combine all of these images into a single fantastic image. You’re going to create your own Core Image kernel using the Metal Shading Language (MSL).
Your simple kernel will calculate a weighted average of the pixel values for two images. When you average a bunch of images together, any moving objects should just disappear. The background pixels will appear more often and dominate the average pixel value.
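Because the kernel only ever sees two images at a time, it can't compute the mean of all frames in one go. One common way to get the same result pairwise (an assumption about the exact weighting, shown here with made-up values) is a running average: blend frame k into the current result with weight 1/k, which reproduces the plain mean of all frames:

```swift
// Sketch of the per-pixel arithmetic: fold each new frame into a running
// average with weight 1 / frameNumber. The final value equals the plain
// mean, so moving objects are averaged away two images at a time.
let frames: [Double] = [10, 12, 11, 50, 10] // 50 = a moving object
var average = 0.0
for (index, value) in frames.enumerated() {
  let weight = 1.0 / Double(index + 1)
  average = average * (1 - weight) + value * weight
}
print(average) // ≈ 18.6, the same as the plain mean (93 / 5)
```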