Is this video spatial, or not?

Photo of cardboard old school 3D glasses with one color per eye
Photo by rivage / Unsplash

We have this feature in a visionOS app that allows users to select videos from their library using the PhotosPicker from PhotoKit. Overall, its implementation is simple and effective, especially given everything that is going on behind the scenes.

A diagram showing the types of requests your app can make through PhotoKit, to access photos stored in the user's photo library

Once a video is selected, we did not want to restrict the type of video (and now that I think about it, I'm unsure how to even accomplish that), but that flexibility in selection means spatial videos need specialized treatment.

The process of using the picker involves creating a Transferable, managing URL permissions, and finally, playing the content. We expected to be able to identify the type of video early on, so I went straight to figuring out how to extract metadata during the importing part of the TransferRepresentation step. To do this, one can simply build an AVAsset from the received URL and load its video tracks:

let tracks = try await AVAsset(url: url).loadTracks(withMediaType: .video)

Then, each individual track can be queried for its format descriptions:

let descriptions = try await track.load(.formatDescriptions)

Each returned description is a CMFormatDescription, which comes with a set of functions to inspect its contents.

A screenshot of the Xcode list of functions when displaying the CMFormatDescription header and filtering for FormatGet. There are 27 results listed.
CoreMedia > CMFormatDescription
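The two loading steps above can be combined into one helper that hands back the raw extensions of every video format description, which is handy for printing and comparing files. This is only a minimal sketch: the function name `videoFormatExtensions(at:)` is hypothetical, and casting the CFDictionary to `[String: Any]` is a convenience I'm assuming is acceptable for inspection purposes.

```swift
import AVFoundation
import CoreMedia

/// Hypothetical helper (not part of any Apple API): collects the extensions
/// dictionary of every format description found in the asset's video tracks.
func videoFormatExtensions(at url: URL) async throws -> [[String: Any]] {
    let asset = AVURLAsset(url: url)
    // Only the track metadata is loaded here; no decoding session is started.
    let tracks = try await asset.loadTracks(withMediaType: .video)

    var result: [[String: Any]] = []
    for track in tracks {
        let descriptions = try await track.load(.formatDescriptions)
        for description in descriptions {
            // CMFormatDescriptionGetExtensions returns an optional CFDictionary;
            // a description without extensions is recorded as an empty dictionary.
            let extensions = CMFormatDescriptionGetExtensions(description) as? [String: Any]
            result.append(extensions ?? [:])
        }
    }
    return result
}
```

Printing the returned dictionaries for a few different files is a quick way to see which keys are present before deciding what to check.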

Now, which one to use? The one that gives the most certainty for our objective. After some digging, kVTDecompressionPropertyKey_RequestedMVHEVCVideoLayerIDs seemed to be just that, as it indicates that playing that particular file requires the MV (Multiview) component of the HEVC format, commonly used for stereoscopic video. Unfortunately, this value remains unavailable until buffering begins (at least on visionOS?), making it a no-go for the requirement to determine types in advance.

Another technique I've seen is to use CMFormatDescriptionGetExtensions and then check kCMFormatDescriptionExtension_HorizontalFieldOfView, which holds the horizontal field of view in thousandths of a degree. In theory, a nonzero value for this key indicates that the video is spatial.
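As a rough sketch of that check, the function below reads the field of view from an extensions dictionary and converts it to degrees. The function name is hypothetical, and the key is passed in as a parameter rather than hard-coded; on Apple platforms you would pass `kCMFormatDescriptionExtension_HorizontalFieldOfView as String`. Reading the value as an `Int` is an assumption about how the number bridges.

```swift
import Foundation

/// Hypothetical helper: returns the horizontal field of view in degrees,
/// or nil when the key is missing or its value is not a positive number.
/// The stored value is expressed in thousandths of a degree.
func fieldOfViewDegrees(in extensions: [String: Any], key: String) -> Double? {
    guard let thousandths = extensions[key] as? Int, thousandths > 0 else {
        return nil
    }
    return Double(thousandths) / 1000.0
}

/// With this technique, a nonzero field of view is treated as "spatial".
func looksSpatialByFieldOfView(_ extensions: [String: Any], key: String) -> Bool {
    fieldOfViewDegrees(in: extensions, key: key) != nil
}
```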

Some outstanding projects that use this technique:

To be honest, I believe this method should work in most cases, as it has clearly been battle-tested, but the absence of a straightforward API for identifying this trait still bothered me. Additionally, a superstitious part of me considers the possibility of videos with a wide horizontal FOV that are not stereoscopic.

So, if you go back and compare the results of CMFormatDescriptionGetExtensions for a regular video and then a spatial one, you can see some interesting differences. Among those, I believe the presence of HasLeftStereoEyeView and HasRightStereoEyeView in the video metadata could be a more reliable indicator of spatial (stereoscopic) videos.

💡
Note also StereoCameraBaseline and HorizontalDisparityAdjustment, and consider that all these properties could be combined for the check.
Screenshot of the kaleidoscope app comparing the console output for getting extensions for both spatial and not spatial videos
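One way to combine these properties is sketched below. The key strings are the names as they appear in the printed extensions; the `StereoSignals` type, the function name, and the assumption that the two eye-view flags bridge to `Bool` are mine. Requiring both eye views and treating the other two keys as supporting evidence only is a design choice for this sketch, not something the API prescribes.

```swift
import Foundation

/// Hypothetical summary of the stereo-related keys found in an
/// extensions dictionary.
struct StereoSignals: Equatable {
    var hasLeftEye = false
    var hasRightEye = false
    var hasCameraBaseline = false
    var hasDisparityAdjustment = false

    /// Both eye views must be flagged; the baseline and disparity keys
    /// are kept only as additional hints.
    var looksStereoscopic: Bool { hasLeftEye && hasRightEye }
}

func stereoSignals(in extensions: [String: Any]) -> StereoSignals {
    StereoSignals(
        // Assumption: the eye-view values are booleans.
        hasLeftEye: extensions["HasLeftStereoEyeView"] as? Bool ?? false,
        hasRightEye: extensions["HasRightStereoEyeView"] as? Bool ?? false,
        // Only the presence of these two keys is recorded, not their values.
        hasCameraBaseline: extensions["StereoCameraBaseline"] != nil,
        hasDisparityAdjustment: extensions["HorizontalDisparityAdjustment"] != nil
    )
}
```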

Current status

Using the original strategy of collecting metadata before initiating any decoding sessions has proven to be a good fit for us. It lets us inspect a video's properties efficiently and make decisions without overhead, which allows listing large numbers of selected videos and grouping them by type in a lightweight and fast fashion. Furthermore, this approach offers interesting flexibility, as it allows examining diverse properties such as codec type, dimensions, and other format-specific extensions without any additional setup. Finally, early decision-making simplifies the app's modeling by reducing the number of mutations required and the complexity of tracking model properties.
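To illustrate the modeling point, here is a small sketch in which the kind of video is decided once at import time and stored as an immutable property, so grouping a selection is a single call. `VideoKind`, `ImportedVideo`, and `groupByKind` are hypothetical names for illustration, not the types used in our app.

```swift
import Foundation

/// Hypothetical classification decided once, at import time.
enum VideoKind: Hashable {
    case spatial
    case monoscopic
}

/// Hypothetical model: `kind` is a constant, so nothing has to be mutated
/// later when playback starts.
struct ImportedVideo: Equatable {
    let url: URL
    let kind: VideoKind
}

/// Groups the selected videos by their kind.
func groupByKind(_ videos: [ImportedVideo]) -> [VideoKind: [ImportedVideo]] {
    Dictionary(grouping: videos, by: \.kind)
}
```

Because the classification only needs metadata, building these values for a large selection does not require opening a decoding session for any of the files.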

Identifying spatial videos versus monoscopic ones with this metadata technique has been consistent in our initial tests, but we have to be aware that, because of this API's limitations, we could run into edge cases. This therefore needs further testing… or we embrace uncertainty 🖖

GitHub - elkraneo/SpatialVideoOrNot: Metadata inspection approach for detecting stereoscopic videos

Reading multiview 3D video files | Apple Developer Documentation
Render single images for the left eye and right eye from a multiview High Efficiency Video Coding format file by reading individual video frames.
kVTDecompressionPropertyKey_RequestedMVHEVCVideoLayerIDs | Apple Developer Documentation
Requests multi-image decoding of specific MV-HEVC video layers.
CMFormatDescription | Apple Developer Documentation
An object that describes a media format descriptor.
Deliver video content for spatial experiences - WWDC23 - Videos - Apple Developer
Learn how to prepare and deliver video content for visionOS using HTTP Live Streaming (HLS). Discover the current HLS delivery process…
Embed the Photos Picker in your app - WWDC23 - Videos - Apple Developer
Discover how you can simply, safely, and securely access the Photos Library in your app. Learn how to get started with the embedded…