Moving Towards Next Level Immersion with Mixed Reality
Over the last few years, we have watched head-worn virtual and augmented reality devices move from research laboratories to store shelves as the newest platform enabling novel virtual experiences. The big promise of emerging mixed reality (MR) technology is a far deeper immersion into imaginary worlds than is possible with the flat 2D screens we currently use to consume movies, TV shows and computer games. Increased immersion comes from the fact that with MR, viewers feel completely surrounded by the alternative reality; they are not merely looking at a flat 2D projection on a screen from a single fixed viewpoint. To create this increased level of immersion, MR content needs to look and behave realistically, as if the user is part of the virtual environment, even though that environment only really exists as digital information in a computer’s memory.
In this white paper, Innovation Partners introduces key terms and technical aspects associated with free viewpoint control for MR content, describes the limitations that are hindering production, distribution and consumption of immersive MR content, and finally presents solutions that address these shortcomings in novel ways.
Virtual content for which the viewer is in a passive role, observing a pre-recorded story similar to a
movie, is called ‘Cinematic VR’. Even in these cases, where only minimal interaction between
the viewer and the virtual world is needed, the minimum requirement for realistic behavior in the
virtual world is that users can freely control their viewpoint simply by moving their head.
However, even this low level of interaction between the viewer and virtual content has proven
difficult to achieve with photorealistic visual quality using current-generation MR devices. Solving this
problem will unlock the full potential of MR and deliver next level immersion to consumers.
In the following chapters, we introduce key terms and technical aspects associated with free
viewpoint control for MR content. We then describe the limitations that are hindering production,
distribution and consumption of immersive MR content and, finally, present solutions that
address these shortcomings in novel ways.
Motion parallax and degrees of freedom
When observing our surroundings in everyday life, our brains construct a spatial map that spans three
dimensions. In order to do so, our brains have learned to identify a number of depth cues from the
visual stimuli picked up by the receptors in our eyes. Depth cues are aspects of visual information that
provide hints about the distance of objects from the viewer.
An ideal virtual experience would recreate all depth cues accurately for the observer. In the fields
of cinematography and photography, stereoscopic capture and viewing have been experimented
with almost since the invention of photography itself. In stereoscopic imaging, separate
images are recorded and displayed for each eye, allowing the viewer to observe binocular depth cues
such as stereopsis. But despite a big push from the industry a few years back, 3D content featuring
stereoscopic views has failed to catch on. A major reason is that stereopsis is just one of a number
of depth cues that our visual perception system uses to enable three-dimensional perception, and
adding it by itself fails to enhance the overall immersion significantly.
Many monocular depth cues, such as perspective, texture gradients and realistic lighting and shadows,
can be captured and produced with today’s computer graphics, but motion parallax, which has been
considered an even stronger depth cue than stereopsis, is missing from traditional formats of digital
media.
Motion parallax is the visual effect of objects that are closer to the viewer moving larger distances
on the image plane than objects that are further away, when the viewpoint is shifted sideways. One
often-used example is a landscape observed through the window of a moving train. Trees and grass
closer to the train move faster through the field of view than the buildings further away, while clouds even further away seem
to remain static in the background. In the same way, motion parallax provides strong depth cues to our brains when we move
our head even slightly, as we observe objects at close range.
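The train example can be made concrete with a simple pinhole-camera model. The following is a minimal sketch for illustration only: the function name `image_shift`, the focal length and the scene distances are assumptions chosen for this example, not values taken from any particular device.

```python
def image_shift(depth_m: float, viewpoint_shift_m: float, focal_px: float = 1000.0) -> float:
    """Magnitude of the shift on the image plane, in pixels, of a point at the
    given depth when a pinhole camera translates sideways.

    A point at lateral offset X and depth Z projects to x = focal_px * X / Z,
    so a sideways camera move of t changes x by focal_px * t / Z.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * viewpoint_shift_m / depth_m


# Hypothetical scene observed during a 10 cm sideways head movement.
for name, depth in [("grass", 2.0), ("building", 50.0), ("cloud", 5000.0)]:
    print(f"{name:8s} at {depth:7.1f} m shifts {image_shift(depth, 0.10):8.3f} px")
```

Because the shift is inversely proportional to depth, nearby elements sweep across the image while very distant ones barely move, which is exactly the cue the brain uses.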
With emerging MR devices embedded with positional tracking, it is possible to produce virtual experiences that allow viewers
to freely control viewpoint with their head motions. When rendering of the 3D view is done based on head motion, motion
parallax depth cues can be correctly produced. When motion parallax is accurately synchronized with head motion, it creates a
strong illusion of virtual content really existing in the same physical space with the viewer.
In order to enable the effect of motion parallax, the visual information rendered for the viewer needs to convey
the distance of each element in the image from the viewer. An element’s distance determines how far it needs to
translate when the viewpoint shifts. In addition to depth information, visual information is needed for all areas of the image,
including those that are occluded by closer objects but can become visible once the viewpoint is shifted. Traditional 2D media
loses both depth information and visual information for occluded areas when the scene is recorded with a 2D camera from a
single viewpoint. Only 3D content formats that provide both depth information and visual information for occluded areas can
enable recreation of motion parallax.
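Both requirements can be illustrated with a one-dimensional toy example. The sketch below forward-warps a single image row to a shifted viewpoint using per-pixel depth. The function name `reproject_row`, the focal length and the sample row are assumptions made for this illustration; practical view synthesis is considerably more involved.

```python
def reproject_row(colors, depths, shift_m, focal_px=8.0):
    """Forward-warp one image row to a viewpoint shifted sideways by shift_m.

    Each pixel moves by focal_px * shift_m / depth columns, so near pixels
    move further than far ones. Where two pixels land on the same column the
    nearer one wins. Columns that receive no pixel stay None: these are areas
    the original single-viewpoint image never recorded.
    """
    width = len(colors)
    out = [None] * width
    out_depth = [float("inf")] * width
    for x, (color, depth) in enumerate(zip(colors, depths)):
        new_x = x - round(focal_px * shift_m / depth)
        if 0 <= new_x < width and depth < out_depth[new_x]:
            out[new_x] = color
            out_depth[new_x] = depth
    return out


# Hypothetical row: background 'B' at 4 m with a foreground object 'F' at 1 m.
row = list("BBBBBBFFBB")
depth = [4, 4, 4, 4, 4, 4, 1, 1, 4, 4]
print(reproject_row(row, depth, shift_m=0.5))
```

In the output, the foreground object has moved four columns while the background has moved one, and the columns left as `None` mark the background that was hidden behind the object (plus one column at the frame edge). Depth tells the renderer how far to move each element; the holes show why information about occluded areas is also needed.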
Among existing 3D formats, fully synthetic 3D scenes created by artists
through tedious 3D modelling and animation processes contain
complete depth and visual information, because the data is inherently defined
in all three dimensions. Other 3D formats that are used to capture a
scene in a way that enables motion parallax, such as light-field video,
contain multiple views of the scene, which can be used to infer
depth information and to provide views of occluded areas from various
viewpoints, thus enabling motion parallax for at least a limited area.
Some other formats, such as stereoscopic video with just two separate
views, may have too little information to enable recreation of motion
parallax with good quality, as they lose much of the depth information
and visual information for occluded areas.
With the rise of consumer MR devices, a number of new terms have
become more widely used to describe various
aspects of the immersive characteristics of virtual experiences. A key
term, representing how much freedom the viewer has to navigate within
the content, is degrees of freedom (DoF).
DoF denotes the number of independent axes along or around which motion can take place. In the case of MR, some low-end devices, such
as mobile devices used as VR displays, have sensors that can only register rotations of the device. As the rotation of the device
is not constrained, the device can rotate around all three axes of physical space, thus enabling 3DoF navigation. When the
position of the device is tracked in addition to its rotation, free translation along each of the three axes of physical
space provides an additional three degrees of freedom. Translation and rotation together yield 6DoF, which gives full
freedom of navigation.
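The difference between the two tracking levels can be expressed as a pair of data structures. This is a minimal sketch; the class and field names (`Pose3DoF`, `Pose6DoF`, `yaw_deg` and so on) are hypothetical and do not correspond to any specific device API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Pose3DoF:
    """Rotation only: what a rotation-sensing device can report."""
    yaw_deg: float
    pitch_deg: float
    roll_deg: float


@dataclass(frozen=True)
class Pose6DoF(Pose3DoF):
    """Rotation plus translation along the three axes of physical space."""
    x_m: float = 0.0
    y_m: float = 0.0
    z_m: float = 0.0


def drop_translation(pose: Pose6DoF) -> Pose3DoF:
    """What remains of a full pose on a device without positional tracking."""
    return Pose3DoF(pose.yaw_deg, pose.pitch_deg, pose.roll_deg)


head = Pose6DoF(yaw_deg=30.0, pitch_deg=-5.0, roll_deg=0.0, x_m=0.1, y_m=0.0, z_m=-0.05)
print(len(fields(Pose3DoF)), len(fields(Pose6DoF)))
print(drop_translation(head))
```

A 3DoF device discards the three translation components, which are precisely the ones that produce motion parallax.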
Challenges with freedom of navigation
The illusion of virtual content surrounding the user comes from virtual elements having consistent spatial anchoring so that
when the user moves his or her head, the elements react to the motion naturally, as if they were present in the same space.
This requires that user motion is known, i.e., tracked, and that the viewpoint from which the virtual content is being drawn can
be changed according to the tracked motion.
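As a rough illustration of spatial anchoring, the sketch below expresses a world-anchored point in the viewer’s coordinate frame for a tracked head pose. It handles yaw only to stay short, and the coordinate convention (y up, viewer looking along negative z, positive yaw turning left) and the name `world_to_view` are assumptions of this example.

```python
import math


def world_to_view(point, head_pos, head_yaw_deg):
    """Express a world-anchored point in the viewer's frame.

    The tracked head position is subtracted first, then the inverse of the
    head's yaw rotation about the vertical (y) axis is applied.
    """
    px, py, pz = (p - h for p, h in zip(point, head_pos))
    a = math.radians(-head_yaw_deg)  # inverse rotation
    vx = px * math.cos(a) + pz * math.sin(a)
    vz = -px * math.sin(a) + pz * math.cos(a)
    return (vx, py, vz)


anchor = (0.0, 0.0, -2.0)  # a virtual object 2 m in front of the starting position

# The head moves 10 cm to the right: the object appears 10 cm to the left.
print(world_to_view(anchor, head_pos=(0.1, 0.0, 0.0), head_yaw_deg=0.0))
```

The virtual element itself never moves; only its position relative to the tracked viewpoint changes, which is what makes it appear fixed in the room.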
Traditionally, tracking has been seen as the main challenge preventing full 6DoF navigation. However,
thanks to recent advances in tracking methods and sensor technology integrated within MR platforms,
tracking can largely be considered a solved problem. The problems hindering virtual experiences are
now more pressing on the content side. Technologies for capturing, distributing and displaying
alternate realities in a format that provides not only photorealistic scene information but also enough
depth and “extra” content to deal with depth cues and occluded scene content, thus enabling full
6DoF navigation, are primarily limited by the content itself.
At the moment, only the real-time 3D rendering widely used in computer games enables full 6DoF
navigation for MR. However, even with the rapid development of real-time 3D computer graphics over the
last several decades, real-time rendering has not yet been able to deliver fully photorealistic
virtual experiences, especially not on consumer MR devices with limited graphics processing
performance. For high quality experiences, only off-line rendering and off-line reality capture can
truly deliver. One downside of both off-line rendering and off-line reality capture is that the amount of
data required to enable free navigation exceeds all reasonable means of content delivery. For example,
a limited light-field format that includes data from relatively small camera arrays requires several
dozen simultaneous high-resolution video streams instead of the one required by normal 2D video.
A second downside of off-line rendering is the difficulty of supporting low-latency operation.
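A back-of-the-envelope calculation shows the scale of the delivery problem. The numbers below are purely illustrative assumptions: 36 streams stands in for "several dozen", and 25 Mbit/s is a hypothetical bitrate for one high-resolution stream.

```python
def total_bitrate_mbps(num_streams: int, per_stream_mbps: float) -> float:
    """Aggregate bitrate when every stream must be delivered simultaneously."""
    return num_streams * per_stream_mbps


PER_STREAM_MBPS = 25.0   # hypothetical bitrate of one high-resolution stream
LIGHT_FIELD_STREAMS = 36  # hypothetical small camera array ("several dozen")

single = total_bitrate_mbps(1, PER_STREAM_MBPS)
light_field = total_bitrate_mbps(LIGHT_FIELD_STREAMS, PER_STREAM_MBPS)
print(f"2D video:    {single:6.0f} Mbit/s")
print(f"light field: {light_field:6.0f} Mbit/s ({light_field / single:.0f}x)")
```

Under these assumptions the light-field content needs 900 Mbit/s, against 25 Mbit/s for the 2D version of the same scene.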
Tackling these limitations and enabling the motion parallax needed for 6DoF MR experiences will
require novel solutions. Recently, some solutions have been
emerging. Many of them restrict free navigation from full 6DoF to a smaller
area, one that usually only allows free head motion while the viewer’s position is expected to
remain stationary. This limitation permits cutting some corners in content optimization. The approach
of enabling motion parallax for only a small area is referred to as 3DoF+: it allows
freedom of navigation that is more than 3DoF, but not full 6DoF.
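One simple way to picture the 3DoF+ restriction is as a small viewing volume around the expected head position. The sketch below clamps a tracked head offset to a sphere; the spherical shape, the 0.3 m default radius and the name `clamp_to_viewing_volume` are assumptions of this example, not properties of any particular 3DoF+ solution.

```python
import math


def clamp_to_viewing_volume(offset, radius_m=0.3):
    """Limit a tracked head offset (x, y, z in metres) to a sphere of the
    given radius centred on the content's nominal viewing position.

    Offsets inside the sphere pass through unchanged; offsets outside are
    pulled back to the surface along the same direction.
    """
    x, y, z = offset
    dist = math.sqrt(x * x + y * y + z * z)
    if dist <= radius_m:
        return (x, y, z)
    scale = radius_m / dist
    return (x * scale, y * scale, z * scale)


print(clamp_to_viewing_volume((0.1, 0.0, 0.05)))  # leaning slightly: unchanged
print(clamp_to_viewing_volume((1.5, 0.0, 2.0)))   # walking away: clamped
```

Rotation remains unrestricted; only translation is bounded, so content needs to be prepared only for viewpoints inside the volume.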
In the following chapter, we introduce examples of solutions that enable increased levels of
immersion for MR using such 3DoF+ approaches.
Achieving motion parallax on current-generation MR devices requires a radical reduction in the
amount of transmitted data. For pre-rendered and captured content, there are no obvious ways to
achieve the level of data reduction required, so innovative approaches are needed.
As outlined above, the challenges are significant, and the industry has only just started to realize that
novel solutions need to be developed quickly. As consumer adoption of MR technology has already begun,
early experiences are impacted by a lack of realism due to the limitations of currently available technology. Virtual
content navigation limited to 3DoF does not deliver a deep enough level of immersion and thus falls
short of the realism claimed by manufacturers, and therefore expected by consumers.
Using a 3DoF+ approach, where freedom of motion is limited in order to achieve efficient data
optimization, several players investing in MR technology are developing solutions for next generation
immersive content. Recently, Google announced early hints of Seurat [1], a surface light-field rendering
technology they have been working on. Disney Research also published an initial academic paper [2]
describing the real-time animated light-field approach they have been investigating. Both approaches
limit freedom of motion within the content to a confined area and optimize the needed visual data
based on selected viewing areas.
InterDigital’s Innovation Partners research group has also been working in the area of 3DoF+ solutions, developing several
enablers for the next level of immersion for MR content, including for mobile devices with relatively low performance.
One approach analyzes and reduces full 3D scenes into a number of discrete spherical video layers that can be efficiently
compressed, streamed and rendered with motion parallax on client devices. With this approach, instead of using the more
common per-pixel depth values or full 3D reconstructions of the scene, the virtual view is divided into a number of
concentric spherical 360-degree video layers based on the depth values of the scene. The viewpoint can then be shifted around the
central point of the spherical video layers according to head tracking, approximating motion parallax by varying the speeds of
motion of different video layers at different distances. This is similar to how traditional 2D cartoons create the illusion of motion
parallax by moving several image layers at different speeds. Essential to the illusion are the distances and number of spherical
video layers quantizing the overall depth of the scene, which can be dynamically adjusted during content delivery based on
scene depth variation, available communication bandwidth and rendering performance of the viewing client, thus optimizing
the quality of the experience for the given resources.
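The layering idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the method used in the solution described above: the layer radii are spaced uniformly in inverse depth (a common choice, because parallax scales with inverse distance), each scene element is assigned to the nearest layer, and the function names are hypothetical.

```python
import math


def layer_radii(near_m, far_m, num_layers):
    """Radii of concentric spherical layers, uniform in inverse depth."""
    if num_layers < 2:
        return [near_m]
    inv_near, inv_far = 1.0 / near_m, 1.0 / far_m
    step = (inv_far - inv_near) / (num_layers - 1)
    return [1.0 / (inv_near + step * i) for i in range(num_layers)]


def assign_layer(depth_m, radii):
    """Index of the layer whose inverse radius is closest to the inverse depth."""
    return min(range(len(radii)), key=lambda i: abs(1.0 / depth_m - 1.0 / radii[i]))


def angular_shift_deg(radius_m, head_shift_m):
    """Apparent angular shift of content straight ahead on a layer when the
    head moves sideways by head_shift_m."""
    return math.degrees(math.atan2(head_shift_m, radius_m))


radii = layer_radii(near_m=1.0, far_m=100.0, num_layers=4)
for i, r in enumerate(radii):
    print(f"layer {i}: radius {r:6.2f} m, shift {angular_shift_deg(r, 0.1):5.2f} deg")
```

Changing `num_layers` trades depth resolution against the number of video layers that have to be streamed and rendered, which is the adjustment knob described above.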
Another approach analyzes the content and intelligently and dynamically
packages it into layers of spherical video and 3D assets. This approach
takes into account how to optimize the quality of experience while
providing motion parallax, by leveraging the 3D rendering and video
rendering resources available on both the content server and the client
device. It extends the idea of dynamically reducing the 3D
information needed to support full 6DoF navigation, towards 3DoF+
with limited navigation, to avoid bottlenecks in the content distribution
pipeline. In this approach, 6DoF content is analyzed to estimate how
well motion parallax can be approximated for different visual elements
with spherical video layers and which elements benefit most from full
3D information. Based on this analysis and the limitations of communication
bandwidth and rendering performance, the virtual scene is then
distributed to a viewing client as a combination of spherical video layers
and full 3D assets, dynamically adjusting the content format of each
visual element on a per-client basis.
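The per-element format decision can be pictured as a budgeted selection problem. The sketch below uses a simple greedy heuristic chosen for this illustration only; the scoring fields, the heuristic and all names are hypothetical and are not the analysis performed by the solution described above.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    parallax_error: float   # hypothetical score: how poorly video layers approximate it
    asset_cost_mbps: float  # hypothetical extra bandwidth to send it as a 3D asset (> 0)


def choose_formats(elements, budget_mbps):
    """Greedy sketch: every element defaults to a spherical video layer, then
    the elements with the highest error reduction per unit of bandwidth are
    upgraded to full 3D assets while the client's budget allows."""
    formats = {e.name: "video_layer" for e in elements}
    remaining = budget_mbps
    ranked = sorted(elements, key=lambda e: e.parallax_error / e.asset_cost_mbps,
                    reverse=True)
    for e in ranked:
        if e.asset_cost_mbps <= remaining:
            formats[e.name] = "3d_asset"
            remaining -= e.asset_cost_mbps
    return formats


scene = [
    Element("character", parallax_error=0.9, asset_cost_mbps=4.0),
    Element("table", parallax_error=0.6, asset_cost_mbps=2.0),
    Element("sky", parallax_error=0.05, asset_cost_mbps=6.0),
]
print(choose_formats(scene, budget_mbps=6.0))  # a well-connected client
print(choose_formats(scene, budget_mbps=3.0))  # a constrained client
```

Running the same scene with two budgets yields different packaging for each client, mirroring the per-client adjustment described above.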
These solutions adapt the degree of data optimization and the amount of data to the communication and processing resources of the MR
client device and content server, thus maximizing perceived quality given the available resources. Both approaches use spatial
information that is much simpler than the full 3D scene description needed for 6DoF navigation. Reducing the
spatial information needed to recreate motion parallax in this way may lead to the ability to recreate motion parallax for
existing standard 2D content. We have already seen some early examples of deep neural networks trained with stereoscopic
content inferring depth information for standard 2D video. Estimating depth information at a level that permits the separation
of visual elements into different depth layers, combined with a way to recreate missing (occluded) information, will enable some
level of motion parallax for legacy 2D content. Remastering this content would radically alter the dynamics of commercial
adoption of new MR devices by allowing older experiences to be consumed with a completely new level of immersion.
In this white paper, we have identified the limitations of current-generation MR content, and the fact that solutions enabling
full 6DoF navigation of photorealistic MR content are in high demand, as full 6DoF navigation will unlock the true potential of
MR. We have also explained how, rather than aiming for full 6DoF navigation, similar consumer quality of experience can be
achieved on MR devices with limited performance characteristics using a 3DoF+ approach.
WP_201804_009 Innovation Partners | White Paper
As the limitations associated with content navigation become better understood among MR
explorers, and the value of solutions approaching full 6DoF navigation in photorealistic virtual
experiences is realized, we expect rapid development and adoption of 3DoF+ solutions that move
MR immersion to the next level. This new level of immersion will usher in a completely new era of
interactive media. In our wildest imagination, this era will not be limited to only new content, but will
unlock a new, immersive way of experiencing the traditional 2D content that has been created over
In addition to the solutions explained briefly in this white paper, InterDigital is continuously working
on key enabling technologies that bring new virtual experiences to consumers, pushing the
boundaries of MR quality of experience.