Remote Augmentation in Multipoint Telepresence
Augmented Reality (AR) is a concept and a set of technologies for merging of real and virtual elements to produce new visualizations – typically a video – where physical and digital objects co-exist and interact in real time. Most AR applications support real-time interaction with content (AR scene with virtual objects) which has been produced in advance or offline. In many cases, for example in ad hoc remote guidance applications, AR interaction over a network is required. AR interaction over a network means, for example, adding virtual objects into a video feed captured in a remote location. AR interaction over networks includes solutions for both: 1) real-time situations, where users are simultaneously interacting with each other and with AR content, and 2) off-line situations, where the users are not simultaneously interacting with each other, but still want to produce or share AR content over a network. Support for remote AR interaction needs to also be available when real-time and offline sessions are following or alternating with each other. This requires that the AR content be produced, stored, and updated seamlessly in successive sessions.
2.1 Local Augmented Reality
3D models and animations are the most obvious virtual elements to be visualized in AR. However,
AR objects can basically be any digital information for which spatiality (3D position and orientation in
space) gives added value such as, for example, pictures, videos, graphics, text, and audio.
Augmented reality visualizations require a means to see augmented virtual elements as a part of the
physical view. This can be implemented by, for example, a tablet with an embedded camera, which
captures video from the user’s environment and shows it together with virtual elements on its display.
Augmented reality glasses, either video-see-through or optical-see-through, and either monocular or
stereoscopic, can also be used for viewing. The user of the viewing device can also be a remote user
watching, over a network, the same augmented scene as the local user.
AR visualizations can be seen correctly from different viewpoints, so that when the user changes
his or her viewpoint, virtual elements remain in place and act as if they are part of the physical scene.
This requires that the positions of virtual objects are defined with respect to the 3D coordinates of the
environment, and that tracking technologies are used for tracking the viewer's (camera) position with
respect to the environment.
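This anchoring requirement can be illustrated with a minimal pinhole projection: a virtual object stored in world coordinates is re-projected every frame using the tracked camera pose, so it appears fixed in the scene as the camera moves. The sketch below assumes, for simplicity, a camera that rotates only about the vertical axis; all function and parameter names are illustrative, not from any particular AR toolkit.

```python
import math

def project_point(p_world, cam_pos, yaw, f, cx, cy):
    """Project a 3D world point into pixel coordinates for a camera at
    cam_pos, rotated about the vertical (y) axis by yaw radians, using a
    pinhole model with focal length f and principal point (cx, cy)."""
    dx = p_world[0] - cam_pos[0]
    dy = p_world[1] - cam_pos[1]
    dz = p_world[2] - cam_pos[2]
    # Rotate the world-space offset into the camera frame.
    xc = math.cos(yaw) * dx - math.sin(yaw) * dz
    yc = dy
    zc = math.sin(yaw) * dx + math.cos(yaw) * dz
    if zc <= 0:
        return None  # the point is behind the camera
    # Perspective divide; image y grows downward.
    return (cx + f * xc / zc, cy - f * yc / zc)
```

With the tracked pose updated every frame, the same world coordinates project to different pixels as the camera moves, which is exactly what keeps the virtual object visually "in place."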
Traditionally, printed graphical markers are placed in the environment to be detected from a video
and used as a reference both for augmenting virtual information in the right orientation and scale,
and for tracking the viewer's (camera) position. Recently, much research has been done on so-called
markerless AR, which – instead of sometimes disturbing markers – relies on detecting distinctive
features of the environment and using them for augmenting virtual information and tracking
the user's position.
The majority of AR applications are meant for local viewing of the AR content; in other words, the
viewer is also in the same physical space which is being augmented. However, as the result is typically
shown as a video on a display, augmented video can naturally also be seen remotely over a network.
Innovation Partners | White Paper
2.2 Remote Augmented Reality
Producing AR content remotely – i.e., adding virtual objects and
animations to be augmented over a network – is a very useful feature
in many applications, such as remote guidance applications where
a remote expert can add virtual objects that are augmented to a
video viewed by a local user. A poorly supported area with growing
importance is delivering virtual objects in telepresence and social
media applications. The common feature of all these applications is the
need for synchronous interaction between two or more users, both
AR content producers and consumers. Here synchronous interaction
means the remote and local users have a video conference and see the
virtual objects that are added to the video stream in real time. For many
applications, supporting real-time AR interaction is quite demanding due
to required bandwidth, processing time, latency, etc.
Asynchronous interaction is about delivering and sharing information
(messages, for example) without hard real-time constraint. In many
cases, asynchronous interaction is preferred, as it does not require
simultaneous presence from the interacting parties. In teleconferencing,
asynchronous interaction may happen after the live conference has
ended to allow participants to add virtual objects to other participants’
environments. The participants can see the virtual objects when later
accessing the conference space.
In many applications, supporting synchronous and asynchronous functionalities in parallel is necessary or beneficial. They can
also be mixed in more integral ways in order to create new ways of interacting.
2.2.1 Producing AR content remotely
If graphical markers are attached to the local environment, remote augmentation can be made simply by detecting the
markers’ pose (position, orientation, and scale) from the transmitted local video, and aligning virtual objects with respect
to them. This is fairly simple and fast – and can even be partly automated – and is suited well to synchronous interactions.
An early example of this approach is given by Hirokazu Kato and Mark Billinghurst [Kato & Billinghurst 1999].
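Marker-based alignment amounts to chaining rigid transforms: the tracker reports the marker's pose in camera coordinates, and the content author specifies the virtual object's pose relative to the marker; composing the two yields the object pose used for rendering. A minimal pure-Python sketch with illustrative example values (the poses and names are invented, not from the cited work):

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def pose(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a translation."""
    return [rotation[0] + [translation[0]],
            rotation[1] + [translation[1]],
            rotation[2] + [translation[2]],
            [0, 0, 0, 1]]

IDENTITY3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Hypothetical values: the tracker sees the marker 1.5 m in front of the
# camera; the virtual object is authored 10 cm above the marker origin.
T_cam_marker = pose(IDENTITY3, [0.0, 0.0, 1.5])
T_marker_obj = pose(IDENTITY3, [0.0, 0.1, 0.0])
T_cam_obj = mat_mul(T_cam_marker, T_marker_obj)  # pose used for rendering
```

Because the composition is re-evaluated every frame from the freshly detected marker pose, the object follows the marker as the camera or the marker moves.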
Markerless 3D feature-based methods can be used in cases when visible markers are too disturbing or do not work at
all, like in large-scale augmentations outdoors. Typical for feature-based methods is that they require more advance
preparations than marker-based methods. They also require more complex data capture, more complex processing, and
more complex tools for AR content production compared to a marker-based approach. In addition, they don’t give as
explicit a scale reference for the augmentations as when using markers.
Feature-based methods can be used for augmenting remote spaces, if the required preparations (for example, 3D
scanning of the environment) can be made in advance, and if the local environment stays stable enough so that the
results of those preparations can be used repeatedly, in several synchronous sessions. In these solutions, 3D scanning of
the local space can be made by using a moving camera or a depth sensor.
Marker-based methods can be applied even if there are no predefined markers in the local environment. In this approach
the application allows a remote user to select a known feature set (e.g., a poster on the wall or a logo of a machine) from
the local environment. This set of features used for tracking is in practice an image that can be used similarly as markers
to define 3D location and 3D orientation [Reitmayr et al. 2007, Siltanen et al. 2015].
With restrictions, even unknown planar features, recognized and defined objectively by the remote user, can be used to
help remote augmentation. In this case, however, the depth and scale cannot be derived precisely from the remote video,
and the augmentation is often restricted to replacing planar feature sets with other subjectively
scaled planar objects (e.g., a poster with another poster).
For local environments with no features known in advance, a method called
simultaneous localization and mapping (SLAM) has been developed [Klein & Murray 2007].
SLAM simultaneously estimates the 3D pose of the camera and 3D features of the scene from a
live video stream. The method results in a set of 3D points, which can be used by a remote user
to align virtual objects to a desired 3D position, while the local user is moving a camera to show
the local environment from different angles.
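One simple way for the remote user to place an object using such a sparse 3D point set is to cast a ray through the clicked pixel and snap the object to the nearest map point along that ray. The helper below is an illustrative sketch of that idea, not an algorithm from the cited work:

```python
def closest_point_to_ray(points, origin, direction):
    """Return the 3D map point nearest to a pick ray, e.g. the ray
    through the pixel a remote user clicked on the video view."""
    norm = sum(d * d for d in direction) ** 0.5
    d = [c / norm for c in direction]  # unit ray direction
    best, best_dist = None, float("inf")
    for p in points:
        v = [p[i] - origin[i] for i in range(3)]
        t = sum(v[i] * d[i] for i in range(3))  # projection onto the ray
        if t < 0:
            continue  # behind the viewer
        closest = [origin[i] + t * d[i] for i in range(3)]
        dist = sum((p[i] - closest[i]) ** 2 for i in range(3)) ** 0.5
        if dist < best_dist:
            best, best_dist = p, dist
    return best
```

A real system would typically refine this with plane fitting or surface reconstruction around the picked point, but nearest-point snapping already gives the remote user a plausible 3D anchor.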
Local 3D features can also be captured with a set of fixed video cameras, each filming the
environment from different angles. These streams can be used to calculate a set of 3D points
that can be used by the remote user [Seitz et al. 2006]. Optionally, the above-described 3D point
cloud can be created by using a depth camera [Izadi et al. 2011]. For making the point cloud,
related camera and/or depth sensor-based solutions described for 3D telepresence are also
applicable [Maimone et al. 2012].
2.3 Coding and transmission of 3D data
Coding and transmission of real-time captured 3D data requires much more bandwidth than real-time
video. For example, the raw data bitrate of the Kinect 1 sensor is almost 300 MB/s (9.83 MB per frame).
So, obviously, efficient compression methods are needed. Compression methods for Kinect-type
depth data (either RGB-D or ToF) are, however, still in their infancy. The amount of real-time captured
depth sensor data (color plus depth) is much larger than that of a video camera. The
same relative comparison also holds for multi-sensor systems compared to multi-camera systems.
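The bitrate figure above follows from simple arithmetic (the 30 fps frame rate is the nominal Kinect 1 rate, stated here as an assumption):

```python
# Raw data rate of the Kinect 1 stream cited above: 9.83 MB per frame.
mb_per_frame = 9.83
fps = 30                              # nominal Kinect 1 frame rate (assumed)
raw_rate_mb_s = mb_per_frame * fps   # ~294.9 MB/s, i.e. "almost 300 MB/s"
```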
US patent application US 2013/0321593 (Dec. 5, 2013 by Microsoft) is an example of a method for
real-time transmission of 3D geometry and texture data, on demand, from a viewpoint specified by
the rendering client. The application describes a method of transmitting only the texture data and
geometry necessary to render the view from a selected viewpoint, instead of transmitting the whole
3D scene.
3 New methods for supporting unassisted remote AR
As teleconferencing systems have become
increasingly popular, people are accustomed to
using real-time video in their communication.
Enabling a remote user to enrich remote
live video by virtual information is a natural
extension to standard teleconferencing that
brings added value in many applications. In
order to add virtual objects to the video feed,
the remote user needs to have 3D information
of the environment to position the virtual
objects into 3D space. The 3D information can
be captured as described in Section 2.2 and
transmitted to the remote user. The remote
user can then use the 3D model as a reference
when positioning virtual information with
respect to the environment.
3.1 Unassisted feature-based remote AR
This white paper introduces new solutions for unassisted remote AR, which do not require assistance for capturing 3D
features of the local environment. Advance preparations or local assistance can be avoided by capturing local space with a fixed
setup of 3D cameras and/or sensors, and providing this information for a remote user to make accurate 3D augmentations. 3D
feature capture and reconstruction has been studied extensively, and many technologies and solutions exist. Since our goal is
to support real-time applications and not to require local assistance in 3D feature capture, methods based on moving a single
camera or depth sensor in space are not applicable.
Solutions for real-time, unassisted 3D capture are used, for example, in real-time 3D telepresence. In these systems, multi-
sensor capture is typically used for deriving a 3D representation of the captured scene. In [Kuster 2011], a combination of three
color cameras and two Kinects (using their IR parts only) were used to make high-quality view synthesis at about 7fps, and the
framerate was expected to increase by fourfold through the parallelization of algorithms. In [Maimone et al. 2012], five Kinects
were used for capturing and rendering conferencing participants at an average rate of 8.5fps. New viewpoints to an unchanged
volume were rendered at 26.3fps.
Related US patent US 8,872,817 B2 (Oct. 28, 2014 by ETRI) describes an apparatus and method for real-time 3D reconstruction.
Another patent, US 8,134,556 B2 (Mar. 13, 2012 by Elsberg et al.), describes a method for forming a photorealistic view on
demand of a three-dimensional simulation model using a ray-tracing method.
3.2 Adding a virtual object to a remote environment
The ability to enrich remote live video by virtual information does
not exist in standard teleconferencing systems yet, partly because
transmitting real-time captured 3D information over a network has
some disadvantages. For example, the bandwidth requirements of 3D
information are much higher than those of normal video. Also, in order to
generate an accurate 3D representation, the environment must be
captured from several directions. Normally, users are accustomed
to pointing a video camera so that it captures only those parts
of the environment that the user wants to show to remote
participants of the video conference. In the case of 3D capture,
there is no obvious way to restrict the information shown to the
remote participants, because the whole point of 3D capture is to
gather as much complete 3D information as possible.
In the new InterDigital solution [US 62/316,884], the user may
restrict the 3D captured information transmitted to the remote
users using the same method as in traditional video conferencing:
by pointing a video camera in the desired direction. Even though
the local 3D capture set-up generates 3D information of the whole
environment, the system computes which 3D objects are visible in the camera view and transmits only those to the remote
participants. The user may use the camera of a phone or a tablet to show which area is visible to the remote users (Figure 1),
or the laptop camera used in the teleconference may restrict the visible area while the local user is using it to communicate
with the remote participants.
Since the amount of 3D information within the visible area is smaller than the whole environment, the solution reduces the
bandwidth requirements when transmitting the 3D model to the remote participants. Also, since the remote user sees the local
environment only from one direction at a time, the InterDigital solution suggests transmitting virtual camera views from the
angle requested by the remote user, reducing the bandwidth requirements to the same as those of a normal video feed.
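The visibility restriction can be pictured as a simple cone test: only captured 3D points that fall inside the local camera's field of view are transmitted. The sketch below is an illustrative reconstruction of the idea, not the patented algorithm; all names and parameters are invented:

```python
import math

def in_view(point, cam_pos, cam_dir, half_fov_rad, max_range):
    """Keep a 3D point only if it lies inside a cone around the local
    camera's viewing direction (cam_dir must be a unit vector) and
    within max_range metres."""
    v = [point[i] - cam_pos[i] for i in range(3)]
    dist = math.sqrt(sum(c * c for c in v))
    if dist == 0 or dist > max_range:
        return False
    cos_angle = sum(v[i] * cam_dir[i] for i in range(3)) / dist
    return cos_angle >= math.cos(half_fov_rad)

def visible_subset(points, cam_pos, cam_dir, half_fov_rad, max_range=10.0):
    """Select only the captured 3D points the local camera can see,
    so only that subset is transmitted to remote participants."""
    return [p for p in points
            if in_view(p, cam_pos, cam_dir, half_fov_rad, max_range)]
```

A production system would use the camera's full frustum and occlusion information rather than a cone, but the effect is the same: the transmitted 3D data shrinks to roughly what the ordinary video view would reveal anyway.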
Figure 1. Restricting a visible area: 3D information is transmitted only from the area visible to the local user's camera.
3.3 Calibrating capture setup
The 3D information is generated by a capture
setup, using a set of sensors (that may be
depth cameras or normal video cameras)
hanging on room walls (cf. Figure 1). The
sensor system needs to be calibrated in order
to create a common coordinate system for the
whole setup, including the user’s camera. In
the InterDigital solution, the sensor setup is
self-calibrating, so that a user need only follow
basic instructions for the assembly of sensors
and the user’s camera in the environment,
and the system takes care of the calibration.
The calibration is implemented so that it
allows flexible re-configuration of the setup,
for example to enable better capture of some
areas of the space.
In the InterDigital solution [US 62/316,884], automatic calibration is achieved by using so-called
camera markers [US 62/202,431], where the sensors are combined with a marker of known
dimensions, as capture set-up sensors (Figure 2). The markers may be printed on paper or the
camera markers may have a display where the marker is shown when needed.
The capture setup consists of wide-angle cameras with markers. The capture setup needs to be
calibrated in order to create a common coordinate system for the whole setup, including the user’s
video camera. A common coordinate system is needed for the system to be able to augment virtual
objects to the video feed transmitted from the user’s camera to the remote participants. The camera
markers are positioned so that each of the cameras captures at least one of the markers, allowing the
system to calibrate that camera automatically, using prior art methods, such as [Brückner 2010]. If
the marker is shown on the camera marker’s display, the marker needs to be shown only during the
calibration process and other times the display may be used for different purposes.
The user's camera also needs to be calibrated to get accurate position and orientation information in
the local space. The user’s camera calibration requires orienting the camera so that it captures at least
one of the camera markers. When the camera is calibrated, the virtual objects can be augmented to
the video feed it captures and the virtual objects remain in correct 3D position as long as the camera
captures enough distinctive features from the local space.
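The calibration step amounts to chaining rigid transforms: once a sensor or the user's camera observes a camera marker whose pose in the common coordinate system is known, the device's own pose follows as T_world_cam = T_world_marker · (T_cam_marker)⁻¹. A sketch with illustrative values (names and poses are invented for the example):

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_pose(T):
    """Invert a rigid 4x4 transform [R|t]: the inverse is [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]  # transpose of R
    t = [-sum(Rt[i][j] * T[j][3] for j in range(3)) for i in range(3)]
    return [Rt[0] + [t[0]], Rt[1] + [t[1]], Rt[2] + [t[2]], [0, 0, 0, 1]]

def translation(x, y, z):
    """Pure-translation rigid transform (identity rotation)."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

# The marker's pose in the common (world) frame is known from the
# calibrated setup; the user's camera observes the same marker 1 m ahead.
T_world_marker = translation(2.0, 0.0, 0.0)
T_cam_marker = translation(0.0, 0.0, 1.0)
T_world_cam = mat_mul(T_world_marker, invert_pose(T_cam_marker))
```

In practice the marker observation itself (recovering T_cam_marker from an image of a marker of known size) is a standard pose-estimation problem; the chaining above is what ties every device into the one shared coordinate system.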
3.4 Synchronous and asynchronous interaction
Using the capture setup described above, the 3D model of local environment is produced in real-
time. The model can be used by local or remote user(s) as a spatial reference for producing an
accurate AR scene, i.e., a compilation of virtual elements, each with precise position and orientation.
In synchronous interaction, this 3D data is provided for remote users together with real-time video
view of the local space. The video view is generated by the local user’s video camera for example on a
laptop or a tablet. A remote user can edit the local AR scene using both 3D data and video view, and
other users may see the scene augmented to the video feed they receive in real time.
However, synchronous interaction is only one way people interact with each other. Synchronous
interaction is possible only when people are in the same space (virtual or real) at the same time. People
are accustomed to interacting with each other also asynchronously, by changing real space, e.g., by
leaving notes or moving objects, so that another person sees the change when entering the space.
Figure 2. Self-calibrating capture set-up: camera markers with wide-angle cameras.
For supporting asynchronous interactions, the InterDigital solution [US 62/320,098] allows remote users to change the AR
scene related to the local user’s physical space when the local user is not present in a teleconference. The 3D data generated
during synchronous sessions is stored and can be accessed by the remote users for AR scene creation even when the local
user has left the teleconference. Again, the AR object’s pose can be set using different perspective views to the 3D data. The
AR object scene generated during an asynchronous session can then be augmented to the local user’s video view in the next
synchronous session. Again, the local user may restrict the space visible to the remote users as described in Section 3.2.
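The stored AR scene that bridges synchronous and asynchronous sessions can be pictured as a small persistent data structure: virtual objects with poses in the room's common coordinate frame, editable while the local user is away and re-rendered at the next live session. The class below is a hypothetical sketch of such a store, not InterDigital's design; all names are illustrative.

```python
import json
import time

class ARSceneStore:
    """Persistent AR scene: virtual objects posed in the room's common
    coordinate frame, surviving between conference sessions."""

    def __init__(self):
        self.objects = {}

    def place(self, obj_id, content, position, orientation, author):
        """Add or update a virtual object (a note, picture, model, ...)."""
        self.objects[obj_id] = {
            "content": content,          # e.g. text or a media URI
            "position": position,        # (x, y, z) in the room's frame
            "orientation": orientation,  # e.g. quaternion (w, x, y, z)
            "author": author,
            "updated": time.time(),
        }

    def to_json(self):
        """Serialize the scene so it survives after the session ends."""
        return json.dumps(self.objects)

    @classmethod
    def from_json(cls, data):
        """Restore a scene saved from an earlier session."""
        store = cls()
        store.objects = json.loads(data)
        return store
```

At the start of the next synchronous session, the restored scene is simply rendered into the local user's video feed using the same common coordinate system established by the calibrated capture setup.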
The InterDigital solution enables support for remote AR interaction using familiar video-based communication. A remote user
can add virtual objects (any digital information for which spatiality gives added value, such as, for example, pictures, videos,
graphics, text, and audio) to the local scene, which will be augmented to the video feeds transmitted from the local site over
a network to remote participants as well as presented to the local user through augmentation. The solution supports both
synchronous and asynchronous interactions using a fixed real-time capture setup.
A special feature of the system is management of user privacy, where the user has control over what 3D information he or she
shows to remote parties just by adjusting the real-time video view.
Since the disclosed concept adds value to already familiar video communication solutions, the new solution can be
expected to be easy for users to accept and easy to implement on top of existing solutions. The
solution also supports non-symmetrical use cases, where not all users need to have the 3D capture setup installed in
their environments.
Kato, H. & Billinghurst, M. (1999). "Marker tracking and HMD calibration for a video-based augmented
reality conferencing system." Proceedings 2nd IEEE and ACM International Workshop on Augmented
Reality (IWAR '99).
Reitmayr, G.; Eade, E.; Drummond, T. W. (2007). "Semi-automatic annotations in unknown
environments." 6th IEEE and ACM International Symposium on Mixed and Augmented Reality
(ISMAR 2007).
Siltanen, P.; Valli, S.; Ylikerälä, M.; Honkamaa, P. (2015), “An Architecture for Remote Guidance
Service.” 22nd ISPE Concurrent Engineering Conference (CE2015).
Klein, G. & Murray, D. (2007). "Parallel tracking and mapping for small AR workspaces." 6th IEEE
and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007).
Seitz, S. M.; Curless, B.; Diebel, J.; Scharstein, D.; and Szeliski, R. (2006), “A comparison and evaluation
of multi-view stereo reconstruction algorithms.” IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’06).
Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; and
Freeman, D. (2011). “KinectFusion: real-time 3D reconstruction and interaction using a moving
depth camera.” 24th ACM Symposium on User Interface Software and Technology (UIST 2011).
Maimone, A. & Fuchs, H. (2012), "Real-time volumetric 3D capture of room-sized scenes for
telepresence." 3DTV Conference: The True Vision – Capture, Transmission and Display of 3D Video
(3DTV-CON 2012).
Kuster, C.; Popa, T.; Zach, C.; Gotsman, C.; and Gross, M. (2011), "FreeCam: A hybrid camera system
for interactive free-viewpoint video." 16th Annual Workshop on Vision, Modeling and Visualization
(VMV 2011).
Brückner M. & Denzler J. (2010), “Active self-calibration of multi-camera systems.” 32nd Annual
Symposium of the German Association for Pattern Recognition (DAGM 2010).
InterDigital Patent Applications Referenced:
Valli, S. T., Siltanen P. K., Apparatus and Method for Supporting Interactive Augmented Reality
Functionalities, US 62/316,884, filed as a Provisional patent application on 01 Apr 2016.
Valli, S.T., Siltanen P.K., Apparatus and Method for Supporting Interactive Augmented Reality
Functionalities, US 62/202,431, filed as a Provisional patent application on 07 Aug 2015.
Valli, S.T., Siltanen P.K., Apparatus And Method For Supporting Synchronous And Asynchronous
Augmented Reality Functionalities, US 62/320,098, filed as a Provisional patent application on
08 Apr 2016.