Augmented Reality

Topic Description

Augmented reality (AR) is a field of computer research that aims at supplementing reality by mixing computer-generated data with real-world environments. In computer vision, the computer-generated data are typically 3D models, text, or videos that become virtual objects of the scene when displayed as if they were really there, at the right time; they may also comprise music, speech, and other kinds of multimedia. Augmented reality has to be both structurally and contextually synchronized with the real world. Structural coherence is needed to deceive the human eye and make virtual objects appear and behave as if they really existed in the scene. Contextual coherence helps provide information selected according to the user's needs, without flooding them with undesired data.

A familiar example of AR is the match score shown in television broadcasts of football games. The real-world elements are the football field and the players, while the virtual elements are the score numbers and the team flags, which are drawn over the image by computers in real time, warped so that they appear as if laid on the field.

Current applications include, but are not limited to:

  • Support for complex tasks, e.g. assembly, maintenance, surgery, by inserting additional information into the field of view
  • Enhanced sightseeing: labels or other text related to the objects/places seen, or reconstructions of ruins, buildings, or even landscapes as they appeared in the past
  • Navigation devices, in buildings, outdoors, in cars, and in airplanes
  • Visualization of architecture (virtual reconstruction of destroyed historic buildings as well as simulation of planned construction projects)
  • Simulation, e.g. flight and driving simulators
  • Collaboration of distributed teams, e.g. conferences with real and virtual participants
  • Entertainment and education, e.g. virtual objects in museums and exhibitions

Most AR applications can be dramatically boosted by see-through display glasses or other special visual devices. Driving assistance applications inside cars or airplanes, for instance, obtain the same effect by deploying head-up displays integrated into the windshield.

The computer vision challenge in AR is the development of algorithms and systems that operate automatically, regardless of the user's motion, inside real environments.


The most crucial task of every augmented reality approach is pose estimation, namely the determination of the position and orientation of the observer with respect to the environment. Every pose estimation algorithm relies on pattern recognition and matching. The system is usually trained with a set of patterns that model the appearance and/or the 3D structure of specific objects in the environment. Those patterns can be fiducial markers expressly inserted into the scene (marker-based) or natural markers already present in the environment (markerless). At run time, the system tracks the visible patterns in the observer's field of view. From the deformations affecting the recognised patterns, the system derives an estimate of the position and orientation of the observer with respect to the environment, called the pose. Once the pose is known, computer graphics techniques allow virtual objects to be rendered consistently with the structure of the scene, so that they appear as they would if they truly existed in the environment. The observer's pose and the identity of the recognised patterns are usually exploited to deliver location-based services.
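The deformation-to-pose step above can be sketched with the classic Direct Linear Transform: given point correspondences between the trained pattern and its appearance in the current frame, a 3x3 homography relating the two is estimated. A minimal numpy sketch, assuming noise-free correspondences (the function name and the synthetic points are illustrative, not the lab's code):

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography mapping src points to dst points
    via the Direct Linear Transform (DLT). Illustrative only; a real
    system would normalize the points and use RANSAC against outliers."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A, i.e. the last right
    # singular vector of its SVD.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

# Synthetic check: warp known pattern points with a known homography,
# then recover that homography from the correspondences.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, 2.0],
                   [1e-3, 2e-3, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100], [50, 30]], float)
pts_h = np.c_[src, np.ones(len(src))] @ H_true.T
dst = pts_h[:, :2] / pts_h[:, 2:]
H_est = estimate_homography(src, dst)
```

In a live system the correspondences would come from matching local features (e.g. SIFT) between the training pattern and the current frame rather than from a synthetic warp.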

The AR system developed at CV LAB is markerless and relies on planar patterns, i.e. objects naturally present in the scene. Each pattern is encoded as a spatial cluster of natural local features, e.g. SIFT. The geometric transformation between training patterns and objects visible in a given view is taken as representative of the motion of the observer with respect to the scene. Different analytical models can be employed to perform parameter estimation; these include the homography, the essential/fundamental matrix, and an explicit 3D roto-translation. Our method performs homography-based estimation when dealing with uncalibrated cameras or varying intrinsics, whereas explicit 3D estimation is employed when the intrinsics are known and fixed. In specific circumstances, mosaicing techniques can be leveraged to significantly enhance the accuracy of the estimated pose; this also allows for wider observer trajectories and improved estimation stability.
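When the intrinsics K are known and fixed, the explicit 3D roto-translation of a planar pattern can be recovered from its homography: taking the pattern to lie on the plane z = 0, the homography satisfies H ∝ K [r1 r2 t]. A hedged numpy sketch of this standard planar pose recovery (the function name, intrinsics, and synthetic pose are assumptions for illustration, not the lab's exact method):

```python
import numpy as np

def pose_from_homography(H, K):
    """Recover rotation R and translation t from the homography of the
    plane z = 0, given the intrinsic matrix K. Standard planar pose
    recovery, shown for illustration only."""
    M = np.linalg.inv(K) @ H
    # The first two columns of K^-1 H are equally scaled rotation columns;
    # normalize by their average norm.
    s = 2.0 / (np.linalg.norm(M[:, 0]) + np.linalg.norm(M[:, 1]))
    M = s * M
    r1, r2, t = M[:, 0], M[:, 1], M[:, 2]
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    # Project onto the nearest rotation matrix to absorb noise.
    U, _, Vt = np.linalg.svd(R)
    return U @ Vt, t

# Synthetic check: build H = K [r1 r2 t] from a known pose, then recover it.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
a = 0.3
R_true = np.array([[np.cos(a), 0.0, np.sin(a)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(a), 0.0, np.cos(a)]])
t_true = np.array([0.1, -0.2, 2.5])
H = K @ np.column_stack([R_true[:, 0], R_true[:, 1], t_true])
R_est, t_est = pose_from_homography(H, K)
```

With varying or unknown intrinsics this decomposition is not available, which is why the uncalibrated case falls back to homography-based estimation.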

Experiments with two feature extraction and matching algorithms, namely SIFT and SURF, were conducted in order to assess their relative merits along two dimensions: achievable frame rate and realism of the rendered contents (i.e. correct pose estimation, stability, robustness to noise). The best augmentation illusion comes from the use of SIFT. Nevertheless, a solution based on the standard SIFT proposal is too slow for real-time AR. A near real-time system can be obtained by replacing SIFT with SURF, which is explicitly designed for speed, avoiding the initial image-doubling pre-processing and working on sub-sampled (320x240) frames. However, this solution cannot really provide a convincing illusion of augmentation, because of short spurious estimations and jitter. These errors are removed or mitigated by exploiting temporal consistency across frames: a module based on Support Vector Machines used in ε-regression mode provides a consistent, smoothed pose, merging information from previous frames with the estimate computed on the current frame.
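The jitter-reduction idea can be sketched with scikit-learn's ε-SVR applied to a pose parameter over a sliding window of past frames; the regressor's prediction at the current frame serves as the smoothed value. The window size, kernel, regularization, and the synthetic angle stream below are assumptions for illustration, not the parameters of the actual module:

```python
import numpy as np
from sklearn.svm import SVR  # epsilon-insensitive Support Vector Regression

rng = np.random.default_rng(0)

# Hypothetical stream of one pose parameter (say, a rotation angle in
# radians): a smooth camera motion corrupted by per-frame jitter.
t = np.arange(30, dtype=float)
true_angle = 0.05 * t
jittery = true_angle + rng.normal(0.0, 0.02, size=t.size)

# Fit an epsilon-SVR on a sliding window of recent frames and take its
# prediction at the current frame as the smoothed pose parameter.
window = 10
svr = SVR(kernel="linear", C=100.0, epsilon=0.01)
svr.fit(t[-window:, None], jittery[-window:])
smoothed = svr.predict(t[-1:, None])[0]
print(f"raw: {jittery[-1]:.3f}  smoothed: {smoothed:.3f}  true: {true_angle[-1]:.3f}")
```

In a full system each pose parameter (rotation and translation components) would be smoothed independently, trading a small amount of latency for stability.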


According to the dimensionality of the virtual objects (thin layer or solid), we distinguish augmentation as follows:

  • Layered, or 2D, augmented reality
Video processed at about 1 frame/s with SIFT on 640x480 frames
  • Solid, or 3D, augmented reality
Video processed at about 1 frame/s with SIFT on 640x480 frames

To illustrate the performance of the SVR-based jitter-reduction module, we present a challenging sequence for real-time markerless pose estimation based on natural features. Difficulties come from the small amount of texture on the reference object, the relatively fast camera movement, and the wide range of camera poses.

Videos processed at about 10 frames/s with SURF+SVRs, without image-doubling pre-processing, on 320x240 frames

Some movies are rather large (about 30 MB); we suggest downloading them and playing them on your PC.


  • Virtual assistant for tourists visiting a historical museum
Video processed at about 1 frame/s with SIFT on 640x480 frames

Research partners

This research work is carried out in collaboration with:

Centro Italiano Ricerche Aerospaziali (CIRA)

Laboratorio di Realtà Virtuale e Simulazione (V-Lab), University of Bologna, Polo di Forlì


[1] P. Azzari, L. Di Stefano, F. Tombari, S. Mattoccia, "Markerless augmented reality using image mosaics", International Conference on Image and Signal Processing - ICISP 2008, July 1-3, 2008, Cherbourg-Octeville, Normandy, France.
[2] S. Salti, L. Di Stefano, "SVR-based jitter reduction for markerless Augmented Reality", submitted to International Conference on Image Analysis and Processing - ICIAP 2009.