The ability to capture depth information from a scene has greatly increased in
recent years. 3D sensors, traditionally high-cost and low-resolution devices, are
being democratized, and 3D scans of indoor and outdoor scenes are becoming
increasingly common.
However, there is still a large gap between the amount of data captured with 2D
and 3D sensors. Although 3D sensors provide more information about the scene, 2D
sensors remain more accessible and more widely used. This trade-off between
availability and information leads to a multimodal scenario of mixed 2D and 3D
data.
This thesis explores the fundamental building block of this multimodal scenario:
the registration between a single 2D image and a single unorganized point cloud.
An unorganized 3D point cloud is the most basic representation of a 3D capture:
the surveyed points are described only by their real-world coordinates and,
optionally, by their colour information. This minimal representation poses
multiple challenges for the registration, since most state-of-the-art works rely
on metadata about the scene or on prior knowledge.
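As a minimal sketch of this representation (the array layout, variable names, and values below are illustrative assumptions, not the thesis's actual data structures), an unorganized point cloud can be stored as a plain N x 3 coordinate array with optional per-point colour and nothing else:

import numpy as np

# A minimal sketch of an unorganized point cloud: no grid structure,
# no sensor metadata, just surveyed points in real-world coordinates.
xyz = np.array([
    [1.20, 0.35, 2.10],   # x, y, z in metres (world frame)
    [1.22, 0.40, 2.08],
    [0.98, 0.31, 2.33],
    [1.05, 0.50, 2.25],
])

# Optional per-point colour (RGB in [0, 255]); it may be absent entirely.
rgb = np.array([
    [120, 110, 100],
    [122, 111, 102],
    [ 90,  95,  88],
    [101, 100,  99],
], dtype=np.uint8)

cloud = {"xyz": xyz, "rgb": rgb}   # hypothetical container
print(cloud["xyz"].shape)          # (4, 3): N unordered points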
Two different techniques are explored to perform the registration: a keypoint-based
technique and an edge-based technique. The keypoint-based technique estimates the
transformation by means of correspondences detected using Deep Learning, whilst the
edge-based technique refines a transformation using multimodal edge detection to
establish anchor points for the estimation.
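As an illustrative sketch of how a rigid transformation can be estimated once 2D-3D correspondences are available (the correspondences, camera intrinsics, and parameters below are made-up placeholders, and the thesis's own pipeline may differ), a robust Perspective-n-Point solver such as OpenCV's solvePnPRansac could be applied:

import numpy as np
import cv2

# Hypothetical output of a correspondence detector: N matched pairs of
# 3D points (from the point cloud) and 2D pixels (from the image).
pts_3d = (np.random.rand(50, 3) * 5.0).astype(np.float32)
pts_2d = (np.random.rand(50, 2) * [640, 480]).astype(np.float32)

# Assumed pinhole intrinsics (focal length and principal point are made up).
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]], dtype=np.float32)

# Robust pose estimation (PnP + RANSAC): rotation and translation mapping
# world coordinates into the camera frame, tolerating outlier matches.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, distCoeffs=None, reprojectionError=8.0)

if ok and inliers is not None:
    R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation from the axis-angle vector
    print("R:\n", R, "\nt:", tvec.ravel(), "\ninliers:", len(inliers))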
An extensive evaluation of the proposed methodologies is performed. Although
further research is needed to achieve adequate performance, the obtained results
show the potential of deep learning techniques for learning 2D and 3D similarities.
The results also show the good performance of the proposed 2D-3D iterative
refinement, which is comparable to the state of the art in 3D-3D registration.