VLAD: vector of locally aggregated descriptors

Assuming the local descriptor to be $d$-dimensional, the dimension $D$ of our representation is $D = k \times d$. In the following, we represent the descriptor by $v_{i,j}$, where the indices $i = 1 \dots k$ and $j = 1 \dots d$ respectively index the visual word and the local descriptor component. Hence, a component of $v$ is obtained as a sum over all the image descriptors:


$$v_{i,j} = \sum_{x \,:\, \mathrm{NN}(x) = c_i} \left( x_j - c_{i,j} \right) \qquad (1)$$

where $x_j$ and $c_{i,j}$ respectively denote the $j$-th component of the descriptor $x$ considered and of its corresponding visual word $c_i$. The vector $v$ is subsequently L2-normalized by $v := v / \|v\|_2$.

Experimental results show that excellent results can be obtained even with a relatively small number of visual words $k$: we consider values ranging from $k = 16$ to $k = 256$.
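
As a minimal sketch of the aggregation described above, the following pure-Python function assigns each descriptor to its nearest visual word, accumulates the residuals per word, and L2-normalizes the result. The function name `vlad`, the list-of-lists inputs, and the use of squared Euclidean distance for the nearest-word assignment are assumptions made for illustration; the codebook is assumed to have been learned beforehand (for example with k-means).

```python
import math

def vlad(descriptors, centroids):
    """Aggregate d-dimensional descriptors into an L2-normalized vector of size k*d."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Index of the nearest visual word (squared Euclidean distance, assumed).
        i = min(range(k),
                key=lambda c: sum((x[j] - centroids[c][j]) ** 2 for j in range(d)))
        # Accumulate the residual x - c_i into block i.
        for j in range(d):
            v[i][j] += x[j] - centroids[i][j]
    flat = [value for block in v for value in block]
    norm = math.sqrt(sum(value * value for value in flat))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [value / norm for value in flat] if norm > 0 else flat
```

With k = 2 visual words and d = 2, the output has D = k × d = 4 components, one block of d residual sums per visual word.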

Extra-Trees splitting algorithm (for numerical attributes)

Split_a_node(S)
Input: the local learning subset S corresponding to the node we want to split
Output: a split [a < a_c] or nothing
– If Stop_split(S) is TRUE, then return nothing.
– Otherwise, select K attributes {a_1, …, a_K} among all non-constant (in S) candidate attributes;
– Draw K splits {s_1, …, s_K}, where s_i = Pick_a_random_split(S, a_i), ∀i = 1, …, K;
– Return a split s∗ such that Score(s∗, S) = max_{i=1,…,K} Score(s_i, S).

Pick_a_random_split(S, a)
Inputs: a subset S and an attribute a
Output: a split
– Let a^S_max and a^S_min denote the maximal and minimal value of a in S;
– Draw a random cut-point a_c uniformly in [a^S_min, a^S_max];
– Return the split [a < a_c].

Stop_split(S)
Input: a subset S
Output: a boolean
– If |S| < n_min, then return TRUE;
– If all attributes are constant in S, then return TRUE;
– If the output is constant in S, then return TRUE;
– Otherwise, return FALSE.
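
The three procedures above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: samples are assumed to be `(attributes, output)` pairs, a split is represented as an `(attribute index, cut-point)` tuple, and `score` is assumed to be the relative variance reduction of the output (a regression-style score), since the notes do not define Score.

```python
import random

def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def score(split, samples):
    """Assumed score: relative variance reduction achieved by the split [a < cut]."""
    a, cut = split
    left = [y for x, y in samples if x[a] < cut]
    right = [y for x, y in samples if x[a] >= cut]
    total = variance([y for _, y in samples])
    if total == 0.0:
        return 0.0
    n = len(samples)
    return (total - len(left) / n * variance(left)
            - len(right) / n * variance(right)) / total

def stop_split(samples, n_min):
    if len(samples) < n_min:
        return True
    n_attr = len(samples[0][0])
    if all(len({x[a] for x, _ in samples}) == 1 for a in range(n_attr)):
        return True  # all attributes are constant in S
    if len({y for _, y in samples}) == 1:
        return True  # the output is constant in S
    return False

def pick_a_random_split(samples, a, rng):
    values = [x[a] for x, _ in samples]
    # Cut-point drawn uniformly between the minimal and maximal value of a in S.
    return a, rng.uniform(min(values), max(values))

def split_a_node(samples, K, n_min, rng):
    if stop_split(samples, n_min):
        return None
    n_attr = len(samples[0][0])
    candidates = [a for a in range(n_attr) if len({x[a] for x, _ in samples}) > 1]
    chosen = rng.sample(candidates, min(K, len(candidates)))
    splits = [pick_a_random_split(samples, a, rng) for a in chosen]
    return max(splits, key=lambda s: score(s, samples))
```

Passing a seeded `random.Random` instance as `rng` keeps the drawn cut-points reproducible between runs.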

Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
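
The aggregation step can be sketched as below. Only the final vote or average is shown; tree construction is left out. Representing each trained tree as a plain Python callable, and the names `forest_classify` and `forest_regress`, are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, x):
    """Return the mode of the classes predicted by the individual trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

def forest_regress(trees, x):
    """Return the mean of the predictions of the individual trees."""
    return mean(tree(x) for tree in trees)
```

Any object that maps an input to a prediction can stand in for a tree here, which keeps the sketch independent of how the trees were grown.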

Camera Calibration

Camera calibration is the estimation of the internal (intrinsic) parameters of a camera. It is an important step for correcting optical distortion artifacts. Equation (1) shows the intrinsic matrix (also called the camera matrix), which contains 5 intrinsic parameters: $c_x$ and $c_y$ are the coordinates of the principal point, which would ideally be at the image center; $\alpha_x$ and $\alpha_y$ represent the focal length; and $s$ is the skew coefficient. Intrinsic parameters are specific to a camera, so once calculated, they can be stored for future use. There are many approaches to calibrating a camera. One of them is to take a number of images of a planar pattern with the targeted camera from different distances and points of view. The pattern used in our case is a chessboard.

$$K = \begin{pmatrix} \alpha_x & s & c_x \\ 0 & \alpha_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \qquad (1)$$
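
To make the role of the five intrinsic parameters concrete, here is a small sketch that builds the intrinsic matrix and uses it to map a 3D point, already expressed in camera coordinates, to pixel coordinates. The function names and the numeric values in the usage lines are made up for illustration; estimating the parameters from chessboard images is not shown.

```python
def intrinsic_matrix(alpha_x, alpha_y, s, c_x, c_y):
    """3x3 intrinsic (camera) matrix built from the five intrinsic parameters."""
    return [[alpha_x, s, c_x],
            [0.0, alpha_y, c_y],
            [0.0, 0.0, 1.0]]

def project_camera_point(K, point):
    """Project a 3D point given in camera coordinates to pixel coordinates."""
    # Homogeneous image point: K multiplied by the column vector (X, Y, Z).
    u, v, w = (sum(K[r][c] * point[c] for c in range(3)) for r in range(3))
    # Divide by the third homogeneous coordinate to get pixels.
    return u / w, v / w

# Hypothetical values: 800-pixel focal length, no skew, 640x480 image.
K = intrinsic_matrix(800.0, 800.0, 0.0, 320.0, 240.0)
pixel = project_camera_point(K, (0.0, 0.0, 2.0))
```

A point on the optical axis, such as the one in the usage lines, lands on the principal point regardless of its depth.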

Camera Pose Estimation

Camera pose estimation is the problem of determining the geometric transformation between the world coordinate system and the camera coordinate system. This transformation is represented by a 3 × 4 matrix, consisting of a 3 × 3 rotation matrix and a translation vector, as shown in Eq. (2).

$$[R \mid t] = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix} \qquad (2)$$

The rotation and translation are called the external (extrinsic) camera parameters. Pose estimation is obtained from 2D-3D correspondences using a solution to the Perspective-n-Point (PnP) problem: the estimation of the pose of a calibrated camera, given n (n ≥ 3) 3D reference points in the object reference frame and their corresponding 2D projections. This estimation should be accurate enough for correct augmentation.

The projection of a 3D point in the world coordinate system to a 2D point in the image coordinate system is obtained by Eq. (3). Both points are expressed in homogeneous coordinates according to a calibrated pinhole camera model.

[Eq. (3): screenshot 2016-05-27 3.04.12 PM]
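
As a hedged illustration of the pinhole projection in homogeneous coordinates, the sketch below stacks a rotation matrix and a translation vector into a 3 × 4 extrinsic matrix, multiplies it by an intrinsic matrix, and applies the result to a world point. The helper names and all numeric values are assumptions for illustration, and the standard pinhole form used here may differ in notation from the equation in the original source. Solving the PnP problem itself (recovering the pose from correspondences) is not shown.

```python
def extrinsic_matrix(R, t):
    """Stack a 3x3 rotation R and a translation t into a 3x4 matrix [R | t]."""
    return [list(R[r]) + [t[r]] for r in range(3)]

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def project_point(K, Rt, world_point):
    """Project a 3D world point to pixel coordinates with P = K [R | t]."""
    P = matmul(K, Rt)                       # 3x4 projection matrix
    Xh = list(world_point) + [1.0]          # homogeneous world point
    x = [sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    return x[0] / x[2], x[1] / x[2]         # back from homogeneous to pixels

# Hypothetical camera: identity rotation, world origin 5 units in front of it.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rt = extrinsic_matrix(R, [0.0, 0.0, 5.0])
pixel = project_point(K, Rt, (1.0, 0.0, 0.0))
```

A PnP solver works in the opposite direction: given several such world points and their observed pixels, it searches for the R and t that reproduce the observations.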

Edge Histogram Descriptor

The EHD represents the distribution of 5 types of edges in each local area, called a sub-image. As shown in Fig. 1, the sub-images are defined by dividing the image space into 4×4 non-overlapping blocks. Thus, the image partition always yields 16 equal-sized sub-images regardless of the size of the original image. To characterize a sub-image, we then generate a histogram of the edge distribution for that sub-image. Edges in the sub-images are categorized into 5 types: vertical, horizontal, 45-degree diagonal, 135-degree diagonal, and non-directional edges (Fig. 2).

[Fig. 1: screenshot 2016-05-27 1.53.58 PM]

[Fig. 2: screenshot 2016-05-27 1.54.06 PM]

Thus, the histogram for each sub-image represents the relative frequency of occurrence of the 5 types of edges in the corresponding sub-image. As a result, as shown in Fig. 3, each local histogram contains 5 bins. Each bin corresponds to one of the 5 edge types. Since there are 16 sub-images in the image, a total of 5×16=80 histogram bins is required (Fig.

[Fig. 3: screenshot 2016-05-27 1.54.15 PM]

[Screenshot 2016-05-27 1.54.21 PM]

Note that each of the 80 histogram bins has its own semantics in terms of location and edge type. For example, the bin for the horizontal edge type in the sub-image located at (0,0) in Fig. 1 carries the information of the relative population of the horizontal edges in the top-left local region of the image.
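
The bin layout can be sketched as follows. The input is assumed to be a 2D grid of edge-type labels, one per small image block, produced by some edge classifier that is not shown here; `None` marks a block with no edge. The ordering of `EDGE_TYPES`, the row-major bin indexing, and the normalization by the number of blocks in each sub-image are assumptions for illustration.

```python
EDGE_TYPES = ["vertical", "horizontal", "diagonal_45",
              "diagonal_135", "non_directional"]

def edge_histogram(labels):
    """Build the 80-bin histogram: 16 sub-images x 5 edge types."""
    rows, cols = len(labels), len(labels[0])
    counts = [0.0] * 80
    totals = [0] * 16
    for r in range(rows):
        for c in range(cols):
            # Sub-image index in the 4x4 partition, row-major, 0..15.
            sub = (r * 4 // rows) * 4 + (c * 4 // cols)
            totals[sub] += 1
            if labels[r][c] is not None:
                counts[sub * 5 + EDGE_TYPES.index(labels[r][c])] += 1
    # Relative frequency within each sub-image (assumed normalization).
    return [counts[b] / totals[b // 5] if totals[b // 5] else 0.0
            for b in range(80)]
```

With this indexing, bin `sub * 5 + t` holds edge type `t` of sub-image `sub`, so the horizontal-edge bin of the top-left sub-image (0,0) is index 1.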


RANSAC

Random Sample Consensus (RANSAC) is a paradigm for fitting a model to experimental data. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. RANSAC is applied to the Location Determination Problem (LDP): given an image depicting a set of landmarks with known locations, determine the point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing
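
To illustrate the idea on a much simpler model than the LDP, the sketch below fits a 2D line y = m·x + b with RANSAC: it repeatedly draws a minimal sample of two points, fits the line through them, and keeps the model with the largest consensus set. The function name, the iteration count, the inlier threshold, and the fixed seed are all assumptions chosen for the example.

```python
import random

def ransac_line(points, iterations=200, threshold=0.1, seed=0):
    """Fit y = m*x + b to points containing gross errors; returns (model, inliers)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        # Minimal sample: two points define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical line cannot be written as y = m*x + b
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        # Consensus set: points within the vertical-distance threshold.
        inliers = [(x, y) for x, y in points if abs(y - (m * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (m, b), inliers
    return best_model, best_inliers
```

The sketch stops at the best minimal-sample model; a final least-squares refit on the consensus set is a common follow-up step and is omitted here.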


Iterative Closest Points (ICP)

The iterative closest points (ICP) algorithm, which is based on least-squares minimization, has been widely used for aligning range images. However, conventional least-squares approaches are subject to bias in the presence of outliers.
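
A minimal 2D point-to-point version of the conventional least-squares ICP loop is sketched below, as an illustration rather than a range-image implementation. Each iteration matches every source point to its nearest target point and then applies the rigid motion that minimizes the sum of squared distances between the pairs. The function names, the brute-force nearest-neighbour search, and the fixed iteration count are assumptions.

```python
import math

def nearest(p, points):
    """Closest point to p in `points` (brute force)."""
    return min(points, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def best_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping paired src onto dst."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    # Sums over the centred pairs; the optimal angle is atan2(b, a).
    a = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
            for p, q in zip(src, dst))
    b = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
            for p, q in zip(src, dst))
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def mean_sq_error(src, dst):
    """Mean squared distance from each src point to its nearest dst point."""
    total = 0.0
    for p in src:
        q = nearest(p, dst)
        total += (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return total / len(src)

def icp_2d(source, target, iterations=20):
    current = list(source)
    for _ in range(iterations):
        matches = [nearest(p, target) for p in current]
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current
```

Every matched pair contributes equally to the squared-error sum in `best_rigid_2d`, which is why a few wrong matches or outlying points can pull the estimated motion away from the correct alignment.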