Space - Use of GNSS-based positioning for road Intelligent Transport System (ITS) - Mathematical PVT error model

This document is written in the frame of WP1.3 of GP-START project. It discusses several models to provide synthetic data for PVT tracks and the ways to analyse and compare the tracks to ensure these are similar to the reality.

Mathematisches PVT-Fehlermodell

Espace - Utilisation du positionnement GNSS pour les systèmes de transport routier intelligents (ITS) - Modèle d’erreur mathématique PVT

Vesolje - Ugotavljanje položaja z uporabo GNSS za cestne inteligentne transportne sisteme (ITS) - Matematični model za napake PVT

General Information

Status
Published
Public Enquiry End Date
06-Nov-2019
Publication Date
16-Feb-2020
Technical Committee
Current Stage
6060 - National Implementation/Publication (Adopted Project)
Start Date
13-Feb-2020
Due Date
19-Apr-2020
Completion Date
17-Feb-2020
Technical report
SIST-TP CEN/TR 17447:2020 - BARVE
English language
59 pages
sale 10% off
Preview
sale 10% off
Preview
e-Library read for
1 day

Standards Content (Sample)


SLOVENSKI STANDARD
01-april-2020
Vesolje - Ugotavljanje položaja z uporabo GNSS za cestne inteligentne transportne
sisteme (ITS) - Matematični model za napake PVT
Space - Use of GNSS-based positioning for road Intelligent Transport System (ITS) -
Mathematical PVT error model
Mathematisches PVT-Fehlermodell
Espace - Utilisation du positionnement GNSS pour les systèmes de transport routier
intelligents (ITS) - Modèle d’erreur mathématique PVT
Ta slovenski standard je istoveten z: CEN/TR 17447:2020
ICS:
03.220.20 Cestni transport Road transport
33.060.30 Radiorelejni in fiksni satelitski Radio relay and fixed satellite
komunikacijski sistemi communications systems
35.240.60 Uporabniške rešitve IT v IT applications in transport
prometu
2003-01.Slovenski inštitut za standardizacijo. Razmnoževanje celote ali delov tega standarda ni dovoljeno.

TECHNICAL REPORT
CEN/TR 17447
RAPPORT TECHNIQUE
TECHNISCHER BERICHT
February 2020
ICS 03.220.20; 33.060.30; 35.240.60

English version
Space - Use of GNSS-based positioning for road Intelligent
Transport System (ITS) - Mathematical PVT error model
Espace - Utilisation du positionnement GNSS pour les Mathematisches PVT-Fehlermodell
systèmes de transport routier intelligents (ITS) -
Modèle d'erreur mathématique PVT

This Technical Report was approved by CEN on 8 December 2019. It has been drawn up by the Technical Committee
CEN/CLC/JTC 5.
CEN and CENELEC members are the national standards bodies and national electrotechnical committees of Austria, Belgium,
Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy,
Latvia, Lithuania, Luxembourg, Malta, Netherlands, Norway, Poland, Portugal, Republic of North Macedonia, Romania, Serbia,
Slovakia, Slovenia, Spain, Sweden, Switzerland, Turkey and United Kingdom.

CEN-CENELEC Management Centre:
Rue de la Science 23, B-1040 Brussels
© 2020 CEN/CENELEC All rights of exploitation in any form and by any means Ref. No. CEN/TR 17447:2020 E
reserved worldwide for CEN national Members and for
CENELEC Members.
Contents Page
European foreword . 3
1 Scope . 4
2 Normative references . 4
3 Terms and definitions . 4
4 List of acronyms . 4
5 Approach . 5
6 Input data . 7
7 Model . 10
7.1 Introduction . 10
7.2 Discussion about models . 11
7.3 AR model . 16
8 Analysis tools . 29
8.1 General . 29
8.2 State of the art on data set similarity assessment . 30
8.3 Validation tool description . 34
9 Conclusions . 48
Annex A (normative) Neural network . 50
A.1 Neuron network concept . 50
Bibliography . 59

European foreword
This document (CEN/TR 17447:2020) has been prepared by Technical Committee CEN-CENELEC/TC 5
“Space”, the secretariat of which is held by DIN.
Attention is drawn to the possibility that some of the elements of this document may be the subject of
patent rights. CEN shall not be held responsible for identifying any or all such patent rights.
1 Scope
This document is written in the frame of WP1.3 of GP-START project. It discusses several models to
provide synthetic data for PVT tracks and the ways to analyse and compare the tracks to ensure these are
similar to the reality.
2 Normative references
The following documents are referred to in the text in such a way that some or all of their content
constitutes requirements of this document. For dated references, only the edition cited applies. For
undated references, the latest edition of the referenced document (including any amendments) applies.
EN 16803-1:2016, Space — Use of GNSS-based positioning for road Intelligent Transport Systems (ITS) —
Part 1: Definitions and system engineering procedures for the establishment and assessment of
performances
3 Terms and definitions
For the purposes of this document, the terms and definitions given in EN 16803-1:2016 apply.
ISO and IEC maintain terminological databases for use in standardization at the following addresses:
• ISO Online browsing platform: available at http://www.iso.org/obp
• IEC Electropedia: available at http://www.electropedia.org/
4 List of acronyms
ANOVA Analysis of variance
AR Autoregressive
ARMA Autoregressive moving average
CDF Cumulated distribution function
CET Central european time
DFT Direct Fourier transform
DOP Dilution of precision
FFT Fast Fourier transform
GNSS Global navigation satellite system
GPS Global positioning system
HDOP Horizontal dilution of precision
HPE Horizontal position error
IGS International GNSS service
ITS Intelligent transport systems
KS Kolmogorov–Smirnov
MFNN Multilayer feedforward neural networks
NED Northeast down
NN Neural network
OBU On board unit
PVT Position velocity time
SBAS Satellite based augmentation system
SVM Support vector machine
UTC Universal time coordiante
VDOP Vertical dilution of precision
WGS84 World geodetic system 1984
WP Work package
5 Approach
The objective of the WP1.3 model the PVT tracks similar enough to the reality to be used for application
validation. The approach used is to twofold: to find a model and to build the tools to compare the
simulated data to the real data.
The following figure shows the overall logic of the work to be undertaken in WP 1.3:

Figure 1 — WP1.3 logic
The blue boxes represent the raw data that can be stored in files. The red boxes are the computations that
need to be carried out to get new data. Finally, the green boxes correspond to the outputs of this work
package that need to be documented.
In some more detail:
— Reference data set: These are the reference trajectories computed with extra sensors and is expected
to have errors at centimetric level or better.
— Real data sets: These are the trajectories computed by the receiver under test.
— Error calculation: The error calculation computes the difference between the position computed by
the receiver and the one from the reference trajectory. Note that this step will be used to define the
parameters to be modelled. For instance in the Ifsttar paper [RD2], the model was based on the
absolute error and its angle. But one could use the position in XYZ (WGS84 coordinates). Since the
behaviour in the vertical axis is usually different, one could use a NED reference system or change
the reference to an axis in the direction of the movement another across the direction of the
movement and the third would be the vertical.
— Real error files: These files should contain the errors for each receiver trajectory.
— Error models: The error models will be described in the documentation and should depict the error
of the position depending on the previous errors, random variables (which can follow any
distribution) and parameters of the distributions.
— it algorithm: The fit algorithm should calculate the parameters that define the models.
— Model parameters: Storage of the models parameters.
— Synthetic error generation: Once the parameters are found, by starting the random variables with
different seed, one will get synthetic errors that should be equivalent to the real ones.
— Error analysis: As can be on the figure, the error analysis box appears twice: for the real errors and
for the synthetic ones. These analysis will be compared to validate the models. So far, the analysis
used have been the CDF (cumulative distribution function) and autocorrelation. As far as the CDF will
be representative for the error distribution in the position domain, in the time domain,
autocorrelation may not be enough to observe the behaviour of the error. It is suggested to compute
the DFT (discrete Fourier transform) to analyse the error in the frequency domain and observe if
there are errors at particular frequencies. Autocorrelation analysis will be kept for comparison with
previous models.
— Comparison: the Error analysis of real and synthetic data are compared to validate the models.
Synthetic data generator: this step adds the synthetic errors with the reference trajectory to create a new
data set that can be used for the sensitivity analysis.
6 Input data
The input data to observe and models has been provided by Q-Free. It contains a header and then the
data with each column separated by a space. The header contains the list of fields
# SAVE Position Export
#
# Equipment:  1, seagull_datalogger_1
# Export Date: Tue Dec 13 13:58:20 CET 2016
#
# Space separated field list
#
# Equipment id (6 is reference)
# Drive id
# Leg id
# Date yyyy-mm-dd (UTC)
# Time hh:mm:ss.fff (UTC)
# Gps week
# Gps time of week, seconds
# OBU Timestamp, seconds
# GNSS Good flag (1 is good, 0 is bad)
# GNSS Fix Type
# Latitude
# Longitude
# Ellipsode height (meters)
# Geoid height, height over sea level (meters)
# Speed over ground (m/s)
# Course over ground (deg)
# Yaw (deg/second)
# Pitch (deg/second)
# Roll (deg/second)
# Horizontal pos error (meter)
# Vertical pos error (meter)
# sN, st.dev of position, in north direction (meters)
# sE, st.dev of position, in east direction (meters)
# sD, st.dev of position, in down direction (meters)
# HDOP
# PDOP
# NED n speed (m/s)
# NED e speed (m/s)
# NED d speed (m/s)
# Number of satellites used in solution
# Average signal quality
# SBAS Active
# rN integrity, north (m)
# rE integrity, east (m)
# rD integrity, down (m)
#
To do the analyses and modelling of the PVT error the outputs computed by the receiver in a specific
proprietary way are not used. For instance, the horizontal and vertical position errors are calculated by
the receiver but one does not know how. Therefore, all the positions in latitude, longitude and height are
compared to the receiver #6 which is the reference. The other fields used are:
— Equipment Id and leg to separate the data sets;
— GNSS good flag and type fix to use only the positions with 4 (four) or more satellites that do not
use additional sensors (i.e.: type of fix 4 or 6).
For more details, the GNSS fix types are:
— NO_FIX = 0, → red
— ONE_SV = 1, → purple
— TWO_SV = 2, → blue
— THREE_SV = 3, → orange
— FOUR_PLUS_SV = 4, → green
— LS_2D = 5, → cyan
— LS_3D = 6, → dark green
— DEAD_RECKONING = 7, → yellow
— PSR_DIFF = 8, → brown
— SENSOR_FUSION = 9 → dark purple
The following figure shows the behaviour one can expect of a receiver (Reference is black) with a tunnel
in the middle of the path:
Figure 2 — Oslo input data plotted
Some other issues have been observed in the input data. The following figure shows the histograms of
the error a data set from Frankfurt. The 6 (six) plots show the same data with different histogram bin
widths:
Figure 3 — Frankfurt errors histograms
As it will be seen in the following clauses, the chosen model tends to a Gaussian distribution for data sets
big enough. However, the errors observed in Frankfurt data set, are not likely to follow a normal
distribution. When one observes the errors in the time domain, one can see that the error distribution
changes during the test. This is, the test starts in small error environment (an almost open-sky) and ends
in a great error environment (urban canyon).

Figure 4 — Frankfurt errors vs. time
7 Model
7.1 Introduction
Clause 7 presents the model chosen to create synthetic data representative of real PVT environments.
The first subclause discusses the previous model made by Ifsttar and Q-Free and the ARMA models
proposed in this document.
The second subclause presents the AR model chosen to fit the real data and the way its parameters are
calculated.
7.2 Discussion about models
7.2.1 Laplace-steps and Cauchy-random walk
A previous model discussed is the one by Ifsttar and Q-Free that have already worked on the data sets to
be studied (See RD17). The following analysis intends to analyse which are the error characteristics to be
simulated and the limits of the model.
The model is twofold: on the one hand the absolute value of the horizontal error is calculated and on the
other the angle between the reference heading and the error direction. The radius (absolute error) is
calculated as follows:
 
R n with probability 1−p
( )
 
  
Rn 1 { + C n
r
  
L with probability p
r
p is the probability of a Bernoulli trial that in case of failure keeps the radius and in case of success will
draw a new radius from a Laplace distribution (positive values only) plus a Cauchy noise that accumulates
every time a step takes place.
NOTE On the radius calculation:
— Depending on the probability of the Bernoulli trial one can observe arcs of circle which is not a GNSS
positioning known behaviour.
— The probability of the Bernoulli trial allows mastering the correlation of the process since one can
manage the proportion of strongly correlated samples and those which are not.
— The Laplace distribution allows managing the mean and variance of the error.
— When one adds a cumulative error, the variances add up: This means that the cumulative error will
tend to infinity with time. On top of that the Cauchy distribution has undefined variance. So in the model a
minimum and maximum values need to be identified to avoid divergence.
Finally the error is filtered by an autoregressive filter.
The angle is calculated with a random walk:
   
∅+n 1 = ∅ nS+ n
   ∅
The noise chosen is a Cauchy distribution truncated with a min and a max values similarly to the Cauchy
distribution for the radius.
7.2.2 ARMA models
7.2.2.1 Introduction
The previous subclause has already discussed about an autoregressive filter. The idea of this subclause is
to present the characteristics of these models without the complex inputs (Laplace, Cauchy, random
walk…) of the previous subclause.
To do so, the input noise of the processes shown below will be a normal noise and the differences in the
outputs will entirely be due to the ARMA filter.
+=
7.2.2.2 Definition
ARMA stands for autoregressive moving average. The autoregressive part means that it depends on the
previous values of the process and a weighting of the previous inputs.
An ARMA process can be defined as follows:
X a X +… + b N + bN + bN +…
nn1 −1 0 n 11n− 22n−
where
X can be the position in 1 (one) axis or the radius of the error;
N is a noise;
a and b are real numbers;
n is the index of the series.
One can use the Z transform to obtain:
Bz
( )
X= N
Az
( )
where
A and B are the polynomials with a and b coefficients;
The zeroes of B are the points at which there is no frequency response;
The zeroes of A are the points at which there are peaks in the response.
As a brief reminder, the roots of A need to be in the unit circle for the system to be stable. The closest to
the unit circle the bigger the impact of the root on the output.
7.2.2.3 Examples
7.2.2.3.1 General
Subclause 7.2.2.3 shows a few examples of the outputs that can be obtained with an ARMA process.
7.2.2.3.2 Gaussian noise
A Gaussian noise is hardly an ARMA process since it does not depend on the past information, but it is
important to see the response of a system when there is no filtering yet.
Its formula is the following:
X = bN
nn0
The outputs shown are the direct Fourier transform (DFT) and the time series. These two plots should
illustrate the outputs of the model.
=
Figure 5 — DFT of a white noise
It is important to see that there is no specific resonance point: all the frequencies react just the same. Note
that the mean is not centred at zero. The frequencies observed go up to 0,5 Hz since the GNSS positions
are updated every second.
In the time domain, one observes the normal distribution behaviour.

Figure 6 — White noise against time
7.2.2.3.3 AR process
For an AR process, there is no mean of the previous results. The solution depends only on the current
input and some previous results. This first example is the first order AR, so the output depends only on
the previous term.
X a X+ bN
nn1 −1 0 n1
Figure 7 — DFT of an AR process
One can observe that there is a peak close to the zero. This means that there is a root of the A polynomial
close to the unit circle. Indeed the root is (z-a) with 0 < a < 1. When a is close to 1, X is close to X
n n−1
with an added noise. This means that the correlation is strong between close samples. In the example
provided a is 0,8 and b is 1.
1 0
In the time domain, it can be observed that the samples change in some correlated way:
=
Figure 8 — AR process against time
7.2.2.3.4 Peak example
The peak example is meant to show the behaviour of the DFT when the A polynomial contains complex
roots. To do so, one needs at least a second order AR process.
X= aX++aX b N
nn11−−22n 0 n
Figure 9 — DFT of AR process with 2 (two) poles
Figure 10 — AR process with 2 (two) poles against time
In the example provided, a is −1,29, a is 0,64 and b is 1.
1 2 0
7.3 AR model
7.3.1 Model choice
The model chosen is an autoregressive one because it reduces the number of parameters needed for the
model and it can fit the data set errors calculated from a DFT point of view. The following figures show
the DFT transforms observed in different environments:
Figure 11 — Frankfurt across error DFT

Figure 12 — Oslo across error DFT
Figure 13 — Trondheim across error DFT

Figure 14 — Troms across error DFT
The previous figures show a behaviour similar to the AR model with a peak close to the zero that goes
down asymptotically to a certain value.
7.3.2 Model parameters
7.3.2.1 Regression parameters
The AR equation is the following:

X a X+ bN
nn1 −1 0 n
The previous expression shows a model based on 2 (two) parts:
— aX : which represents the correlation (it makes one error depend on the previous one);
11n−
— bN : a random noise.
0 n
In order to figure the shape of the DFT one needs to find a . The equation for all N is:
   
   
XX N
2 1 1
   
X a X+ bN
   
3 12 0 2
 
   
  XN 
X
N NN−−11
   
One can remove the error term and solve the over-determined system with the mean-squares estimator:
−1
 
 
 
 
XX
 
12
a=XX……X X XX X X
( )  ( )
1 1 2 NN−−1 2 12 1 3
 
  

  
X X
N−1 N
 
 
N−1
xx
ii+1

i=1
a =
N−1
x
i

i=1
Check that a < 1 .
7.3.2.2 Noise parameter
7.3.2.2.1 Gaussian constant Noise parameter
Once the a has been figured out, one can calculate b . To do so, one needs to apply some basic
1 0
knowledge about GNSS positioning: the error will not go to infinity and it will not converge to a certain
number. From a statistics point of view, this means that the variance is constant.
Starting with the AR equation:
X a X+ bN
nn1 −1 0 n
         
var X var a X+ b N var a X+ var b N a var X+ b var N
 nn 11−−0 nn 11  0 n 1  n−1 0  n
The variance of a normal standard distribution is 1. If we want the variance to remain constant:
   
var X var X var X
nn−1
   
==
= = =
=
=
=
 being the variance of the process regardless of the time.
var X

 
var X a var X+ b
 
b

var X =

1− a
In our case the calculation is made the other way round: we have the process variance that is calculated
from the real data, and one needs to calculate the model. Therefore:

b var X 1− a
( )
01
The noise parameter has been computed in 3 (three) cases:
— all the data;
— 99 % of the data (excluding the 1 % of biggest errors);
— 95 % of the data (excluding the 5 % of biggest errors).
The results provided by validation tool (see Clause 8 Analysis tools) are the following:
— all the data: 0 synthetic data sets are considered not similar to the real ones;
— 99 % of the data: 56,9 % not similar;
— 95 % of the data: 42,3 % not similar.
Tests have been performed with Oslo data sets which are representative of the urban environment.
7.3.2.2.2 Gaussian sliding noise parameter
From the previous subclause, it is understood that the variance remains constant through all the test.
However, it is observed that this is not the case in real data. Therefore, the model has been modified to
take into account the changing variance of the GNSS error.
n−1
ˆ

var X X− X
( )
n ∑ i
in−50
Indeed, the variance at each moment is calculated taking into account the last 50 values of the error.
Of course, b needs to be calculated each second of the data set and the formula remains almost the same:

b var X 1− a
( )
01nn
One of the main issues regarding the conception of a model is that one has to define the reality to model.
It is usually characterized by a type of receiver configuration and an environment. As far as the receiver
is easily identified, the environment is not. One of the parameters that describes this different
environments is the variance observed in each of those. Having a moving variance allows accounting for
the changes of environment end be representative of all the environments of the data set. Note that
although the scenarios have categories such as “urban”, it is impossible to guarantee that the buildings
around will always have the same height or be made out of the same material.
=
=
=
=
=
With this sliding noise, the results provided by the analysis tool (see Clause 8 Analysis tools) are that only
7,3 % of the data sets generated are not similar to real ones they should emulate.
The problem of this model is that one needs data sets to build the moving variance instead of having a
defined distribution of a set of parameters. This makes the model very deterministic, since one will
always have the peaks at the same times. As this model is not kept, no further tests have been done with
variances taking into account more or less than 50 error values.
7.3.2.2.3 Experimental distribution
As mentioned previously, the model equation is:
X a X+ bN
nn1 −1 0 n
Then
N−1
xx
∑ ii+1
i=1
a =
N−1
x
i

i=1
which is not time dependant. So, if we model properly,
We then model bN X− a X bN
0 nn 11n− 0 n
we should have a distribution that resembles the reality at the end. The first approximation is to get the
experimental distribution for the ahead and across axes.
The tool validation results says that only 4,9 % of the synthetic data are not similar to the real data that
it has to emulate.
Tests have been performed with Oslo data sets which are representative of the urban environment.
This is a good result, but it has one issue, when one plots the ahead errors against the across errors.

Figure 15 — Horizontal error
As can be seen on the figure, the errors are as big as expected, but they are always on the axes. This is
because the big errors are very rare and it is very improbable to have big error on both axes at the
same time.
=
=
7.3.2.2.4 Experimental distribution with polar reference
7.3.2.2.4.1 Correlation factor for horizontal error
The first approximation is to compute the correlation of the horizontal error. Then one computes the
vector of the non-correlated error. This is, the difference between the horizontal error and the part of the
error due to the previous one. Finally, the cumulated distribution of the norm of non-correlated errors is
computed.
Figure 16 — Horizontal error correlation
To build the synthetic data sets, the opposite operation needs to be computed.
The norm of the error is retrieved randomly from the cumulated distribution and an angle at random
value between zero and 2 π. The vector with that norm and angle is added to the previous error times the
correlation factor (a1 of the AR model) to get the new error vector.
Figure 17 — Error generation with horizontal error correlation
This model provides 65,04 % of non-similar synthetic data sets when compared to the real data sets that
should be similar.
Tests have been performed with Oslo data sets which are representative of the urban environment.
7.3.2.2.4.2 Correlation accounted differently for both axes
The reasoning is similar to the previous one, but in this case the errors are built considering that the
correlation is different for each axis:
Figure 18 — Axis based correlation
Then the error is built accounting for the different correlations as well:

Figure 19 — Error generation with axis correlation
The number of non-similar synthetic data sets is 2,4 % compared to the real ones.
Tests have been performed with Oslo data sets which are representative of the urban environment.
7.3.2.2.4.3 Correlation between radius and delta angle
It has been observed that the delta angle follows some distribution. The delta angle is the angle between
the previous uncorrelated error vector and the current one:
Figure 20 — Error generation with axis correlation and experimental angle distribution
However, this angle does not only follow a distribution, but is also correlated to the radius of the
uncorrelated vector.
Figure 21 — Error size and angle correlation
In the previous figure, the x-axis from 0 to 50 is the radius of the error, the y-axis from −4 to 4 is the delta
angle in radians. On the Z axis the number of events is plotted.
The number of non-similar synthetic data sets is 8,13 % compared to the real ones.
7.3.2.2.4.4 Comparison with all data sets
Since the last 2 (two) models with experimental distribution have proven to be good, further analysis
have been made with the other data sets. The following table shows the number of non-similar data sets
compared to the real data:
Table 1 — Number of non-similar data sets compared to real data
Correlation by
axis and
Horizontal Correlation between radius
correlation by axes and delta angle
Oslo 65,04 % 2,44 % 8,13 %
Frankfurt 80,42 % 16,93 % 48,15 %
Trondheim 47,61 % 0,00 % 0,00 %
In order to better analyse these results, metrics used by the validation tool for specific data set pairs
are plotted:
— Figure 22 shows, for an Oslo data set, pairs classified as “not similar” the autocorrelation functions,
the CDFs and time series respectively for “horizontal correlation” (left column), “correlation by axes”
(centred column) and “correlation by axis and between radius and delta angle” (right column).
— Figure 23 shows, for an Oslo data set, pairs classified as “similar” the autocorrelation functions,
the CDFs and time series respectively for “horizontal correlation” (left column), “correlation by axes”
(centred column) and “correlation by axis and between radius and delta angle” (right column).
On the example of two data sets classified as “not similar” (Figure 22), we can visually observe that the
autocorrelation functions of synthetic data sets do not fit very well the one of the real data set, whatever
the synthetic model used. The same analysis can be made on the CDFs. The percentiles of the synthetic
data set CDFs are significantly different from the real data set. For example, the difference for the
th
99 percentile is between 20 m and 30 m, depending on the considered synthetic model. Morevoer, time
series of real data set has more big errors than time series of synthetic data sets.
At the opposite, on the example of two data sets classified as “similar” (Figure 23), we can visually observe
that the autocorrelation functions of synthetic data sets fit very well the one of the real data set, whatever
the synthetic model used. The CDFs of the “correlation by axes” and “correlation by axis and between
radius and delta angle” model based data sets are very similar to the CDF of the real data set. The CDF of
the “horizontal correlation” model based data sets fit less better the CDF of the real data set. Visually, time
series seems to be similar.
Figure 22 — Metrics and time series of data sets classified as “not similar” for “horizontal correlation” (left column), “correlation by
axes” (centred column) and “correlation by axis and between radius and delta angle” (right column)
Figure 23 — Metrics and time series of data sets classified as “similar” for “horizontal correlation” (left column), “correlation by axes”
(centred column) and “correlation by axis and between radius and delta angle” right column)
7.3.2.3 Mean parameter
The errors observed during the field tests show that errors are biased. The fact that the error mean is not
zero can be explained by several factors:
— The environment: the surrounding buildings that generate multipath do not need to be
symmetrically distributed
— Since the signal travels at the speed of light but cannot travel faster, most errors are delays of the
signal, i.e.: always in the same direction. The distance between satellite and receiver tends to be
overestimated.
— GNSS constellations are not uniformly distributed in the sky. The following picture shows the
satellites trajectories in a sky-plot for IGS station ONS1 (at about 57° of latitude). There are parts of
the sky where it is impossible to see a satellite.

Figure 24 — Satellite trajectories in a sky plot for IGS station ONS1 (at about 57° of latitude)
Once we have the AR equation, one needs to add the mean to every X but to keep the rest as it was, the
n
mean needs to be subtracted before calculating the next point:
X=µµ+ a X −+ bN
( )
nn11− 0 n
The mean is measured form the input data and no extra calculation is needed.
8 Analysis tools
8.1 General
The overall logic is presented in Figure 25. Clause 8 deals with the validation tool which is represented
on this figure by the blue-dashed rectangle.
Figure 25 — Overall logic
8.2 State of the art on data set similarity assessment
In the literature, data similarity assessment is an active research topic especially for data mining, audio
signal or object recognition or image classification. However, similarity between two data sets has not a
clear definition. The similarity should be assessed according to a particular feature of the data set. Hence,
[RD4] precises that:
— the similarity is related to their commonality in that the more commonality they share, the more
similar they are;
— the similarity is related to the differences between them, in that the more differences they have, the
less similar they are;
— the maximum similarity is reached when X and Y are identical.
Similarity is generally measured in the range 0 to 1 but can be in the range from −1 to 1. In any case, “1”
means good similarity between both data sets considering a particular feature. Generally, we can
compute a similarity measure to have a range between [0,1] from any distance measure with the
following rule:
s xy, =
( )
1+ d xy,
( )
is the distance measure between data sets and .
Where d xy, x y
( )
Depending on the feature similarity to be considered, several techniques and metrics exist.
When considering the time domain in which a specific pattern is searched, cross-correlation between two
data sets is generally performed [RD3]. Indeed, this metric refers to the time domain and point out only
the pattern similarity in time domain. This is generally used in speech recognition or signal alignment. In
a similar way, the so-called “dynamic time warping” algorithm [RD7] is classically used to measure the
similarity between two time series which may vary in speed but have some common patterns. It is used
in speech recognition to handle the different speaking speeds for example.
When considering statistical behaviour, the probability density functions or cumulative distribution
function are classically analysed.
To compare data sets according their probability density functions, several distance or similarity
measures can be defined [RD1] and [RD2]. [RD1] provides a survey of these metrics such as the Euclidean
L2 distance, the City block L1 distance, the Pearson chi2 distance or the Kullback-Leibler distance. These
metrics allow to compare the normalized histograms of different data sets. [RD6] categorizes these
distances according to the so-called f-divergence, α-divergence, or Bregman divergences. To be used as a
metric, distance measure should meet the following properties [RD3]:
— the non-negativity: d xy, ≥ 0
( )
— the identity: if:
d xy, = 0 xy=
( )
— the symmetry:
d x,,y = d yx
( ) ( )
— the triangle inequality: dx,,z ≤ dx y + d y, z
( ) ( ) ( )
[RD3] details some similarity measures used in data mining which can be applied to PDF comparison.
These similarity measures rely on the following function:
xy−
numSim x, y 1−
( )
xy+
Which is used to compute:
— the mean similarity:
n
tsim x, y = numSim x , y
( ) ( )
∑ ii
n
i=1
— the root mean square similarity:
n
rtsim x,,y = numSim x y
( ) ( )
ii

n
i=1
— the peak similarity:
=
n  
x − y
ii
 
psim x, y 1−
( )

 
n
2max x , y
( )
i=1 ii
 
 
In [RD9], a distance between the CDF is defined as follows:
x y
CC−
perc i perc i
() ()
d =
cdf ∑

x y
i=1
05. C + C

perc i perc i
() ()

x y
With C and C the value of CDF at a specific percentile of data set x and y, respectively. The
perc i perc i
() ()
th th th
chosen percentiles are 50 , 75 and 95 .
This distance does not measure the similarity on the whole CDF. Indeed, by choosing 50th as the
minimum percentile, the distribution at low value is not screened.
The Pearson correlation score [RD8] which can be applied to PDF comparison is computed as:
∑∑x y
∑−x y
ii
n
Pearson x, y =
( )
2 2
  
∑∑xy
( ) ( )
  
i i
2 2
∑−xy∑ −
  
ii
nn
  
  
The Pearson correlation score is invariant to scaling and to adding any constant. That means the Pearson
correlation score measure the similarity in relative and not in absolute.
The Kolmogorov–Smirnov test (KS test) is a non-parametric test which allows to compare if the
underlying probability distribution of two data sets are equal at a certain level. The KS test deals with
unknown probability distribution. It provides a distance between the CDF of two data sets.
The so-called method ANOVA (for analysis of variance) is a statistical test to assess if the means of
different random variables are equal [RD10]. ANOVA considers that a random variable denoted y can be
modelled as follows:
y µ+ f i,,jk,…+
( )
with
µ
a constant;
i,,jk are the so-called descriptive variable;
f () is an unknown function;

is the measurement error which is assumed to be distributed as a Gaussian with null
mean.
As ANOVA relies on the Fisher test, it assumes that all the descriptive variables are Gaussian distributed
and all the observations are independent.
=
=
When considering the time dependency of a time series, autocorrelation or frequency spectral analysis is
usually performed. [RD5] analyses the difference between the autocorrelation analysis and the frequency
spectrum analysis through the coherence estimate. The magnitude squared coherence estimate is defined
as:
Pf
( )
xy
Cf =
( )
xy
P f P f
( ) ( )
xx yy
With the power spectral densities ( Pf and Pf ) of x and y and the cross power spectral density
( ) ( )
xx yy
( Pf of x and y ). When the magnitude squared coherence is close to 1, the frequency spectral are
( )
xy
similar. At the opposite, when it is close to 0, they are quite different.
The autocorrelation for not complex signal is defined as:
 
R ττE xt .xt−
( ) ( ) ( )
x
 
In [RD5], it points out on the one hand that the autocorrelation analysis is more sensitive to differences
between signals in time. On the other hand, coherence estimate is more reliable for these differences.
Furthermore, autocorrelation degrades quickly when the noise increases contrary to the coherence
estimate.
To conclude, assessing the similarity of several data sets requires three steps:
1) Define the features of the data sets which are relevant for the application.
2) Select one or several metrics for each selected features.
3) Take a decision on the similarity of the data sets based on the (possibly conflictual) metrics value.
A data set can have a variety of features and several metrics can be used to assess the similarity according
to a specific feature. Here is a list of the main features classically assessed and some associated metrics:
— Spectral signature which can be assessed by using the coherence function.
— Time dependency behaviour which can be assessed by comparing the autocorrelation function on
chosen points.
— Statistical distribution which can be assessed by using non-parametric test, histogram-based
distance or CDF-based distance.
Others features can describe a data set or signal. For example:
— Deterministic/random characteristic.
— Periodic/aperiodic characteristic.
=
8.3 Validation tool description
8.3.1 GP-START project context
In the GP-START project, the main goal of generating synthetic data are to provide a huge number of data
sets for ITS application. These synthetic data sets allow the designer to assess the ITS application
performance especially in terms of false alarm and missed detection. These cases occur when the
estimated position is distant from the true position at a certain level. Thus it seems important to check
the tail of the positioning error distribution. Ideally, the tail can be defined from the value which induce
false alarm or missed detection in the ITS application.
Which is the most important for ITS application is the statistical behaviour of the data sets (especially in
high value – distribution tail) and, depending on how the application is designed, the time dependency of
the positioning error. Thus, deterministic component present in only one of both data sets should raise a
non-similarity warning only if it affects significantly the statistical distribution and the time dependency.
For the GP-START positioning problem, we take the following hypotheses:
— The data sets are not expected to have common relevant patterns in the time domain as long as the
same environment is concerned . This makes the cross correlation or dynamic time warping
methods irrelevant.
— The typical low frequency PVT log implies that only the very-low frequency can be analysed through
frequency analysis (due to Shannon theorem when using FFT). Moreover, autocorrelation function
comparison is easier. Thus, autocorrelation analysis is preferred.
— The high value positioning errors (and its rate in a specific data set) are expected to be the main
component affecting the ITS application perfo
...

Questions, Comments and Discussion

Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.

Loading comments...