Kalman Filter

Model of Kalman Filter

We assume the current state can be modeled as a Gaussian distribution

$$P(z_t \mid z_{t-1}) \sim \mathcal{N}(Az_{t-1}, Q)$$

We assume neural observations can also be modeled as a Gaussian distribution

$$P(x_t \mid z_t) \sim \mathcal{N}(Cz_t, R)$$

We also assume a base case, a prior on the initial state

$$P(z_1) \sim \mathcal{N}(\Pi, V)$$

Thus the model parameters are:

$$\Theta = \{A, Q, \Pi, V, C, R\}$$
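To make the generative model concrete, here is a minimal NumPy sketch (the function name `sample_trajectory` and the example parameter values are illustrative, not part of the original notes) that draws $z_1, \dots, z_T$ and $x_1, \dots, x_T$ from the model above.

```python
import numpy as np

def sample_trajectory(A, Q, Pi, V, C, R, T, rng=None):
    """Sample latent states z_1..z_T and observations x_1..x_T
    from the linear-Gaussian model defined above."""
    rng = np.random.default_rng(rng)
    zs, xs = [], []
    z = rng.multivariate_normal(Pi, V)                     # z_1 ~ N(Pi, V)
    for t in range(T):
        zs.append(z)
        xs.append(rng.multivariate_normal(C @ z, R))       # x_t | z_t ~ N(C z_t, R)
        if t + 1 < T:
            z = rng.multivariate_normal(A @ z, Q)          # z_{t+1} | z_t ~ N(A z_t, Q)
    return np.array(zs), np.array(xs)

# Example with a hypothetical 2-D state and 2-D observation
A = np.array([[1.0, 0.1], [0.0, 1.0]]); Q = 0.01 * np.eye(2)
C = np.eye(2);                          R = 0.10 * np.eye(2)
Pi = np.zeros(2);                       V = np.eye(2)
zs, xs = sample_trajectory(A, Q, Pi, V, C, R, T=50, rng=0)
```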

Model Training

We aim to maximize the joint likelihood of the latent states and the observed data. During training the states $z$ are assumed to be recorded along with the observations $x$, which is why closed-form maximum-likelihood estimates are available and no EM is needed.

$$\mathcal{D} = \{\{x\}^n, \{z\}^n\}_{n=1}^N = \{\{x_1^n, \dots, x_T^n\}, \{z_1^n, \dots, z_T^n\}\}_{n=1}^N$$
$$
\begin{aligned}
\Theta^* &= \arg\max_{\Theta} \prod_{n=1}^N P(\{x\}^n, \{z\}^n \mid \Theta)\\
&= \arg\max_{\Theta} \prod_{n=1}^N P(z_1^n)\left( \prod_{t=2}^T P(z_t^n \mid z_{t-1}^n) \right)\left( \prod_{t=1}^T P(x_t^n \mid z_t^n) \right)\\
&= \arg\max_{\Theta} \sum_{n=1}^N \Big[ \log P(z_1^n) + \sum_{t=2}^T \log P(z_t^n \mid z_{t-1}^n) + \sum_{t=1}^T \log P(x_t^n \mid z_t^n) \Big]\\
&= \arg\max_{\Theta} \sum_{n=1}^N \Big[ -\tfrac{1}{2}\log|V| - \tfrac{1}{2}(z_1^n - \Pi)^{\top} V^{-1} (z_1^n - \Pi) + \sum_{t=2}^T \left( -\tfrac{1}{2}\log|Q| - \tfrac{1}{2}(z_t^n - Az_{t-1}^n)^{\top} Q^{-1} (z_t^n - Az_{t-1}^n) \right) + \sum_{t=1}^T \left( -\tfrac{1}{2}\log|R| - \tfrac{1}{2}(x_t^n - Cz_t^n)^{\top} R^{-1} (x_t^n - Cz_t^n) \right) \Big]\\
&= \arg\min_{\Theta} \sum_{n=1}^N \Big[ \log|V| + (z_1^n - \Pi)^{\top} V^{-1} (z_1^n - \Pi) + \sum_{t=2}^T \left( \log|Q| + (z_t^n - Az_{t-1}^n)^{\top} Q^{-1} (z_t^n - Az_{t-1}^n) \right) + \sum_{t=1}^T \left( \log|R| + (x_t^n - Cz_t^n)^{\top} R^{-1} (x_t^n - Cz_t^n) \right) \Big]
\end{aligned}
$$
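As a sanity check on this objective, the following sketch (assuming SciPy is available; the function and variable names are chosen here for illustration) evaluates the joint log-likelihood of a single trajectory term by term.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def joint_log_likelihood(zs, xs, A, Q, Pi, V, C, R):
    """log P(z_1) + sum_t log P(z_t | z_{t-1}) + sum_t log P(x_t | z_t)
    for one trajectory zs (T x dz) and xs (T x dx)."""
    ll = mvn.logpdf(zs[0], mean=Pi, cov=V)                  # base case
    for t in range(1, len(zs)):
        ll += mvn.logpdf(zs[t], mean=A @ zs[t - 1], cov=Q)  # transition terms
    for t in range(len(xs)):
        ll += mvn.logpdf(xs[t], mean=C @ zs[t], cov=R)      # observation terms
    return ll
```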

Define the objective to be minimized as

$$\mathcal{L} = \sum_{n=1}^N \Big[ \log|V| + (z_1^n - \Pi)^{\top} V^{-1} (z_1^n - \Pi) + \sum_{t=2}^T \left( \log|Q| + (z_t^n - Az_{t-1}^n)^{\top} Q^{-1} (z_t^n - Az_{t-1}^n) \right) + \sum_{t=1}^T \left( \log|R| + (x_t^n - Cz_t^n)^{\top} R^{-1} (x_t^n - Cz_t^n) \right) \Big]$$

The minimum is achieved when the derivatives vanish:

$$
\begin{aligned}
\nabla_{\Pi} \mathcal{L} &= \sum_{n=1}^N -2 V^{-1}(z_1^n - \Pi) = 0 &&\to\; \Pi^* = \frac{1}{N}\sum_{n=1}^N z_1^n\\
\nabla_{V} \mathcal{L} &= \sum_{n=1}^N \left( V^{-1} - V^{-1}(z_1^n - \Pi)(z_1^n - \Pi)^{\top} V^{-1} \right) = 0 &&\to\; V^* = \frac{1}{N}\sum_{n=1}^N (z_1^n - \Pi^*)(z_1^n - \Pi^*)^{\top}\\
\nabla_{A} \mathcal{L} &= \sum_{n=1}^N \sum_{t=2}^T -2 Q^{-1}(z_t^n - Az_{t-1}^n)(z_{t-1}^n)^{\top} = 0 &&\to\; A^* = \left( \sum_{n=1}^N \sum_{t=2}^T z_t^n (z_{t-1}^n)^{\top} \right)\left( \sum_{n=1}^N \sum_{t=2}^T z_{t-1}^n (z_{t-1}^n)^{\top} \right)^{-1}\\
\nabla_{Q} \mathcal{L} &= \sum_{n=1}^N \sum_{t=2}^T \left( Q^{-1} - Q^{-1}(z_t^n - Az_{t-1}^n)(z_t^n - Az_{t-1}^n)^{\top} Q^{-1} \right) = 0 &&\to\; Q^* = \frac{1}{N(T-1)} \sum_{n=1}^N \sum_{t=2}^T (z_t^n - A^* z_{t-1}^n)(z_t^n - A^* z_{t-1}^n)^{\top}\\
\nabla_{C} \mathcal{L} &= \sum_{n=1}^N \sum_{t=1}^T -2 R^{-1}(x_t^n - Cz_t^n)(z_t^n)^{\top} = 0 &&\to\; C^* = \left( \sum_{n=1}^N \sum_{t=1}^T x_t^n (z_t^n)^{\top} \right)\left( \sum_{n=1}^N \sum_{t=1}^T z_t^n (z_t^n)^{\top} \right)^{-1}\\
\nabla_{R} \mathcal{L} &= \sum_{n=1}^N \sum_{t=1}^T \left( R^{-1} - R^{-1}(x_t^n - Cz_t^n)(x_t^n - Cz_t^n)^{\top} R^{-1} \right) = 0 &&\to\; R^* = \frac{1}{NT} \sum_{n=1}^N \sum_{t=1}^T (x_t^n - C^* z_t^n)(x_t^n - C^* z_t^n)^{\top}
\end{aligned}
$$
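In code, these closed-form estimates reduce to a few sums over the training trajectories. A minimal NumPy sketch, assuming the states `zs` and observations `xs` are stored as arrays of shape `(N, T, dim)` (the helper name `fit_lds_mle` is mine):

```python
import numpy as np

def fit_lds_mle(zs, xs):
    """Closed-form MLE for Theta = {A, Q, Pi, V, C, R} given fully observed
    latent states zs (N x T x dz) and observations xs (N x T x dx)."""
    N, T, dz = zs.shape

    # Pi* and V* from the initial states z_1^n
    Pi = zs[:, 0].mean(axis=0)
    d0 = zs[:, 0] - Pi
    V = d0.T @ d0 / N

    # A* and Q* from the transition pairs (z_{t-1}, z_t), t = 2..T
    z_prev = zs[:, :-1].reshape(-1, dz)
    z_next = zs[:, 1:].reshape(-1, dz)
    A = (z_next.T @ z_prev) @ np.linalg.inv(z_prev.T @ z_prev)
    rz = z_next - z_prev @ A.T
    Q = rz.T @ rz / (N * (T - 1))

    # C* and R* from the observation pairs (z_t, x_t), t = 1..T
    z_all = zs.reshape(-1, dz)
    x_all = xs.reshape(-1, xs.shape[-1])
    C = (x_all.T @ z_all) @ np.linalg.inv(z_all.T @ z_all)
    rx = x_all - z_all @ C.T
    R = rx.T @ rx / (N * T)

    return A, Q, Pi, V, C, R
```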

Testing the Model

In the testing phase, we aim to compute $P(z_t \mid x_1, \dots, x_t)$. At each time step, we apply two sub-steps: a one-step prediction and then a measurement update.

  • One-step prediction

    $$P(z_t \mid x_1, \dots, x_{t-1}) = \int P(z_t \mid z_{t-1})\, P(z_{t-1} \mid x_1, \dots, x_{t-1})\, dz_{t-1}$$
  • Measurement update

    $$P(z_t \mid x_1, \dots, x_t) = \frac{P(x_t \mid z_t)\, P(z_t \mid x_1, \dots, x_{t-1})}{P(x_t \mid x_1, \dots, x_{t-1})}$$

We’ll use the following notation for the mean and covariance:

$$\mu_t^t = E[z_t \mid x_1, \dots, x_t], \qquad \Sigma_t^t = \mathrm{cov}[z_t \mid x_1, \dots, x_t]$$

One-step prediction

We assume that we have the mean $\mu_{t-1}^{t-1}$ and covariance $\Sigma_{t-1}^{t-1}$ from the previous iteration, and we need to compute $\mu_t^{t-1}$ and $\Sigma_t^{t-1}$. Because $z_t = Az_{t-1} + \gamma$ with $\gamma \sim \mathcal{N}(0, Q)$,

$$
\begin{aligned}
\mu_t^{t-1} &= E[z_t \mid x_1, \dots, x_{t-1}]\\
&= E[Az_{t-1} + \gamma \mid x_1, \dots, x_{t-1}]\\
&= A\, E[z_{t-1} \mid x_1, \dots, x_{t-1}] + E[\gamma \mid x_1, \dots, x_{t-1}]\\
&= A \mu_{t-1}^{t-1}
\end{aligned}
$$

$$
\begin{aligned}
\Sigma_t^{t-1} &= \mathrm{cov}[z_t \mid x_1, \dots, x_{t-1}]\\
&= \mathrm{cov}[Az_{t-1} + \gamma \mid x_1, \dots, x_{t-1}]\\
&= E[(Az_{t-1} + \gamma - \mu_t^{t-1})(Az_{t-1} + \gamma - \mu_t^{t-1})^{\top} \mid x_1, \dots, x_{t-1}]\\
&= E[(Az_{t-1} - \mu_t^{t-1})(Az_{t-1} - \mu_t^{t-1})^{\top} \mid x_1, \dots, x_{t-1}] + E[\gamma \gamma^{\top} \mid x_1, \dots, x_{t-1}]\\
&= E[(Az_{t-1} - A\mu_{t-1}^{t-1})(Az_{t-1} - A\mu_{t-1}^{t-1})^{\top} \mid x_1, \dots, x_{t-1}] + Q\\
&= A\, E[(z_{t-1} - \mu_{t-1}^{t-1})(z_{t-1} - \mu_{t-1}^{t-1})^{\top} \mid x_1, \dots, x_{t-1}]\, A^{\top} + Q\\
&= A \Sigma_{t-1}^{t-1} A^{\top} + Q
\end{aligned}
$$
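In code, the one-step prediction is just two matrix products; a minimal sketch (the helper name is illustrative):

```python
def predict_step(mu_prev, Sigma_prev, A, Q):
    """One-step prediction: (mu_{t-1}^{t-1}, Sigma_{t-1}^{t-1}) -> (mu_t^{t-1}, Sigma_t^{t-1})."""
    mu_pred = A @ mu_prev                     # mu_t^{t-1} = A mu_{t-1}^{t-1}
    Sigma_pred = A @ Sigma_prev @ A.T + Q     # Sigma_t^{t-1} = A Sigma_{t-1}^{t-1} A^T + Q
    return mu_pred, Sigma_pred
```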

Measurement update

We take advantage of the following property of Gaussian distributions. If

$$x = \begin{bmatrix} x_a \\ x_b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix} \right),$$

then $P(x_a \mid x_b)$ is Gaussian with

$$E(x_a \mid x_b) = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b), \qquad \mathrm{cov}(x_a \mid x_b) = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba}$$
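As a quick numeric illustration (the numbers below are made up), the conditional mean and covariance follow directly from the block partition:

```python
# Hypothetical 1-D blocks of a joint Gaussian over (x_a, x_b)
mu_a, mu_b = 0.0, 1.0
S_aa, S_ab, S_bb = 2.0, 0.8, 1.5
x_b = 2.0                                        # observed value of x_b

cond_mean = mu_a + S_ab / S_bb * (x_b - mu_b)    # E(x_a | x_b)  -> 0.533...
cond_cov  = S_aa - S_ab / S_bb * S_ab            # cov(x_a | x_b) -> 1.573...
```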

Because $x_t = Cz_t + \sigma$ with $\sigma \sim \mathcal{N}(0, R)$, we have $E[x_t \mid x_1, \dots, x_{t-1}] = E[Cz_t + \sigma \mid x_1, \dots, x_{t-1}] = C\mu_t^{t-1}$. Similarly, $\mathrm{cov}[x_t \mid x_1, \dots, x_{t-1}] = C\Sigma_t^{t-1}C^{\top} + R$, and the cross-covariance is

$$\mathrm{cov}[x_t, z_t \mid x_1, \dots, x_{t-1}] = C\Sigma_t^{t-1}$$

Now we have the joint distribution

$$P\left( \begin{bmatrix} x_t \\ z_t \end{bmatrix} \,\Big|\, x_1, \dots, x_{t-1} \right) \sim \mathcal{N}\left( \begin{bmatrix} C\mu_t^{t-1} \\ \mu_t^{t-1} \end{bmatrix}, \begin{bmatrix} C\Sigma_t^{t-1}C^{\top} + R & C\Sigma_t^{t-1} \\ \Sigma_t^{t-1}C^{\top} & \Sigma_t^{t-1} \end{bmatrix} \right)$$

Conditioning on $x_t$ with the property above, we can write the measurement update in terms of the Kalman gain:

$$\mu_t^t = \mu_t^{t-1} + K_t (x_t - C\mu_t^{t-1}), \qquad \Sigma_t^t = \Sigma_t^{t-1} - K_t C \Sigma_t^{t-1}$$

where $K_t = \Sigma_t^{t-1} C^{\top} (C\Sigma_t^{t-1}C^{\top} + R)^{-1}$.
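Putting the two sub-steps together gives the filtering recursion. A minimal sketch, reusing `predict_step` from the prediction sketch above (the helper names are mine, assuming NumPy):

```python
import numpy as np

def update_step(mu_pred, Sigma_pred, x, C, R):
    """Measurement update: fold observation x_t into (mu_t^{t-1}, Sigma_t^{t-1})."""
    S = C @ Sigma_pred @ C.T + R                       # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)            # Kalman gain K_t
    mu = mu_pred + K @ (x - C @ mu_pred)               # mu_t^t
    Sigma = Sigma_pred - K @ C @ Sigma_pred            # Sigma_t^t
    return mu, Sigma

def kalman_filter(xs, A, Q, Pi, V, C, R):
    """Return the filtered means mu_t^t and covariances Sigma_t^t for a sequence xs."""
    mu, Sigma = Pi, V                                  # prior on z_1
    means, covs = [], []
    for t, x in enumerate(xs):
        if t > 0:
            mu, Sigma = predict_step(mu, Sigma, A, Q)  # one-step prediction
        mu, Sigma = update_step(mu, Sigma, x, C, R)    # measurement update
        means.append(mu)
        covs.append(Sigma)
    return np.array(means), np.array(covs)
```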

Reference

  • Matrix cookbook: https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
  • Maximum likelihood for the multivariate Gaussian: https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf