[{"content":"Joint work with Chenguang Duan, Yuling Jiao, Jerry Zhijian Yang, Cheng Yuan and Pingwen Zhang.\nWe propose nonlinear assimilation method called score-based sequential Langevin sampling (SSLS) within a Bayesian recursive framework.\nProblem Setup Consider the following system: $$ \\begin{align*} \\mathbf{X}^{k} \u0026 =f_{k-1}(\\mathbf{X}^{k-1}, \\eta^{k-1}), \\quad k \u003e 1, \\\\ \\mathbf{Y}^{k} \u0026 =g_{k}(\\mathbf{X}^{k}, \\xi^{k}), \\quad k \\geq 1, \\end{align*} $$ where $\\mathbf{X}_k$ is the latent states of interests evolved by $f_k$, and $\\mathbf{Y}_k$ is the observations under measurement $g_k$. Here we assume that $\\eta^k$ and $\\xi^k$ are noises with known distributions.\nGoal of Data Assimilation: Combine historical observations with dynamics simulation to provide the best estimate of the current states.\nRecursive Bayesian Framework Our work is carried out under the recursive Bayesian framework described below: $$ \\begin{align*} \u0026 {\\color{blue} {p(\\mathbf{x}^k | \\mathbf{y}^{[k]})}} \\\\ \\propto~ \u0026 p(\\mathbf{y}^k | \\mathbf{x}^k, \\mathbf{y}^{[k-1]}) p(\\mathbf{x}^k, \\mathbf{y}^{[k-1]}) \\\\ \\propto~ \u0026 \\underbrace{p(\\mathbf{y}^k | \\mathbf{x}^k)}_{\\text{likelihood}} \\underbrace{\\int \\overbrace{p(\\mathbf{x}^k | \\mathbf{x}^{k-1})}^{\\text{transition}} {\\color{blue} \\overbrace{ {p(\\mathbf{x}^{k-1} | \\mathbf{y}^{[k-1]})}}^{\\text{last posterior}}} \\, \\mathrm{d} \\mathbf{x}^{k-1}}_{\\text{prior}} \\\\ \\propto~ \u0026 \\underbrace{p(\\mathbf{y}^k | \\mathbf{x}^k)}_{\\text{likelihood}} \\underbrace{p(\\mathbf{x}^{k} | \\mathbf{y}^{[k-1]})}_{\\text{prior}} \\end{align*} $$ We maintain an ensemble of particles to estimate the prior and posterior distribution throughout the assimilation process. 
At each step, the prior samples are obtained by running the dynamics simulation starting from the posterior particles of the previous time step.\nLangevin Monte Carlo The posterior score can now be decomposed as the sum of the likelihood score and the prior score: $$ \\underbrace{\\nabla \\log p (\\mathbf{x}^k|\\mathbf{y}^{[k]})}_\\text{score of posterior} = \\nabla \\log \\underbrace{p(\\mathbf{y}^k|\\mathbf{x}^k)}_\\text{likelihood} + \\underbrace{\\nabla \\log p(\\mathbf{x}^k|\\mathbf{y}^{[k-1]})}_\\text{score of prior}. $$ The likelihood score can be computed from the known measurement model and noise distribution. As for the prior score, we exploit the score matching technique at each time step based on the prior ensemble.\nAfter assembling the posterior score, we can use any Langevin-type sampling method to draw samples from the posterior distribution, starting from the transitioned ensemble of the previous time step: $$ \\mathrm{d} \\mathbf{X}_t^k = \\nabla \\log p(\\mathbf{X}_t^k | \\mathbf{y}^{[k]}) \\, \\mathrm{d}t + \\sqrt{2} \\, \\mathrm{d} \\mathbf{B}_t, \\ \\mathbf{X}_0^k \\sim p(\\mathbf{x}^k|\\mathbf{y}^{[k-1]}), \\ t \\in [0, \\infty). $$\nFlow Chart We provide a flow chart below.\nPseudocode We provide the Python-like pseudocode below.\n# start from an initial prior\nprior = sample_from_prior()\nfor i in range(k+1):\n    # sliced / implicit / denoising score matching\n    prior_score = score_matching(prior)\n    # assemble the posterior score\n    posterior_score = lambda x: grad_log_likelihood(x, y[i]) + prior_score(x)\n    # any Langevin-type sampling method\n    posterior = langevin(prior, posterior_score)\n    # dynamics transition to get the best guess for the next step\n    prior = dynamics_transition(posterior)\nNumerical Results Numerical examples demonstrate the outstanding performance of SSLS in high-dimensional and nonlinear scenarios, as well as in situations with sparse or partial measurements. 
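To make the pseudocode concrete, here is a minimal runnable sketch of a single SSLS-style assimilation step in a hypothetical scalar setting (not the paper's experiments): the prior is N(0, 1) and the observation is y = x + noise, so both scores are available in closed form and no score-matching network is needed.

```python
import math
import random

random.seed(0)

# Hypothetical scalar setup: prior x ~ N(0, 1), observation y = x + xi,
# xi ~ N(0, sigma2). Both scores are analytic here.
sigma2 = 0.25
y_obs = 1.0

def prior_score(x):
    # score of the standard normal prior
    return -x

def likelihood_score(x):
    # score of p(y_obs | x) with Gaussian observation noise
    return (y_obs - x) / sigma2

def posterior_score(x):
    # posterior score = likelihood score + prior score
    return prior_score(x) + likelihood_score(x)

def langevin(particles, score, step=0.01, n_steps=500):
    # unadjusted Langevin dynamics applied to every particle
    out = list(particles)
    for _ in range(n_steps):
        out = [x + step * score(x) + math.sqrt(2.0 * step) * random.gauss(0.0, 1.0)
               for x in out]
    return out

prior_ensemble = [random.gauss(0.0, 1.0) for _ in range(2000)]
posterior_ensemble = langevin(prior_ensemble, posterior_score)

post_mean = sum(posterior_ensemble) / len(posterior_ensemble)
# exact Gaussian posterior: mean = y/(1 + sigma2) = 0.8, variance = sigma2/(1 + sigma2) = 0.2
```

Since the toy posterior is Gaussian, the ensemble statistics can be checked directly against the conjugate formulas, which is a convenient way to validate a Langevin sampler before plugging in a learned prior score.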
Please refer to our paper for more results.\nHow to Cite If you find our work useful for your research, please consider citing\n@misc{ding2024nonlinearassimilationscorebasedsequential, title={Nonlinear Assimilation with Score-based Sequential Langevin Sampling}, author={Zhao Ding and Chenguang Duan and Yuling Jiao and Jerry Zhijian Yang and Cheng Yuan and Pingwen Zhang}, year={2024}, eprint={2411.13443}, archivePrefix={arXiv}, primaryClass={math.NA}, url={https://arxiv.org/abs/2411.13443}, } ","permalink":"https://zhao-ding.com/research/ssls/","summary":"Nonlinear data assimilation within a recursive Bayesian filtering framework","title":"Nonlinear Assimilation via Score-based Sequential Langevin Sampling"},{"content":"Aim This post presents my interpretation of diffusion models (ODE-based, specifically), along with some insights from my recent work, which develops a one-step generation scheme for diffusion models by exploiting the deterministic nature of ODE flows. Throughout the post, I\u0026rsquo;ll not dive into mathematical details and theorems, but focus on the intuitive ideas only.\nTask In generative learning, we aim to generate samples from a target distribution. We probably know nothing about the density of the target distribution, but we have plenty of random samples. A common example is the distribution of images of dog faces. We certainly don\u0026rsquo;t know what the distribution looks like, but we can find tons of such images in real life.\nModel In plain words, a diffusion model bridges a source distribution and a target distribution. The source distribution is usually easy to sample from, e.g. the standard normal distribution. It is natural to imagine the diffusion model as a river that flows through the source distribution at the start, and reaches the target distribution in the end. Now what remains is: how does the river flow? 
As you can imagine, there are infinitely many ways to build the process!\nInterpolant Viewpoint Recall from elementary math that two fixed points in the 2D Euclidean plane determine the line that passes through both of them. Now let\u0026rsquo;s consider the distributional counterpart of this idea. Given the source distribution $\\mu_0$ and the target distribution $\\mu_1$, we want to design a process such that its marginal distribution is exactly $\\mu_0$ at time 0, and $\\mu_1$ at time 1. A natural design is by convolution. Let $X_0 \\sim \\mu_0$ and $X_1 \\sim \\mu_1$, and define a family of random variables $X_t$ by (rescaled) convolution:\n$$X_t = \\alpha_t X_0 + \\beta_t X_1,$$where $\\alpha_t$ and $\\beta_t$ are interpolant coefficients such that, when $t=0$, $X_t$ is exactly $X_0$ obeying $\\mu_0$, and when $t=1$, $X_t$ is exactly $X_1$ obeying $\\mu_1$. For $t \\in (0, 1)$, the law of $X_t$ is the convolution of the laws of $X_0$ and $X_1$, rescaled by $\\alpha_t$ and $\\beta_t$ respectively. We can design infinitely many combinations of $\\alpha_t$ and $\\beta_t$ (satisfying some mild conditions omitted here), such as $1-t$ and $t$; or $\\sqrt{1-t^2}$ and $t$, \u0026hellip;\nNow look at the family of density functions of $X_t$, denoted by $(p_t)_{t \\in [0, 1]}$. I\u0026rsquo;ll show you the result directly and omit the mathematical details here. 
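Before stating the result, the interpolant construction itself can be sanity-checked in a few lines of code. The sketch below uses the linear pair alpha_t = 1 - t, beta_t = t, a standard normal source, and an assumed bimodal Gaussian-mixture target chosen purely for illustration.

```python
import random

random.seed(1)

# Sanity check of the interpolant X_t = alpha_t * X_0 + beta_t * X_1
# with the linear pair alpha_t = 1 - t, beta_t = t.
n = 5000
x0 = [random.gauss(0.0, 1.0) for _ in range(n)]  # source: standard normal
x1 = [random.gauss(-3.0, 0.5) if random.random() < 0.5 else random.gauss(3.0, 0.5)
      for _ in range(n)]                          # assumed target: two well-separated modes

def interpolate(t):
    a, b = 1.0 - t, t
    return [a * u + b * v for u, v in zip(x0, x1)]

# at the endpoints the marginals are exactly the source and target samples
assert interpolate(0.0) == x0
assert interpolate(1.0) == x1

# at t = 0.5 each sample is already pulled toward one of the two target modes,
# with roughly half the ensemble on each side
mid = interpolate(0.5)
frac_positive = sum(1 for v in mid if v > 0) / n
```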
There is a linear continuity equation corresponding to this process:\n$$\\partial_t p_t + \\nabla_x \\cdot (p_t v(t, x)) = 0, \\quad (t, x) \\in [0, 1] \\times \\mathbb{R}^d,$$where $v(t, x) = \\mathbb{E} [\\dot{\\alpha}_t X_0 + \\dot{\\beta}_t X_1 | X_t = x]$.\nBy the method of characteristics, there is a corresponding ODE system\n$$\\dot{x}_t = v(t, x_t).$$Now, assume that we can compute or approximate $v(t, x)$ by some means. Then by simply sampling $x_0$ from $\\mu_0$ (which is easy, as mentioned before) and solving the ODE system with some numerical solver (such as Euler\u0026rsquo;s method) from time 0 to 1, we obtain a sample $x_1$ randomly drawn from $\\mu_1$.\nVelocity Estimation What remains now is to find a good way to compute or approximate the velocity field. Thanks to the great approximation power of neural networks, and the fact that $v(t, x) = \\mathbb{E} [\\dot{\\alpha}_t X_0 + \\dot{\\beta}_t X_1 | X_t = x]$, we are able to train a neural network by minimizing the velocity matching loss:\n$$L(v_\\theta) = \\int_0^1 \\mathbb{E}_{X_0, X_1} \\| v_\\theta (t, X_t) - \\dot{\\alpha}_t X_0 - \\dot{\\beta}_t X_1\\|_2^2 \\, \\mathrm{d}t,$$which can be easily replaced by its empirical counterpart.\nConnection with Denoising Score Matching The velocity matching technique is mathematically equivalent to the well-known denoising score matching technique introduced in score-based diffusion models.\nThe Stein score is the gradient of the log-density function of the distribution. Denote the score function of the marginal density at time $t$ by $s(t, x)$; then some calculation yields\n$$v(t, x) = \\frac{\\dot{\\beta}_t}{\\beta_t} x + \\alpha_t^2 \\Big(\\frac{\\dot{\\beta}_t}{\\beta_t} - \\frac{\\dot{\\alpha}_t}{\\alpha_t}\\Big) s(t, x).$$By this equation, denoising score matching is mathematically equivalent to velocity matching.\nIn recent works, people have realized that the score function itself may change rapidly near time $1$, which makes neural networks harder to train. 
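The velocity-score identity above is easy to verify numerically in a toy Gaussian case, where both sides are available in closed form. The sketch below assumes, purely for illustration, X_0 ~ N(0, 1) and X_1 ~ N(m, s2) independent, with the linear pair alpha_t = 1 - t, beta_t = t.

```python
# Check of the identity v(t,x) = (db/b)*x + a^2*(db/b - da/a)*s(t,x)
# in an assumed Gaussian case where everything is in closed form:
# X_0 ~ N(0, 1), X_1 ~ N(m, s2) independent, alpha_t = 1 - t, beta_t = t.
m, s2 = 2.0, 2.25
t, x = 0.4, 1.3
a, b = 1.0 - t, t
da, db = -1.0, 1.0  # time derivatives of alpha_t and beta_t

var_t = a ** 2 + b ** 2 * s2        # Var(X_t) for the Gaussian marginal
score = -(x - b * m) / var_t        # Stein score of p_t, a Gaussian N(b*m, var_t)

# conditional expectations of the jointly Gaussian pairs (X_0, X_t), (X_1, X_t)
e_x0 = a * (x - b * m) / var_t
e_x1 = m + b * s2 * (x - b * m) / var_t

v_direct = da * e_x0 + db * e_x1    # v(t,x) = E[da*X_0 + db*X_1 | X_t = x]
v_formula = (db / b) * x + a ** 2 * (db / b - da / a) * score
```

The two expressions agree to floating-point precision, which is exactly the equivalence the identity claims.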
A good practice is to adopt the denoiser setting defined by\n$$D(t, x) = \\mathbb{E} [X_1 | X_t = x].$$Ideally, the denoiser takes values in the same range as $X_1$, which is convenient since that range is bounded. Just as before, there is an equivalent expression relating $v(t, x)$ and $D(t, x)$:\n$$v(t, x) = \\frac{\\dot{\\alpha}_t}{\\alpha_t} x + \\beta_t \\Big(\\frac{\\dot{\\beta}_t}{\\beta_t} - \\frac{\\dot{\\alpha}_t}{\\alpha_t}\\Big) D(t, x).$$In practice, to align with mainstream works, we adopt the denoiser matching setting.\nNumerical Solver Assuming that we have trained a velocity field, what remains is to numerically solve the ODE system.\nIn the early days of diffusion models, people resorted to the forward Euler method, which was sufficient to generate samples back then. Over time, however, it became clear that the Euler method converges slowly compared to more refined methods. Many acceleration techniques have been developed over the past years. Among them I\u0026rsquo;d like to single out the exponential integrator, which I find most useful in practice.\nThe exponential integrator was developed for solving differential equations in the past century. It exploits the semi-linearity of the ODE system: by the \u0026ldquo;variation of constants\u0026rdquo; formula, we can solve the linear part analytically rather than passing it to the numerical solver. In the literature, this solver has been rediscovered or reinvented many times; the first-order samplers of several well-known works are exactly the exponential integrator (perhaps under another name). 
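Jumping ahead slightly, for the interpolant above the first-order exponential-integrator update can be written in denoiser form and exercised on a 1D Gaussian example, where the exact denoiser is available in closed form. The setup (source N(0, 1), target N(m, s2), alpha_t = 1 - t, beta_t = t) is an assumption made purely for this sketch.

```python
# Sketch of sampling with the exponential integrator in denoiser form,
#   x_s = (a_s/a_t) * x_t + (b_s - a_s * b_t / a_t) * D(t, x_t),
# written so that no division by a_s occurs at s = 1.
# Assumed 1D setup: source N(0, 1), target N(m, s2), alpha_t = 1 - t, beta_t = t.
m, s2 = 1.0, 4.0

def denoiser(t, x):
    # exact D(t, x) = E[X_1 | X_t = x] for the Gaussian pair
    var_t = (1.0 - t) ** 2 + t ** 2 * s2
    return m + t * s2 * (x - t * m) / var_t

def step(t, s, x):
    a_t, b_t = 1.0 - t, t
    a_s, b_s = 1.0 - s, s
    return (a_s / a_t) * x + (b_s - a_s * b_t / a_t) * denoiser(t, x)

def solve(x, n=200):
    # integrate the probability-flow ODE from t = 0 to t = 1
    for i in range(n):
        x = step(i / n, (i + 1) / n, x)
    return x

# every update is affine in x, so pushing two points through the sampler
# recovers how it maps the source mean and one source standard deviation
out_mean = solve(0.0)
out_spread = solve(1.0) - out_mean
# for this Gaussian pair the exact flow sends the mean to m = 1 and
# stretches a unit deviation to the target standard deviation 2
```

With enough steps, out_mean lands on the target mean essentially exactly, and out_spread approaches the target standard deviation at the solver's first-order rate.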
In our setting, the exponential integrator yields the solution:\n$$x_s = \\frac{\\alpha_s}{\\alpha_t}x_t + \\alpha_s \\Big(\\frac{\\beta_s}{\\alpha_s} - \\frac{\\beta_t}{\\alpha_t}\\Big) D(t, x_t).$$It\u0026rsquo;s worth noting that the exponential integrator is a first-order solver, just like the forward Euler method, but it handles the linear part exactly, which eases error propagation, and it has been widely adopted as an effective sampler for ODE-based diffusion models.\nCharacteristics Revisited A good property of a well-defined ODE system is that, for any given initial condition, the solution trajectory is uniquely determined. That is to say, for each noise sample $x_0$, there is one and only one corresponding $x_1$ solved by the ODE. This deterministic property holds only for ODE-based models; for SDE models, the solution trajectory is stochastic due to the diffusion part. A natural idea is to train a new network which mimics the solution of the ODE system over any time interval. The new network, denoted by $g(t, s, x)$, takes three inputs. Once trained, to generate a sample, we simply feed $g$ the initial point $x$, the starting time $t$ and the end time $s$; ideally, it outputs the solution $x_s$.\nThe question is how to design and train this new network. There are three key insights from our research:\nA good parameterization and initialization\nWe absolutely do not want to train the new network from scratch, but rather to reuse the pretrained denoiser network. Inspired by the exponential integrator, we parameterize $g$ by\n$$g(t, s, x) = \\frac{\\alpha_s}{\\alpha_t}x + \\alpha_s \\Big(\\frac{\\beta_s}{\\alpha_s} - \\frac{\\beta_t}{\\alpha_t}\\Big) \\bar{d}(t, s, x),$$where $\\bar{d}$ is initialized from a pretrained denoiser $d_\\theta$ with an extra temporal input $s$.\nLocal consistency\nSince the new network is an integral network, its time derivative should be identical to the velocity field. 
Thus we can reuse the velocity (denoiser) matching technique to ensure local consistency: when $s = t$, $\\bar{d}(t, s, x) = D(t, x)$.\nGlobal consistency\nThe integral operator admits a semi-group property:\n$$g(u, s, g(t, u, x)) = g(t, s, x).$$ ","permalink":"https://zhao-ding.com/posts/my-first-post/","summary":"To be improved\u0026hellip;","title":"Thoughts of diffusion models"}]