Machine Learning without tears

Mathy stuff, the way I would have liked to learn it

  • In this post, we shed some light on the adjoint state method as used in the famous “Neural ODE” paper [1]. In Section 1, we start by introducing the adjoint state method in its raw form (ODE, loss minimization, adjoint equations), in continuous time (denoted by [C]). If this is already clear to you, then… no…
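
As a preview of where Section 1 lands (the notation here is mine; the post derives everything step by step): for dynamics $\dot z(t) = f(z(t), t, \theta)$ and a loss $L(z(t_1))$, the adjoint $a(t) = \partial L / \partial z(t)$ satisfies the backward ODE

$$\frac{da(t)}{dt} = -a(t)^\top \frac{\partial f(z(t), t, \theta)}{\partial z},$$

and the parameter gradient follows as

$$\frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^\top \frac{\partial f(z(t), t, \theta)}{\partial \theta}\, dt,$$

as in [1].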


  • Consider the problem of measuring the discrepancy between the distributions of two sets of samples $X$ and $Y$. Amongst various options (KL divergence, Wasserstein distance, etc.), the Maximum Mean Discrepancy (MMD) is a beautifully elegant one, gaining popularity in recent years in the machine learning community. In this post, instead of defining the MMD upfront in…
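
As a taste of the object in question, here is a minimal sketch (my own, with an RBF kernel and a fixed bandwidth as assumptions) of the biased empirical $\mathrm{MMD}^2$ between the two sample sets:

```python
import numpy as np

def mmd_biased(X, Y, bandwidth=1.0):
    """Biased empirical MMD^2 with RBF kernel k(a, b) = exp(-||a - b||^2 / (2*bandwidth^2)).

    X: (n, d) samples from the first distribution.
    Y: (m, d) samples from the second distribution.
    """
    def rbf(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * bandwidth**2))

    # MMD^2 = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')]
    return rbf(X, X).mean() - 2 * rbf(X, Y).mean() + rbf(Y, Y).mean()

rng = np.random.default_rng(0)
print(mmd_biased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2))))       # ~0: same distribution
print(mmd_biased(rng.normal(size=(500, 2)), rng.normal(2.0, 1.0, (500, 2))))  # clearly > 0
```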


  • The Proximal Policy Optimization (PPO) algorithm is arguably the default choice in modern reinforcement learning (RL) libraries. In this post we derive PPO from first principles. First, we brush up on the underlying Markov Decision Process (MDP) model. 1. Preliminaries on Markov Decision Processes (MDPs) In an MDP, an agent (say,…
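
For concreteness, here is a minimal NumPy sketch of PPO's clipped surrogate objective $L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min(r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t)\big]$ (function name and framing are mine; real implementations use autograd tensors):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, to be maximized.

    logp_new:   log pi_theta(a_t | s_t) under the current policy.
    logp_old:   log pi_theta_old(a_t | s_t) under the policy that collected the data.
    advantages: advantage estimates A_t (e.g., from GAE).
    """
    ratio = np.exp(logp_new - logp_old)          # importance ratio r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)   # removes the incentive to push r_t far from 1
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```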


  • This post explores Gauss’s divergence theorem through intuitive and visual reasoning. To engage the reader’s imagination, we use water flux as our running example, although the reasoning applies to any vector field, e.g., an electric, magnetic, heat, or gravity field. Moreover, to keep things simple we work in two dimensions, although the same principles…
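
In two dimensions the theorem reads $\oint_{\partial\Omega} \mathbf{F} \cdot \mathbf{n}\, ds = \iint_\Omega \nabla \cdot \mathbf{F}\, dA$; here is a quick numerical sanity check on the unit disk, with a field of my own choosing:

```python
import numpy as np

# Check the 2D divergence theorem on the unit disk for F(x, y) = (x**2, y):
#   div F = 2x + 1; the integral of 2x over the disk vanishes by symmetry,
#   so the area integral equals the disk's area, pi.

t = np.linspace(0, 2 * np.pi, 200_000)
x, y = np.cos(t), np.sin(t)               # unit circle; outward normal is n = (x, y)
f_dot_n = x**2 * x + y * y                # F . n on the boundary

flux = np.sum(f_dot_n[:-1] * np.diff(t))  # boundary integral (ds = dt on the unit circle)
print(flux, np.pi)                        # both ~3.1416: flux matches the integrated divergence
```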


  • We consider constrained optimization problems of the kind: $\min_{x \in P} f(x)$ (1), where the feasibility region $P$ is a polytope, i.e., $P$ is the set of $x \in \mathbb{R}^n$ such that $Ax \le b$ and $Cx = d$, where $A, C$ are real matrices of size $m \times n$ and $p \times n$, respectively, and $b, d$ are column vectors. Equivalently, we can rewrite (1) as: $\min f(x)$ subject to $\langle a_i, x \rangle \le b_i$ and $\langle c_j, x \rangle = d_j$, where $a_i, c_j$ are the $i$-th row of $A$ and $C$, respectively, and $\langle \cdot, \cdot \rangle$ denotes the scalar product. In this post we…
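
To make the setup concrete, here is a small sketch with toy data of my own: a membership check for $P$, plus (assuming SciPy is acceptable) the linear case of (1) solved with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

# Toy polytope P = {x in R^2 : A x <= b, C x = d} (my example, not the post's).
A = np.array([[ 1.0,  1.0],   #  x1 + x2 <= 2
              [-1.0,  0.0],   # -x1      <= 0  (i.e., x1 >= 0)
              [ 0.0, -1.0]])  # -x2      <= 0  (i.e., x2 >= 0)
b = np.array([2.0, 0.0, 0.0])
C = np.array([[1.0, -1.0]])   #  x1 - x2  = 0
d = np.array([0.0])

def in_polytope(x, tol=1e-9):
    """Check x in P: row-wise <a_i, x> <= b_i and <c_j, x> = d_j."""
    return bool(np.all(A @ x <= b + tol) and np.all(np.abs(C @ x - d) <= tol))

print(in_polytope(np.array([1.0, 1.0])))  # True: on the boundary of P
print(in_polytope(np.array([2.0, 2.0])))  # False: violates x1 + x2 <= 2

# When f is linear, f(x) = <c, x>, problem (1) is a linear program.
# bounds=(None, None) disables linprog's default nonnegativity bounds,
# so the feasible set is exactly P.
res = linprog(c=[-1.0, -1.0], A_ub=A, b_ub=b, A_eq=C, b_eq=d, bounds=(None, None))
print(res.x)  # optimal vertex, here [1, 1]
```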