Experiments and Gradient Descent
How to find the best version via iterative experiments via analogy to gradient descent
This post will be about experimentation, but there’s a wind-up.
Today, we’re going to look at a class of experimentation problems that we might call “tuning.” A problem where there is a wide variety of plausible choices for a parameter, and we’re trying to find the best one among nearly infinite plausible choices.
Common parameters to optimize include:
Price
Promotional Discounts
Hyperparameters of ML models (search, recommendations, etc.)
Suppose you want to maximize a function f(u,t). At any point in time, you know f(ut, t), but you can’t evaluate f at any other points but whatever u happens to be “active”. For example, you know your current prices (ut) and you know revenue today, but you don’t know what revenue would be at other prices.
How do we learn the best u if we can only evaluate one u at a time?
Let’s think about it as an optimization problem:
Suppose f is strictly concave in u and its gradient satisfies a Lipschitz condition (we’ve decided to be formal-ish today—suit and tie). Then, we could solve this problem via gradient descent, i.e.:
For some suitable sequence of (a).
The problem is: we don’t know the derivative of f, and we don’t know how to evaluate f at other points.
The trick is that we can use experiments to iterate on the gradient descent problem. An experiment is just like computing the objective function and its gradient. It takes a few weeks to compute, though, and our data limits how many points we can compute at once.
The starting point is the relationship between treatment effects and derivatives in this context. Suppose u is a number (not a vector), and you expose half the population to u and the other half to u+h. Well, viola, you have an estimate of the derivative:
[You can expand this to multi-dimensional u by adding more arms to the experiment or by adding structure to f that lets you learn other derivatives in a neighborhood of the current values of u you’re testing.]
First, suppose f doesn’t depend on time, and also, for the sake of simplicity, u is not a vector. Then, the optimal u doesn’t change. It’s static. A stable target you’re trying to hit. Using the gradient descent analogy, we can try to find the best u like so:
When the treatment effect from increasing u is positive, increase u. When it is negative, decrease u, and so on.
There is a vast literature on step sizes that work for gradient descent with a strictly concave f. The key property we need is a method that doesn’t require us to evaluate the function at many points because experimentation limits the number of points we can evaluate. A straightforward approach to choose a, assuming f is strictly concave, using a modern “two-point” method from the literature that works for general concave functions, c/o Malitzky and Mischenko:
Where Dft is the current derivative estimate, and Dft-1 is the prior derivative estimate. I’ll call this the MM-rule.
[The MM-rule formula treats dividing by 0 as infinity, so you’ll always choose the other branch of the min{*,*}.]
The main advantage of making this choice instead of the classical gradient descent step size is that you don’t need to run experiments with a ton of arms to approximate the maximization problem:
You’d operationalize this like so:
Initially: Run an experiment with the status quo “champion” u vs a “challenger” u + h. Choose h large enough that you’ll see some real action statistically, but small enough that you’re reasonably approximating a derivative. Choose an arbitrary a to get a new challenger. The right choice here is a bit of an art, but it’s fine to just choose it so the treatment looks “reasonable.” This is just an initial condition.
Do this again to have enough initial conditions to start using the MM-rule.
Now, run the best arm of the previous experiment versus the new challenger generated by the MM-rule.
Keep iterating, using the MM-rule above. Repeat step 3 forever.
The nice part about this method of experimentation is that it allows you to make careful, small but continuous iterations toward the truth instead of making large, costly attempts to jump to the optimal value (usually based on strong functional-form assumptions). You’re always testing your latest proposal against the data, so you can’t drift too far.
Practical issues you’ll run into: the formula for a might produce too small of a step size for you to see any real action. The treatments are too close together. It’s fine to put a lower bound on step size, even if it violates some guarantee of convergence. The precision the optimizers look for isn’t attainable if we’re estimating functions from noisy data.
Now, what if the function’s value changes over time, i.e., we have f(u,t)? That’s why we’d run experiments in the first place, instead of choosing different u’s and trying them out to trace out the function. We need to run experiments because otherwise we’ll confuse variation in different u’s for variation in t.
When we experiment, we learn:
So, we can trace out the derivative with respect to u from the data, as in the static function case, but the optimal u we’re chasing is changing as well.
I’m proposing using the same algorithm above to solve the problem when t changes the function (and so, the optimal u)—with a small twist.
Of course, if we’re completely general about how t changes f, this approach won’t work, because the optimal u could jump around wildly, but that’s not really what we think will happen.
Essentially, we’re going to assume that the optimal u varies “slow enough” with t that we can track it by accounting for changes in the gradient with respect to u.
The main practical change is that the MM-rule now depends on function values that we can’t just keep in memory. We need to evaluate the gradients of f(*,t), not the gradients we previously computed of f(*,t-1). So, here’s my simple, practical iteration strategy:
Run a four-arm experiment. The experiment has ut (the current iteration), ut-1 (the previous iteration), ut + h, and ut-1 + h (the h’s can be different if you’d like).
Use the results of this experiment to construct Lt.
Now, if you look at the formula, you’ll see that technically it depends on at all the way back and prior Lt’s that are function evaluations will have been evaluated on f(*,t-2), for example, instead of f(*,t). My sort of pragmatic approach here is to ignore that and use the old at’s where they appear as they were when you computed them.
I think this is a reasonable, practical tradeoff to avoid exploding the number of arms the experiment needs while still accounting for the fact that the function has changed, moving from t-1 to t.
Experimentation can test more than button colors. We can use it to understand the models that drive the modern biz. Models that target certain customers with promotional offers or specifically tailored marketing strategies. The number of reasonable models is vast, and finding the right one requires forcing proposed models to confront the truth about their performance. Experiments provide the ground truth and structure the search. It’s a nice pairing.
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn

