Domain Specific Languages for Statistical Modeling
I'd like a STAN for moment-based models
I’ve been thinking about the fact that statistical modeling doesn’t really have a standard declarative language for specifying a model—and how there’s nothing really preventing that from being so. I have this hope that were such a language to exist, people would finally understand the difference between an estimator and an estimand.
I started thinking about this recently when I wrote a little something for a side project—coming soon—in STAN.
[I style the name in all-caps for unknown reasons. It’s not an acronym (I think).]
If you’ve never used it before, STAN is a domain-specific language for likelihood models, primarily for Bayesian inference. The core idea is that you can specify the model in this declarative syntax, and the STAN program computes derivatives and the path for Markov Chain Monte Carlo iterations of the posterior distribution using automatic differentiation in a fairly well-optimized way (so, it’s nearly as fast as this sort of computationally-intensive thing can be). It’s super nifty.
To see what it looks like, here is an example of a STAN specification from a model I wrote for an old project.
[If you’re curious: the “old project” is an app that creates an “optimal” schedule for working towards a long-term goal based on your current, observed work behavior (i.e., you like to work on Tuesdays, you like to work in big chunks, you like to frontload or backload work relative to deadlines, etc). You don’t enter your preferences. The app learns from your actual work behavior to help converge on an “ideal” work schedule across projects over time. It never really got any traction, but I liked the idea.]
data {
int<lower=0> N; // number of days so far
int<lower=N+1> T; // when the work has to be done
array[N] real logit_remaining_share; // logit{W / (Y-work prior to W)}
array[T] int G;
}
parameters {
vector[6] dow;
real<lower=0.0> sd;
real weekday;
real<lower=0.0,upper=1.0> r;
real<lower=-0.20,upper=0.20> b;
}
transformed parameters {
vector[7] K;
for (i in 1:7) {
if (i <= 5) {
K[i] = exp(dow[i] + weekday);
}
else {
if (i==7) {
K[i] = 1;
}
else {
K[i] = exp(dow[i]);
}
}
}
}
model {
vector[N] future;
vector[N] present;
for (i in 1:6) {
dow[i] ~ normal(0, 1);
}
weekday ~ normal(0, 1);
r ~ beta(1,1);
sd ~ gamma(1,1);
b ~ uniform(-0.20,0.20);
for (i in 1:N) {
future[i] = 0.0;
for (j in (i+1):T) {
future[i] = future[i] + ((1+sqrt(j-i))^b)*(K[G[j]]^(1/r));
}
present[i] = K[G[i]]^(1/r);
}
for (i in 1:N) {
logit_remaining_share[i] ~ normal(log(present[i])-log(future[i]), sd);
}
}
The key idea here is that you can just describe the prior distributions [weekday ~ normal(0,1)] and the likelihoods [logit_remaining_share[i] ~ normal(log(present[i]) - log(future[i]), sd)] and then STAN simulates the posterior distribution using a fancy Monte Carlo. This makes it easy to specify flexible, realistic priors and likelihoods without relying on computational tricks (like conjugate priors). So you don’t specify models based on computational concerns, but on how well they describe the data-generating process. Plus—and I like this part a lot—the model’s description is fully separate from the details of how it’s estimated.
[Yes, I frequently write code without descriptive variable names. I think this is the good and proper way to write math code when you have a model you’ve written out somewhere in math, i.e., the names here match the LaTex, which is the source of truth, and the real documentation. But your mileage—and ability to get such things past engineers reviewing your PR’s—may vary.]
But I don’t do many Bayesian projects. I mostly work firmly in frequentist land. I don’t do much likelihood-based estimation either. Broadly, I use moment-based estimators, but I really like the idea of a domain-specific language for modeling that’s portable across R, Python, Fortran, Julia, whatever.
I’ve been sketching out what this would look like (I think the cool kids call this “building in public”).
Here’s my current draft:
data
N;
K;
J;
X[N,K];
Y[N];
Z[N,J];
T[N];
end
parameters
b[K];
te[1];
end
latent
# parameters can show up here
Y{T} := start |T, t|
Y if T = t else unobserved;
end
end
assumptions
# no parameters or latent variables can appear on the right side of the := operator
moment(Z @ (Y - X @ b)) := 0; # iv assumption to get b
moment(te - (Y{1} - Y{0})) := 0; # define treatment effect
moment(Y{T}) - moment(Y | T) := 0; # exogenous treatment assignment
# alternatively, define some CIA-like assumption
moment(Y{T} | X) := moment(Y | T, X)
endAnother example for a discrete choice demand model:
data
J; # number of products
K; # number of characteristics
N; # number of customers
X[J,K]; # product features
B[N,J]; # dummies for whether each customer bought
end
parameters
b[K]; # characteristic parameters
a[J]; # product fixed effect
choice_prob[N,J]; # customer probability of choosing each option
end
assumptions
do i from 1 to N
do j from 1 to J
choice_prob[i,j] - 1/(1+exp(-1*cross(X[j,:], b) - a[j])) := 0;
end
end
do j from 1 to J
moment(choice_prob[:,j] - B[:,j]) := 0
end
moment((choice_prob - B) @ X) := 0
end
A key point here is that “moment(…)” is not just a function being applied to an input. It’s part of the language, and the language knows the properties of expectations. So it can, for example, know that:
E[Y{1} | T = 1, X] := E[Y{1} | X] => E[Y{1}] = E[E[Y{1}|T=1, X]]
You can also centralize moments with “central_moment(…)” and get higher order moments by using “moment(X, 2)” or “central_moment(X,2)” (i.e. the variance).
This isn’t a project I’ve written code for yet. I’m still trying to decide whether this makes sense and is useful enough to spend a decent chunk of time on, but in principle, I think having a declarative syntax for moment-based modeling makes sense.
The core, workhorse estimator behind it all would be the Generalized Method of Moments, because it’s flexible enough to cover all of these cases. Maybe it can even include inequality restrictions…
Anywho, thank you for listening to me ramble about ideas I’ve been bouncing around to improve statistical programming. I really like the idea of more firmly decoupling the estimand from the estimator in the code itself, so I’ll probably take a shot at making this a reality once I’ve thought about the problem more precisely and have a good grasp of how to implement it right.
What do you think are the biggest gaps we have in statistical programming?
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn

