Will Scaling Improve Social Simulation with LLMs? A Study of 85 Models

Stanford / Columbia / Tsinghua


An empirical study of 85 transformer models, up to 70B parameters, across three task families: opinion modeling, behavioral simulation, and longitudinal forecasting. Scaling generally helps for well-represented populations but consistently fails to improve calibration to human cognitive biases such as risk aversion, and underrepresented demographic groups see substantially slower gains.
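One way to make "substantially slower gains" concrete is to fit a least-squares slope of a per-group fidelity score against log parameter count, then compare the slopes across groups. The sketch below is illustrative only and is not the paper's method. The function name `scaling_slope`, the notion of a single "fidelity score", and all numbers are hypothetical toy values, not results from the study.

```python
import math


def scaling_slope(param_counts, scores):
    """Least-squares slope of score versus log10(parameter count).

    Hypothetical helper for illustration; assumes at least two distinct sizes.
    """
    xs = [math.log10(p) for p in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical toy data, NOT from the paper: a fidelity score
# (higher = closer to human responses) for models at four sizes.
params = [1e9, 7e9, 13e9, 70e9]
majority_scores = [0.50 + 0.10 * math.log10(p / 1e9) for p in params]
minority_scores = [0.40 + 0.03 * math.log10(p / 1e9) for p in params]

print(round(scaling_slope(params, majority_scores), 3))  # 0.1
print(round(scaling_slope(params, minority_scores), 3))  # 0.03
```

Under these assumptions, a smaller slope for a group means each tenfold increase in model size buys less fidelity for that group, which is the shape of the gap the study reports.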

Why it matters

A clear empirical finding that scale alone does not fix bias calibration or minority-group fidelity in social simulation. This marks an important boundary for the growing practice of using LLMs as stand-ins for human survey respondents.

Importance: 2/5

85-model study (Stanford/Columbia/Tsinghua) identifying scaling limits for social simulation and bias calibration

Tags: scaling, benchmark, evaluation, reasoning, alignment, bias

Sources

Will Scaling Improve Social Simulation with LLMs? (arXiv, official paper)