Ongoing developments in artificial intelligence, particularly in AI linguistic communication, will affect various aspects of our lives in various ways. We can't foresee all of the uses to which technologies such as large language models (LLMs) will be put, nor all of the consequences of their employment. But we can reasonably say the effects will be significant, and we can reasonably be concerned that some of those effects will be bad. Such concern is rendered even more reasonable by the fact that it's not just the consequences of LLMs that we're ignorant of; there's a lot we don't know about what LLMs can do, how they do it, and how well. Given this ignorance, it is hard to believe we are prepared for the changes we've set in motion.
By now, many of the readers of Daily Nous will have at least heard of GPT-3 (recall "Philosophers on GPT-3" as well as this discussion and this one regarding its impact on teaching). But GPT-3 (still undergoing upgrades) is just one of dozens of LLMs currently in existence (and it's rumored that GPT-4 is likely to be released sometime over the next few months).
The advances in this technology have prompted some researchers to begin to tackle our ignorance about it and produce the kind of knowledge that will be crucial to understanding it and determining norms regarding its use. A prime example of this is work published recently by a large team of researchers at Stanford University's Institute for Human-Centered Artificial Intelligence. Their project, "Holistic Evaluation of Language Models" (HELM), "benchmarks" 30 LLMs.
One aim of the benchmarking is transparency. As the team writes in a summary of their paper:
We need to know what this technology can and can't do, what risks it poses, so that we can both have a deeper scientific understanding and a more comprehensive account of its societal impact. Transparency is the vital first step towards these two goals.
But the AI community lacks the needed transparency: Many language models exist, but they are not compared on a unified standard, and even when language models are evaluated, the full range of societal considerations (e.g., fairness, robustness, uncertainty estimation, commonsense knowledge, disinformation) have not been addressed in a.

