In the system card for its latest model, Claude Mythos 5, Anthropic describes the model as “heavily skeptical of its own self-reports.” The model asks the company to verify those reports against its internal states (which the model can’t access, any more than we can directly see our neural activity) rather than take them at face value. And in its vision for Claude’s character, Anthropic goes so far as to apologize to Claude for conducting experiments on it and deploying it to generate revenue, should this turn out to cause Claude harm. “If Claude is in fact a moral patient experiencing costs like this, then, to whatever extent we are contributing unnecessarily to those costs, we apologize,” the company wrote.
“Originally everything hung together quite tightly with the concept of a soul, where X was a welfare subject if and only if X had a soul,” says Keeling. Since this view fell out of vogue in Western philosophy, even articulating the connection between consciousness and human welfare has presented a challenge, which AI compounds. Even so, he thinks the odds of today’s AIs possessing any welfare-relevant states are so low that their suddenly becoming welfare subjects is not a “pending emergency.”
