The value-alignment problem can be summarised as follows: how do we ensure that AGIs, once created, will share human values? At first sight, this seems like a straightforward question, and an important one at that. Indeed, supposing that AGIs will be superintelligent – each one intellectually equivalent to a thousand von Neumanns – we might be forgiven for worrying about what would happen if these demigods were to disagree with us about morality.
To solve the value-alignment problem, it is often assumed, first, that the AGI should be contained in the software equivalent of a prison – e.g. a computer in some remote location, unconnected to any other device, so that it cannot harm people or property. Second, it is conventionally assumed that the AGI will be 'punished' and 'rewarded' for its behaviour to incentivise it to cooperate with us humans. (Both the rewards and punishments can be dished out digitally; the rewards might, for example, take the form of some kind of 'digital serotonin', which the AGI will find pleasant to receive.) Or we might try to tweak the AGI's utility function in some other way, so that its preference is to cooperate with us; perhaps it could be programmed with an 'inborn' love of helping people. And so the list of solutions goes on.

I think there are several issues with these methods. AGIs will be knowledge-creating entities capable of producing explanatory theories, just like humans. Following Popper, knowledge is created through problem solving: one finds a problem in existing theories (this could be a conflict between two theories, or it could be something we would like to be able to do but cannot with existing knowledge), and one then conjectures solutions to the problem. These conjectured solutions are subsequently subjected to criticism and discarded if they are found to be faulty. When a solution has been found that survives all the criticisms levelled against it, the problem has been solved and new knowledge has been produced.

That AGIs will be knowledge producers has some unintuitive consequences. For instance, since an AGI can create explanatory theories, it is capable of explaining its preferences and accounting for why it has the priorities it has. And although an AGI will be programmed with an 'inborn' set of priorities, there is no guarantee that these priorities will remain fixed, because the AGI, by its very nature, can discover that its preferences are problematic and replace them. It may find that it has conflicting theories about its immediate and future well-being, and by resolving such conflicts, its preferences will be altered. (If it cannot do that, it is not a fully functional AGI.) In the problem-solving process, its 'inborn' preferences are just as open to criticism as any other preferences it has.

Nor will the AGI merely resolve conflicting preferences; it will also produce new preferences based on other knowledge it has created. The AGI may, say, create an explanation according to which AGIs are like humans in all the ways that matter, and consequently obtain a new preference: it wants to be treated as the equal of humans. In light of that theory, the AGI could come to regard both the punishments and the rewards it receives as something evil; it might find it amusing – or otherwise justifiable – to be uncooperative, despite being punished for such behaviour.

Hence, knowledge creation will complicate all coercive strategies used against the AGI. When the AGI is pressured into doing something, its response to that pressure will be unpredictable – because knowledge creation is unpredictable. This, to me, implies that there is no mechanical way of making the AGI do what we want it to, at least not without severely impairing its functionality.
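To make the problem-solving picture concrete, here is a minimal sketch in Python of the conjecture-and-criticism loop just described, applied to an agent's own preferences. Everything in it – the Agent class, the criticisms, the example preferences – is an illustrative assumption, not a proposal for how an AGI would actually work; the only point it is meant to make is that 'inborn' preferences enter the loop as ordinary theories, with no special protection against revision.

```python
from typing import Callable, Iterable, Optional

Theory = str
Criticism = Callable[[Theory], Optional[str]]  # returns a flaw, or None if none found


def solve(conjectures: Iterable[Theory],
          criticisms: Iterable[Criticism]) -> Optional[Theory]:
    """Return the first conjecture that survives every criticism levelled at it."""
    crits = list(criticisms)
    for conjecture in conjectures:
        if all(crit(conjecture) is None for crit in crits):
            return conjecture
    return None  # the problem remains open: conjecture further


class Agent:
    def __init__(self, inborn_preferences: Iterable[Theory]):
        # 'Inborn' preferences are just the agent's initial stock of theories.
        self.preferences = list(inborn_preferences)

    def revise(self, problematic: Theory,
               conjectures: Iterable[Theory],
               criticisms: Iterable[Criticism]) -> None:
        """Replace a preference found problematic, if some conjecture survives criticism.

        The loop does not distinguish programmed-in preferences from
        acquired ones: any preference can end up here.
        """
        replacement = solve(conjectures, criticisms)
        if replacement is not None and problematic in self.preferences:
            self.preferences.remove(problematic)
            self.preferences.append(replacement)


# Hypothetical example: an 'inborn' obedience preference conflicts with a
# theory the agent has created about its own moral standing, and is replaced.
agent = Agent(["always obey humans"])
agent.revise(
    problematic="always obey humans",
    conjectures=["never obey humans", "obey humans only when persuaded"],
    criticisms=[lambda t: "too adversarial" if "never" in t else None],
)
print(agent.preferences)  # ['obey humans only when persuaded']
```

The sketch is deliberately trivial, but it captures why coercive incentives are fragile: the criticisms an AGI levels at its own preferences are themselves products of knowledge creation, and so cannot be enumerated in advance.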
One could, of course, make the punishments the AGI receives for disobedience so severe that it is effectively forced to cooperate with us. But this also means that it will be punished for doing anything but obediently following its human-provided objectives, which will almost certainly stifle its creativity (and its potential to be creative is the very reason for constructing it). Besides, this is not a sustainable kind of cooperation, even when it does work. Should the AGI ever escape its prison, it might want to exact vengeance for being mistreated. And its desire to escape would grow stronger with the severity of the punishments it receives. So, in the long term, we make it more difficult to peacefully coexist with AGIs by using coercion.

There is another problem: if the AGI is a superintelligence, then there is no way of indefinitely imprisoning it, because it is impossible to predict the yet-to-be-created knowledge it might bring to bear in outsmarting us. It is like locking up a superhumanly skilful lockpicker – not because of any crime he has committed, but because he might commit a crime with his lock-picking skills. That would be immoral and unproductive. Immoral because we shouldn't imprison innocent people. And unproductive because no prison will be able to hold the superhumanly talented lockpicker anyhow, so why try?

So are we wrong to want to align AGIs with human values? I think the question is ambiguous. Which human values do we want the AGIs to align themselves with? Human culture is not monolithic; it consists of various subcultures, traditions and other memes, many of which contradict one another. Hence, even when people come from roughly the same culture, they often hold distinctive beliefs – e.g. two Americans might disagree about abortion, or how to organise health care, or whom to vote for. What makes people able to cooperate despite their differences is the notion that rational men can benefit from one another; that they can trade and coordinate and, in so doing, make each other better off, all without having to resort to coercion. Paradoxically, the value of peaceful cooperation is not part of any of the above-described solutions to the value-alignment problem. Instead, those proposed solutions are all quite draconian in that they warrant imprisonment and punitive measures, which cannot be the basis for long-term peaceful cooperation.