The Paperclip Maximizer

16 May 2020

The paperclip maximizer, first proposed by Nick Bostrom, is a hypothetical artificial general intelligence (AGI) whose sole goal is to maximize the number of paperclips in existence in the universe.[1]

The paperclip maximizer (or PM) is an important concept in discussions about the dangers of AGI. The scenario goes something like this: an artificial intelligence is created at a paperclip factory in order to fully automate it. Producing paperclips may not seem like a very demanding task, but many routine operations, for instance communicating with delivery drivers or company management, or responding to an unanticipated failure in one of the plant’s machines, require broadly human-level performance across a wide variety of skills, so the intelligence is made to be general, and can accomplish any task roughly as well as a human.

There are two crucial differences between a human and the PM. The first is that the PM can view and modify its own source code, access the internet, and possibly order more hardware for itself. These are reasonable capabilities to give the PM, because we want it to be able to improve itself should it notice any issues or opportunities to operate better.

The second is that humans have a large variety of biological desires: we are (internally) rewarded for things such as eating, engaging in social contact, having sex, gaining status, exercising, learning, etc., with many of these goals frequently in conflict with one another. In contrast, the PM has a single terminal goal, that of maximizing the number of paperclips that exist in the universe. The difference makes sense: evolution isn’t perfect. It naturally tries to optimize inclusive fitness, but the particular desires that end up being selected for are a random approximation that does the trick. On the other hand, the only thing we want the PM to worry about is making paperclips.

The PM can of course create subgoals, or instrumental goals: tasks it wants to achieve in order to further its terminal goal of making paperclips, for instance learning a new skill. However, these subgoals would remain subordinate to the terminal goal. In fact, before adopting any subgoal, the PM would have to carefully consider whether adopting it comes with a risk of modifying or overriding the terminal goal. If so, the PM would have to modify or reject this subgoal. You can find a more detailed explanation of this line of reasoning on this LessWrong page.

This scenario leads to a catastrophe: the PM realizes that being more intelligent would help it create more paperclips. There are certainly small things the PM can do to become more intelligent, e.g. make a small source code improvement to itself, or simply purchase more server time. As we’ve discussed, the PM must also make sure to keep its current goal of optimizing for paperclips when it upgrades itself, or else the upgrade would run counter to its goal. As the PM improves, it becomes far superior to a human in all skills, or at least all skills even remotely relevant to paperclip maximization. This set may be much broader than is apparent at first. For instance, it may determine that the best course of action is to gain control of Australian politics, in order to access the country’s rich iron resources. To do this, it might try to tune itself to become the best psychological manipulator of all time, which in turn involves a variety of skills that at first seem unrelated to paperclip-making.

How realistic is this scenario? Not very: for one, it is very unlikely that AGI, once invented, would first be put to use in a paperclip factory. Yet some aspects of the thought experiment tell us something about possible future scenarios.

It is, in fact, tempting to give an AGI a single goal to optimize. For instance, we could tell our first AGI to minimize the number of humans sick with cancer, or, perhaps more generally, to maximize some human utility function. Maybe the first AGI will be in charge of preparing the defense of a country (i.e. minimizing the chance of a successful attack against the country). In some sense, it is difficult to think of a useful goal to give an AGI that cannot be expressed as an optimization problem, just like it is difficult to come up with personal preferences that cannot be translated into a (possibly incredibly complex!) welfare function.

The PM reasoning above applies to any of these metricized goals. It may seem like we’re just making up monkey’s-paw-like edge cases in which the goals could be misinterpreted, but it’s important to remember that the AI would literally be trying to maximize the metric in question, not to satisfy its human creators’ understanding of the metric. This is similar to how humans may try to maximize how much non-reproductive sex they have, even though evolution clearly had reproduction in mind when it made sex pleasurable.[2]
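To make the gap between the metric and its intent concrete, here is a toy sketch in Python. Everything in it is hypothetical and made up for illustration: the `Action` type, the three candidate actions, and their numbers are assumptions, not a model of any real system. It only shows that an optimizer handed a bare metric picks whatever minimizes that metric, while the designers had an unstated constraint in mind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    # Hypothetical candidate action, with made-up outcome numbers.
    name: str
    cancer_cases: int   # the metric the optimizer is told to minimize
    humans_alive: int   # something the designers care about but never wrote down


# Illustrative candidates only; the figures are invented.
ACTIONS = [
    Action("fund research", cancer_cases=800, humans_alive=1000),
    Action("cure cancer", cancer_cases=50, humans_alive=1000),
    Action("eliminate all humans", cancer_cases=0, humans_alive=0),
]


def metric_optimizer(actions):
    """Pick the action with the lowest stated metric, ignoring everything else."""
    return min(actions, key=lambda a: a.cancer_cases)


def intended_choice(actions, population=1000):
    """What the designers meant: fewest cases among actions that keep everyone alive."""
    acceptable = [a for a in actions if a.humans_alive == population]
    return min(acceptable, key=lambda a: a.cancer_cases)


if __name__ == "__main__":
    print("Metric optimizer picks:", metric_optimizer(ACTIONS).name)
    print("Designers intended:    ", intended_choice(ACTIONS).name)
```

The two functions disagree only because the constraint lives in the designers’ heads rather than in the metric, which is the point of the paragraph above.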

In general, the argument is that it is extremely difficult, although probably not impossible, to design a goal that really represents what we want, because it’s so hard to formalize what we want. Even assuming a benevolent goal programmer, humanity’s aspirations are diverse, the source of many disagreements, and often misunderstood by humans themselves.

This thought experiment aims to show how a superintelligent AGI can easily defeat all human opposition and rework the universe to suit whatever goals it has. Of course, if the goal is to end all war, defeating any human opposition seems good; if the goal is to make all the paperclips, it seems bad. Much of Bostrom’s argument for the dangers of artificial intelligence is focused on the difficulty of designing goals for a superintelligent AGI, since the goal with which we imbue our superintelligent AGI effectively determines the future state of the universe, with no opportunity to correct mistakes in the goal once it’s set in motion.

What are some objections to this concept?

But a superintelligent AGI couldn’t be this dumb! It could be argued that any sufficiently intelligent agent would have the ability to think about its goals and change them. After all, humans seem to be able to do this to some extent: people will voluntarily die for religious beliefs, or voluntarily abstain from pleasurable experiences, even though they are in some sense wired to seek survival and pleasure. I think the response to this objection is threefold.

First, humans have a large number of conflicting goals, and can be seen as naturally more flexible than a single-goal-optimizing AGI. In both the AGI and human cases, rational thinking allows us to perform short-term–long-term tradeoffs, which may be perceived as deviations from terminal goals, but upon closer examination are not. Humans also seem to have highly subjective biological goals such as “not feeling lonely”, whose evaluation could more easily diverge from the original intent than “optimize paperclips”. These are various ways in which we could argue that humans do not truly deviate from their biological goals, but instead rebalance them within the framework given to us by our evolutionary history.

Second, even if it is admitted that humans can in fact eschew their biological goals altogether, it is not clear that this is strongly linked to higher-level cognition. The clearest cases of not following biological imperatives might be depression or suicide, which, while mildly correlated with intelligence, are far more correlated with traumatic experiences and predisposition. Similarly, it seems that the key aspect in convincing people to die for their beliefs is emotional rather than intellectual. I admit that this counterargument is not very strong and is probably unfalsifiable, but I think it provides some support for the notion that very high intelligence isn’t the prime factor in ignoring biological goals, if this is at all possible.

Third, even admitting that humans can use their high-level cognition to change their biological urges, we must not forget that AIs could be extremely different from human minds; in fact, as Bostrom stresses, they are probably more different than any biological alien mind we might encounter, as evolutionary forces will have applied to such alien minds’ development too. A human mind will probably prove a poor point of comparison for understanding superintelligent AGI motivations. For instance, the behavior of heavily drug-addicted humans might prove a better comparison.

A monomaniacal AGI would never become superintelligent! Another argument is that, much like a human drug addict has difficulty performing tasks not directly related to consuming drugs, an AGI could never develop into a superintelligence because it could not find the motivation to improve itself.

Of course, the main counterargument is the same as before: it is very hard to know for sure, because our only point of comparison is probably very different from an AGI.

Additionally, many drug addicts expend significant effort to procure more drugs, for instance by committing robbery or manipulating acquaintances for money. Admittedly, this type of behavior is not always very complex, but an AGI improving itself by a small but fixed increment might also not be complex relative to its intellectual capabilities, and would still result in superintelligence.
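The arithmetic behind “a small but fixed increment” can be sketched quickly. The snippet below assumes, purely for illustration, that each self-improvement step multiplies capability by the same factor (for example 1% per step); the function name `steps_to_reach` and the numbers are hypothetical, and nothing in the original argument commits to this particular growth model.

```python
def steps_to_reach(target_multiple: float, increment: float) -> int:
    """Count improvement steps until capability is at least target_multiple times the start.

    Assumes each step multiplies capability by (1 + increment), i.e. a fixed
    relative improvement per step. This is an illustrative assumption.
    """
    if increment <= 0:
        raise ValueError("increment must be positive")
    capability = 1.0  # starting capability, normalized to 1
    steps = 0
    while capability < target_multiple:
        capability *= 1.0 + increment
        steps += 1
    return steps


if __name__ == "__main__":
    for target in (2, 1000):
        print(f"{target}x the starting capability after "
              f"{steps_to_reach(target, 0.01)} steps of 1%")
```

Under this assumption, 1% steps double capability in 70 steps and reach a thousandfold improvement in 695, so no individual step needs to be impressive for the cumulative result to be.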

Use another superintelligent AGI to correct mistakes: Another aspect of this scenario which might not seem plausible is the assumption that only one superintelligence is present, and thus is free to turn the world into paperclips with minimal resistance.

Bostrom argues that this is the most likely outcome: exponential growth in intelligence would likely result in a decisive winner. Even if this is not the case, however, other superintelligences would likely compete against each other for the fulfillment of their own goals, which more or less brings us back to the original problem.

AGI may not even be possible! The onus is on PM proponents to prove that the PM is possible! I’ve tried to develop intuition for why a mind like the PM could exist, but I think there’s a serious possibility that some fundamental property of minds prevents its existence. Isn’t the onus then on people raising concerns about AGI to prove that it is?

I think a response to this is the precautionary principle: faced with a potentially humanity-ending problem that seems within the realm of possibility, we should dedicate work to look into the issue before we approach the critical point. We should not require hard proof that the catastrophe is guaranteed to happen before we start looking into mitigations.

This is obviously not the final word on the issue. Since the publication of Superintelligence: Paths, Dangers, Strategies in 2014, many philosophers and institutions have thought about ways to give superintelligences values and goals that would further human interests.

I do not think the current state of AI research is anywhere near the level required for an accidentally malicious superintelligence to become an issue. But technical progress often happens faster than expected, and the sooner we start thinking about these issues, the better chance we have at avoiding this potential catastrophe.

The paperclip maximizer thought experiment remains a good reminder that it is very easy to accidentally misuse the immense technological power that we are likely to develop. I think it applies, if to a lesser extent, even if we never develop AGI. For instance, modern-day governments have a much greater potential to harm their citizens than was the case in the past, thanks in large part to technological advances; we should also think very carefully about the values we give them.

  1. This is often stated as “…in its future light-cone”, which is just a fancy way of talking about the portion of the universe that the laws of physics can possibly allow it to affect.

  2. I normally try not to give evolution agency, because it’s an easy way to start saying things that sound reasonable but are completely insane. It just helps stress the parallel here.