Wow, Lena, we are jumping right into it! We’re trying to keep these posts fairly concise, so let me stir the pot by saying I don’t think “meaningful human control” is the right concept. For me, the difficulty with MHC is that we’re trying to gain the advantages of autonomy without assuming the inherent risks. Any time we employ an autonomous agent there will always be a risk of unintended actions. This is true even with autonomous human agents. The difference is that we are comfortable with risks in autonomous human agents because we always have the fallback position that humans are inherently responsible for their actions and can be held accountable. Not so for machines. So my position is that we need to work on the engineering problem of aligning autonomous machine actions with human goals, while understanding that no system is zero-risk and that we will have to accept some level of machine failure depending on context and circumstances. To start, let’s consider human autonomy first.
In warfare we have had lethal autonomy since the first hominid leader sent another hominid out to bash somebody’s skull with a boar femur, like in 2001: A Space Odyssey. That leader assumed, or ensured, that their hominid subordinate (Grog) understood who the target was (Lunk) and that Grog could and would actually whack the right target (Lunk again) according to governing norms and goals. This basic concept, human agents killing things others told them to kill, has followed us through the centuries. I’ll call it human autonomy (put lethal on the front if you want). In the available records of warfare there has always been some level of human autonomy on the battlefield. Different societies and cultures have made more, or less, use of human autonomy in warfare, but the concept is always present in some form. In modern Western militaries the need for human autonomy has been woven into operations so deeply that tight control and reduced autonomy are seen as poor warcraft and the sure sign of a terrible warrior. The best military units are often measured by how much autonomy their members are granted, which is one reason positions in those units are so highly sought after. But even in the best of situations, Grog doesn’t always whack Lunk; sometimes Grog whacks Gunk. And sometimes Grog might whack Gunk for unacceptable reasons. Why?
A basic framework for explaining the occurrence of unacceptable uses of lethal force could be three M’s: Miss, Mis-identify, and Malice. First, Grog could miss the intended target, Lunk. That miss could be due to any number of factors, including environmental conditions, weapon malfunction, poor or incorrect employment of the weapon, Lunk’s desire not to be struck and thus to evade, or even a miss by design (meaning a fragmentation weapon whose effects are not guaranteed to strike the target Grog aimed at). The result is that Grog correctly identified and engaged Lunk but hit Gunk instead.
The second possibility is that Grog mis-identifies Lunk and Gunk, classifying Gunk as the target and Lunk as not-the-target. This could also be due to environmental conditions, Lunk’s camouflage, Grog’s malfunctioning sensor array (perhaps Grog needs glasses), or Grog simply failing to interpret the data correctly and concluding that Gunk was Lunk. Essentially, a cognition failure.
Third, for reasons that may or may not be clear, Grog correctly identifies both Lunk and Gunk but, despite guidance and norms, chooses to strike Gunk. I’ll call this one malice. Grog chose to break the rules and act independently for their own reasons.
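Just to make the three M’s concrete, here’s a minimal, purely illustrative sketch of the order of questions an after-action review might ask about a bad engagement. The names and the function are mine, a toy, not any existing framework or doctrine.

```python
# Toy sketch of the three-M taxonomy; all names here are hypothetical.
from enum import Enum, auto
from typing import Optional

class EngagementFailure(Enum):
    MISS = auto()          # correctly identified and engaged Lunk, but hit Gunk
    MISIDENTIFY = auto()   # classified Gunk as the target: a perception/cognition failure
    MALICE = auto()        # identified both correctly, chose to strike Gunk anyway

def classify_failure(identified_correctly: bool,
                     aimed_at_intended_target: bool,
                     struck_intended_target: bool) -> Optional[EngagementFailure]:
    """Return the failure mode for a bad engagement, or None if nothing went wrong."""
    if not identified_correctly:
        return EngagementFailure.MISIDENTIFY
    if not aimed_at_intended_target:
        return EngagementFailure.MALICE
    if not struck_intended_target:
        return EngagementFailure.MISS
    return None
```

The ordering matters: you only ask whether Grog meant to aim at Gunk after you’ve established that Grog knew who was who, and you only call it a miss after ruling out the other two.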
In all three possibilities, the actions were under human control, but was that control meaningful? It depends on who you ask. Grog might think so, but from our point of view Grog did not do the right thing, so clearly there was some kind of lapse. Apparently we need to impose more meaningful control to ensure Grog’s actions meet our intent and values. Where and when should we impose that control? During training? During operations? All of the above? Is it possible we could impose so much control that Grog is no longer really acting on his own? Is that good? Don’t we want Grog to be able to apply lethal force without confirmation or approval from leadership? Can we ever be sure Grog will always do what we want them to do? If we can’t be sure, how much risk are we willing to accept, and how can we keep that risk to a minimum?
Let’s train Grog. We indoctrinate them into a culture and ethos reflective of our intent, values, and norms. We provide doctrine as a basis for behavior that can be drawn upon within the complexity of the real world. We provide equipment that works. We test Grog’s performance. We put Grog inside a context that facilitates action, and ultimately we impose consequences if Grog’s actions are misaligned with these guardrails. We can be comfortable with that. This seems to work well enough for most militaries. The trouble comes when we try to transplant these regimes to machine autonomy.
We can train machines, we can encode constraints that emulate values-based actions, and we can test machines to see if they work. But, just as with humans, we can’t anticipate every scenario in which they might have to perform. That’s ok for humans because we always have our fallback position of human agency and accountability. But we haven’t yet figured out a way to punish machines when they do something we don’t like, and this seems unsatisfying, especially when human deaths are involved. This disrupts our risk acceptance calculation.
For humans we know that the risk of employing human autonomy on the battlefield will never be zero. Sometimes the wrong person will be killed. We’re generally ok with that because we are empathetic to the difficulties of other humans, and if all else fails we can hold the offending human accountable for their own actions. Not so with machines. Is there such a thing as empathy for machines? Further, if an autonomous machine uses lethal force incorrectly there is this sense that determining responsibility is somehow very difficult. Enter Meaningful Human Control.
MHC is a first attempt to adapt the controls we apply to human autonomy to machine autonomy, addressing unacceptable uses of force by ensuring that human control is sufficiently tied to each action. This assumes that humans are better at minimizing unacceptable engagements (not clear) and ensures humans don’t get a pass on responsibility when the machine does something untethered from human intent. The danger is that MHC will constrain machine autonomy to the point where it is no longer meaningful.
More on meaningful autonomy in another post. For now, let’s accept that the point of machine autonomy is to gain the benefits of human autonomy, but at machine speed and with greater resilience. Imposing control of action, rather than alignment of action, erodes those benefits. Of course, we could always say we’ll just have to kill each other at human speed rather than machine speed, but no society that believes its physical or cultural survival to be at stake will forgo what could be a decisive security advantage. So where do you go from there? This brings me back to alignment.
We know we can build machines that align actions to human intent within a specific context. We also know that as that context becomes more complex, the ability to predict alignment becomes more difficult. This is where edge cases start to pop up and confound people. Think of an autonomous vehicle driving into the side of a white truck because the perception model mistook the trailer for sky. As complexity disrupts alignment, risk mitigation and acceptance must take over.
For example, if I’m operating a lethal autonomous weapon over the skies of Taiwan right now, my risk tolerance for unacceptable engagements would be very low indeed. However, if People’s Liberation Army landing craft are hitting the beaches of Taiwan tomorrow, that same system would likely be acceptable, unaltered, because the context, and therefore the risk acceptance, has changed. The trick is having a testing and evaluation regime in place that allows a reasonable calculation of risk acceptance based on alignment performance within context, rather than trying to impose a broad regime of meaningful human control outside of any specific context.
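To show what I mean by that calculation (and only as a toy; the numbers, names, and the context-similarity adjustment below are hypothetical assumptions of mine, not doctrine or anyone’s actual T&E method), here’s a sketch of weighing tested alignment performance against a context’s risk tolerance:

```python
# Toy sketch: compare a system's tested unacceptable-engagement rate against a
# context-dependent threshold. All names, thresholds, and rates are made up.
from dataclasses import dataclass

@dataclass
class ContextProfile:
    name: str
    max_unacceptable_rate: float  # tolerable probability of an unacceptable engagement per sortie

@dataclass
class TestedPerformance:
    unacceptable_rate: float    # rate observed during testing and evaluation
    context_similarity: float   # 0..1, how closely the T&E context matches employment

def employment_acceptable(context: ContextProfile, perf: TestedPerformance) -> bool:
    # Penalize the tested rate when the T&E context diverges from the real one,
    # since alignment predictions degrade as context complexity grows.
    adjusted_rate = perf.unacceptable_rate / max(perf.context_similarity, 1e-6)
    return adjusted_rate <= context.max_unacceptable_rate

# Same tested system, different contexts, different answers:
peacetime_patrol = ContextProfile("deterrence patrol over the strait", 1e-4)
repel_invasion = ContextProfile("landing craft hitting the beach", 1e-2)
tested = TestedPerformance(unacceptable_rate=5e-3, context_similarity=0.8)

print(employment_acceptable(peacetime_patrol, tested))  # False: risk tolerance too low today
print(employment_acceptable(repel_invasion, tested))    # True: the context changed the calculus
```

The arithmetic isn’t the point; the point is that the same tested system fails the bar in one context and clears it in another, which is exactly the Taiwan example above.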
I agree that agonizing over the definitions inside the MHC construct is not useful, but I would go further and throw out that whole school of thought. I think there is some movement on this, as the U.S. Department of Defense is embracing “appropriate human judgment” as a governing principle, but I think the most promise is coming from those working on alignment. It is an emerging field, but to me it seems the best fit within a military system that is already adept at calculating, mitigating, and accepting risk.