Mike Ambinder – Valve
For the last four years Dr. Mike Ambinder has been using his expertise in experimental psychology to better understand how people interact with the games they play. Dr. Ambinder’s current work focuses on data analysis, hardware research, and developing methods to foster different behaviors in games. He’s also done research on biofeedback and helped expand Valve’s playtesting methodologies. As a result of his pioneering work, players might one day find that games respond to their emotional state as well as their actions. Physiological responses to gameplay, like heart rate and skin conductivity, may inform the experience and allow for a heightened connection between player and game. Although he’s extremely busy, Dr. Ambinder was kind enough to take the time to answer our questions about the intersection of psychology and game design.
This interview was originally published on 5/5/2012
You have a unique position, both at Valve and in the industry at large. Briefly tell us about your educational background and what your current duties are at Valve.
I have an undergraduate degree in Computer Science and Psychology from Yale and a PhD in Psychology (studying Visual Cognition) from the University of Illinois. My role at Valve is essentially to apply knowledge and methodologies from psychology to game design and all aspects of our products/company. In practice, I spend a lot of time on data analysis, hardware research, playtesting methodologies, and anywhere knowledge of human behavior could be useful—how to foster cooperation/competition among players, creating reward/reinforcement ratios, manipulating visual attention onscreen, designing experiments for the TF2 economy, evaluating the bias in our internal review systems, etc.
What types of biometric data do you collect from playtesters and why? What equipment do you need to collect that data?
We’re playing around with a variety of methodologies to acquire physiological signals from our players. We’ve conducted experiments with heart-rate, eye-movements, skin conductance, pupil dilation, facial expressions, EEG signals, posture, gestures, tone of voice, muscle contraction, respiration rate, and possibly a couple others that I’m forgetting. Measuring each of these signals requires different equipment, but generally speaking, you need a specific sensor which measures the particular signal you’re trying to acquire and reliable algorithms to parse the data. The reliability and intrusiveness of each sensor varies across signals as well.
We’re still exploring which signals are most viable for our uses, but we’re focused primarily on measuring skin conductance these days, as it is highly correlated with physiological arousal, fairly reliable, and provides easily detectable phasic (transient) and tonic (long-term) changes to the level of arousal. To measure skin conductance, we place two metal contacts on the skin and measure the conductance of the skin between them by passing a small current across the contacts. You can build a sensor that does this for about $5.
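As a rough illustration of how cheap such a sensor can be: a common hobbyist approach puts the skin contacts in a voltage divider with a fixed resistor and reads the divider with a microcontroller's ADC. The sketch below converts that reading into conductance; the component values, names, and the 10-bit ADC are illustrative assumptions, not Valve's actual design.

```python
# Hypothetical voltage-divider skin-conductance readout (all values assumed).
V_SUPPLY = 3.3        # supply voltage across the divider (volts)
R_FIXED = 100_000.0   # fixed resistor in series with the skin contacts (ohms)
ADC_MAX = 1023        # 10-bit ADC full-scale reading

def skin_conductance_microsiemens(adc_reading: int) -> float:
    """Estimate skin conductance (µS) from the voltage measured across
    the fixed resistor at the bottom of the divider."""
    v_out = V_SUPPLY * adc_reading / ADC_MAX       # voltage across R_FIXED
    if v_out <= 0 or v_out >= V_SUPPLY:
        return 0.0                                 # open or shorted contacts
    r_skin = R_FIXED * (V_SUPPLY - v_out) / v_out  # standard divider equation
    return 1e6 / r_skin                            # resistance -> microsiemens
```

Phasic responses then show up as short-lived bumps in this value, while the tonic level is its slow-moving baseline.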
In previous interviews and talks you’ve described emotion as a vector. Tell us about that concept and why it’s important in game design.
When people describe a particular emotion, it can be difficult to articulate the specifics of what they mean. To counter this, describing emotion as a vector is a useful abstraction. It may not be completely accurate, but it provides a basis of common understanding.
A vector is a mathematical entity with both a magnitude—essentially its size—and a direction. If you think of emotion as a vector with the intensity or arousal level of the emotion representing the magnitude and the valence (positivity or negativity) of the emotion as its direction, you can create a mapping of magnitudes and directions to particular emotional states (as you can see on the attached graph).
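The mapping described above can be sketched as a small lookup: treat arousal as the magnitude and valence as the direction, then bucket the pair into a coarse emotion label. The labels and thresholds below are invented for demonstration; they are not a validated model of emotion.

```python
# Illustrative emotion-as-vector mapping (labels and thresholds are assumptions).
def classify_emotion(arousal: float, valence: float) -> str:
    """arousal (magnitude) in [0, 1]; valence (direction) in [-1, 1]."""
    if arousal < 0.3:
        # low-magnitude vectors: mild states in either direction
        return "content" if valence >= 0 else "bored"
    # high-magnitude vectors: intense states, sign of valence picks the side
    return "excited" if valence >= 0 else "frightened"
```

The useful property is that a single measured quantity (arousal) narrows the answer to one row, and valence then selects within it.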
In your talk at the Game Developers Conference last year you mentioned that arousal is easier to measure than valence. What methods do you use to measure valence? How might this become easier in the future?
Valence is tricky—it is extremely difficult to infer from a waveform at the moment, so our best methodology is to parse facial expressions or, failing that, to be smart about the inferences we draw from particular game events.
There are companies working on technology to parse facial expressions with web cameras, so if they continue to make progress, that would be one reliable way. Beyond that, we need to start applying some advanced mathematics to our signal processing to see if, for example, the spike in arousal for a negative event reliably differs from that for a positive one.
How do you adjust your playtesting strategies for games of vastly different genres? What changes in the approach for playtesting games as different as Left 4 Dead 2 and Portal 2?
We understand that everything we design is simply a hypothesis, and that our users will provide the data letting us know if we’ve made the right decision or not. With that underlying philosophy, a lot of our playtesting practices do not vary from game to game. We use a combination of direct observation, Q&A sessions, designed experiments, analysis of game metrics, surveys, and massive beta testing, and sometimes we’ll add in physiological measurements, eye-tracking, or another methodology that is suited for acquiring the particular data we need.
What varies from game to game are the specific questions we ask, e.g. does this mechanic aid the ability of players to play cooperatively in Left 4 Dead or how difficult is it to solve a particular puzzle in Portal 2. To that end, we try and choose the most appropriate playtesting methodology to answer any specific question. For some questions, direct observation will give you everything. For others, eye-tracking may help us pin down some answers. We try to develop a fundamental understanding of the pros and cons of each technique to determine how to best answer a particular question.
The single-player/multi-player dimension does change how we run playtests. With multi-player games we have to determine how well players interact across a variety of dimensions (skill level, communication, competitiveness, etc.) that are not present with single-player titles. We tend to run individual Q&As first after multiplayer tests (in group discussions, the group tends to anchor on the first response spoken aloud), but beyond that, the same philosophy described above applies.
What are the unique challenges involved in playtesting Dota 2? How has your playtesting approach for Dota 2 differed from that of other games?
With DOTA 2, we’re creating a product that has a massive playerbase with built-in expectations about content and gameplay. To that end, the tone of the experiments we design is modified somewhat as we have to both meet an existing bar and improve upon it as opposed to validating completely novel ideas. We started out by bringing in expert DOTA players internally and then began running a private beta where long-time participants in the DOTA community could send feedback. Once we had established that we were meeting (and exceeding) expectations, we decided to expand the playerbase with a more public beta where we aggregate gameplay behavior and feedback over millions of games. This is a departure from previous practice (although we are doing something similar with CS:GO), but it made sense for the goals of the product.
How does a player’s proficiency with a game affect the results of their playtesting? Do you find that experienced players are less apt to become emotionally aroused?
Typically, a playtester’s experience with a particular franchise or genre correlates highly with the precision of their feedback but not necessarily its utility (the game may be designed for a wide range of skillsets, and whether to appease the experts at the expense of the novices or vice versa is a question determined by the goals of the product). This is not a hard and fast rule, and we certainly want to extract data from a broad swath of the potential audience, but as a general rule, it seems to apply.
We haven’t really noticed any systematic distinction between the arousal levels of novices vs. experts. Experts may have more muted responses on average, but it seems that personality type plays a larger role than skill with the game. That said, we do see much greater novelty effects for novice gamers when surprising events occur; the expert players have adapted to these and tend not to exhibit much arousal in these circumstances.
Can you tell us about any instances where playtesting metrics have affected a fundamental part of a game’s design?
The quick and honest answer is that they have always affected fundamental aspects of the design of any game we have made. If you see game design as a series of choices we could make, we strive to make as many correct ones as possible. The most straightforward way to make correct choices is to base them off of valid data, so if we have a way to measure the impact or outcome of a particular choice, we will do so and apply the results to our decision-making.
Valve has been outspoken about its commitment to research and development, both in design and engineering. Has Valve created any of its own input devices which measure physiological responses?
We’ve built all our own hardware (save for the eye-tracker and EEG headsets) to measure physiological responses, and we’re continuing to iterate on future variants. There are a variety of companies out there with impressive technology to measure these signals, but when we build things internally, we have total control over the hardware design, so we can make any necessary modifications/iterations/improvements without being dependent upon a 3rd party.
What are the benefits of creating games which react to a player’s psychological state? How can this change game design?
Honestly, we are not sure. The hope is that if you can take advantage of this new dimension of player experience, you can create more calibrated experiences (think dynamic difficulty adjustments, NPCs responding to player state, etc.) as well as create qualitatively different gameplay experiences that rely on physiological signals as gameplay inputs.
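One way to picture the dynamic difficulty adjustment mentioned above: compare a short-term (phasic) average of the arousal signal against its long-running (tonic) baseline and nudge the game accordingly. This is a minimal sketch under assumed thresholds; the class name, window size, and 20% bands are all invented for illustration.

```python
# Hypothetical arousal-driven difficulty director (all names/thresholds assumed).
from collections import deque

class ArousalDirector:
    def __init__(self, window: int = 30):
        self.recent = deque(maxlen=window)  # short-term (phasic) sample window
        self.total = 0.0                    # running sum for the tonic baseline
        self.count = 0

    def update(self, conductance: float) -> str:
        """Feed one skin-conductance sample; return a difficulty nudge."""
        self.recent.append(conductance)
        self.total += conductance
        self.count += 1
        baseline = self.total / self.count            # tonic level
        phasic = sum(self.recent) / len(self.recent)  # recent level
        if phasic > baseline * 1.2:
            return "ease_off"   # player already highly aroused; back off
        if phasic < baseline * 0.8:
            return "ramp_up"    # player disengaged; add pressure
        return "hold"
```

A real system would need smoothing, per-player calibration, and validated thresholds, but the loop structure—continuous signal in, coarse design decision out—is the core idea.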
Our initial experiments have shown that the potential exists, and we’re going to keep performing them until we create a viable product, or we end up realizing that we’re wasting our time.
As far as incorporating the measurement of physiological signals into our playtesting practices, if you are getting reliable data, what you are given is a real-time look into a player’s emotional response second-by-second. Armed with that information, it is likely you will be able to pick up more sensitive responses to gameplay events. As a consequence, we may gain a better understanding of how to make a bad game good and a good game great.