UX Engineer II, Microsoft, 2009-2010
I launched Kinect for Xbox 360, designing for both the Xbox Dashboard (OS team) and Kinectimals (Microsoft Game Studios). Kinect brought natural user interfaces to the living room by enabling speech and gesture interactions. When Kinect went on sale during the 2010 holiday season, it received excellent reviews and set the record as the fastest-selling consumer electronics device in history.
The Problem
Kinect as a concept pushed the boundaries of tech and interactions. I was brought onto the team in December 2009 as a speech interaction expert with Tellme, a Microsoft subsidiary at the time. Our task was to bring the best practices and prior knowledge from designing voice systems for telephones and for multimodal phone applications to the 10-foot UI of a TV.
We had to work within significant technical constraints. Added to those were the typical challenges of a first-generation product: limited time, plans that changed in light of technical developments, and shifting points of view among the various teams building a first-of-its-kind experience.
We identified 3 areas of focus for the Dashboard:
- What can you say? What commands would enable the best speech experience? Our hard technical limit was only 10 commands active at any one time.
- How good is good enough? Xbox was, and still is, a premium brand with a loyal and demanding audience. What accuracy bar would we have to meet in order to maintain the brand loyalty and wow gamers?
- What will the international reception be? The country we were most concerned about was Japan, the birthplace of Nintendo and the Sony PlayStation. There were also cultural concerns, especially about speech recognition, which was not a common interaction in Japan at the time.
What I Did
What can you say?
My first task was to help define the point of view on which commands we’d support in order to answer the “What can you say?” question. My proposal, which the Tellme team lobbied for, was to focus on the end-to-end experience. I identified media control (playing games from the disc or launching and controlling movies and music) as the core use case for speech. This “jump into content” strategy conflicted with the Xbox team’s proposal of “see it, say it” where speech allowed users to select items shown on the screen.
We decided to test “see it, say it” with users to determine its utility, since it went against speech best practices for multimodal applications on mobile phones.
Avatar Editor Speech Wizard-of-Oz test
Working with our development counterparts, we created a Wizard-of-Oz (WOZ) prototype where we faked a speech-enabled Avatar Editor. We used the existing Avatar Editor application and created a PC-based interface that allowed a wizard (a person pretending to be the system) to select items from the 3×3 grid as well as navigate left and right, as if using the controller. To simulate “see it, say it,” the developers overlaid the metadata name of the feature onto the selection item. The user would read off the screen, speak their request to an inoperable Kinect sensor, and the wizard in the observation room would select the requested item through the WOZ interface. The user believed the system did the selection. This allowed us to test the user’s perception of the usability and value of “see it, say it.”
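For readers curious how a harness like this might be wired up, here is a minimal, hypothetical sketch in Python. The names (WizardConsole, select_item, navigate) and key details are illustrative assumptions, not the actual prototype, which drove the real Avatar Editor from a PC-based interface.

```python
# Hypothetical sketch of a Wizard-of-Oz control harness like the one described
# above. All names and behaviors are illustrative assumptions, not the actual
# prototype code, which controlled the real Avatar Editor from a PC interface.

from dataclasses import dataclass


@dataclass
class GridState:
    """Tracks the wizard's view of the 3x3 selection grid shown on the TV."""
    rows: int = 3
    cols: int = 3
    page: int = 0  # current page of avatar items


class WizardConsole:
    """Lets the hidden wizard act on the participant's spoken request,
    so the participant believes the system recognized their speech."""

    def __init__(self) -> None:
        self.state = GridState()

    def select_item(self, row: int, col: int) -> str:
        # In the real study this would send a controller-equivalent command
        # to the Avatar Editor; here we just echo the action.
        return f"SELECT item at row {row}, col {col}"

    def navigate(self, direction: str) -> str:
        if direction == "right":
            self.state.page += 1
        elif direction == "left":
            self.state.page = max(0, self.state.page - 1)
        return f"NAVIGATE {direction} -> page {self.state.page}"


if __name__ == "__main__":
    wizard = WizardConsole()
    # The participant says "blue shirt"; the wizard clicks the matching cell.
    print(wizard.select_item(row=1, col=2))
    # The participant says "next page"; the wizard pages right.
    print(wizard.navigate("right"))
```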
Depending on the session, I served as either the moderator or the wizard.
How good is good enough?
We also tested accuracy. To simulate various accuracy levels, the system would select the wrong item or no item at all. We tested three accuracy levels with an unusually high number of participants (~40 users) and then gave them a satisfaction survey so we could correlate user satisfaction with accuracy.
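To make that manipulation concrete, here is a small hypothetical sketch of how a harness could inject a fixed error rate per condition. The target rates, error split, and function names are assumptions for illustration, not the study's actual parameters.

```python
# Illustrative sketch of simulating fixed recognition-accuracy levels in a
# Wizard-of-Oz session. The specific target rates and error behaviors are
# assumptions for illustration, not the actual study parameters.

import random


def simulate_turn(correct_action: str, wrong_action: str,
                  target_accuracy: float, rng: random.Random) -> str:
    """Return the action the 'system' performs for one spoken request.

    With probability `target_accuracy` the wizard executes the correct action;
    otherwise the harness substitutes an error (a wrong selection or no
    response) so every participant in a condition sees the same accuracy.
    """
    if rng.random() < target_accuracy:
        return correct_action
    # Split the misses between wrong selections and no response at all.
    return wrong_action if rng.random() < 0.5 else "NO_RESPONSE"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a condition is reproducible
    for accuracy in (0.70, 0.85, 0.95):  # hypothetical condition levels
        outcomes = [
            simulate_turn("SELECT_REQUESTED_ITEM", "SELECT_WRONG_ITEM", accuracy, rng)
            for _ in range(100)
        ]
        observed = outcomes.count("SELECT_REQUESTED_ITEM") / len(outcomes)
        print(f"target {accuracy:.0%} -> observed {observed:.0%} over 100 turns")
```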
We learned two things. First, users were so excited to use their voice that they found “see it, say it” compelling in the context of the living room.
Second, we learned that there was an inflection point for accuracy and therefore set our accuracy benchmarks accordingly. This study was referenced throughout the project and for years to come as we worked on additional improvements to both Kinect and Xbox.
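As an illustration of that analysis step, here is a hypothetical sketch of averaging satisfaction ratings per accuracy condition and looking for where the gains level off. All numbers are made-up placeholders, not our study data.

```python
# Hypothetical illustration of the analysis step: average survey scores per
# accuracy condition and look for the point where satisfaction stops improving.
# The ratings below are made-up placeholders, not study data.

from statistics import mean

# condition (simulated accuracy) -> per-participant satisfaction ratings (1-7 scale assumed)
survey = {
    0.70: [3, 4, 3, 5, 4],
    0.85: [5, 6, 6, 5, 6],
    0.95: [6, 6, 7, 6, 6],
}

means = {acc: mean(scores) for acc, scores in sorted(survey.items())}
print("mean satisfaction by simulated accuracy:", means)

# A shrinking gain between conditions suggests the inflection point, i.e. the
# accuracy level beyond which extra accuracy buys little additional satisfaction.
levels = sorted(means)
for lo, hi in zip(levels, levels[1:]):
    gain = means[hi] - means[lo]
    print(f"{lo:.0%} -> {hi:.0%}: satisfaction gain {gain:+.1f}")
```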
What will the international reception be?
I traveled to Japan so we could conduct focus groups on Japanese users’ acceptance of speech in the living room.
I attended each session, coached the moderator to ensure our specific questions were addressed, and summarized the results so the rest of the team could also learn from the international study.
I also created point-of-view documents, wireframes, specs, and executive briefing presentations. In addition, I conducted a ton of UX QA, filing bugs and working through issues with developers and PMs.
What Shipped
Concurrent with the development of the speech perspective, a separate team was determining what was possible for launch with gesture. Based on the constraints of both natural user interface (NUI) modalities, the Kinect Hub was born. Instead of speech- and gesture-enabling the entire Dashboard, a separate area was created for NUI interactions.
“See it, say it” shipped in the initial Kinect release. A user saw a tip to “Say ‘Xbox’.” Once they said “Xbox,” they were prompted to say “Kinect” by reading it off the screen.
This launched the Kinect Hub, where “If you see it, just say it” appeared on screen, along with labels showing the commands on the content and additional commands in the bar at the bottom.
This UI was carried over to NUI-enabled partner applications, such as ESPN 3.
Additionally, the accuracy targets set by the usability study were met. Kinect was well received at launch, both by the American audience and internationally.
I also shared this work publicly: I wrote the only design whitepaper on speech interactions for Kinect v1 and spoke about the process of designing for the living room at SpeechTEK 2012 in New York City.