VoiceAvatar - A 3D Natural Emulation Design Approach To Virtual Communities

Research Collaborators
Steve DiPaola, David Collins

The design goal of this project was to develop avatars and virtual communities where the participants sense a tele-presence: that they are really there in the virtual space with other people. This collective sense of “being-there” does not happen over the phone or with teleconferencing; it is a new and emerging phenomenon, unique to 3D virtual communities. While this group presence paradigm is a simple idea, the design and technical issues involved in achieving it on internet-based, consumer PC platforms are complex. This design approach relies heavily on the following immersion-based techniques:

  • 3D distance-attenuated voice and sound with stereo “hearing”,
  • a 3D navigation scheme that strives to be as comfortable as walking around,
  • an immersive first person user interface with a human vision camera angle,
  • individualized 3D head avatars that breathe, have emotions and lip sync, and,
  • 3D space design that is geared toward human social interaction.

These techniques, which borrow from disciplines such as group dynamics, facial animation, architectural design, virtual reality and cognitive sciences, allow the system to draw from the natural social neural programming inherent in all of us rather than creating artificial, social-enabling user interface mechanisms. The main goal of all of these techniques is to support multi-participant communication and socialization.

3D voice with 3D navigation
Communication is the structural process of any community, real or virtual, and the most natural human form of communication is verbal. Verbal communication carries both an explicit and an implicit message. We therefore designed 3D spatial multi-participant voice with distance attenuation and stereo positioning. Avatars closest to you are heard the loudest; those to your right are louder in your right speaker. With this approach, the user interface mechanism becomes as simple as navigating towards the avatars you want to talk to and, thereby, away from those you no longer want to talk to, just as you would at a real cocktail party. Combining spatial sound with 3D navigation lets natural group dynamics emerge: several small circular conversational groups of 3 to 6 avatars form and dynamically re-form depending on individual and group social preferences.
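As a rough illustration of distance attenuation with stereo positioning, the sketch below computes left and right gains for one speaker as heard by one listener. It is not Traveler's actual audio code: the inverse-distance falloff, the constant-power pan law, the 2D ground-plane coordinates and the names stereo_gains, ref_distance and max_distance are all assumptions chosen for the example.

```python
import math

def stereo_gains(listener_pos, listener_heading, speaker_pos,
                 ref_distance=1.0, max_distance=30.0):
    """Return (left_gain, right_gain) for one speaker as heard by the listener.

    Positions are (x, y) points on the ground plane. The heading is the
    direction the listener faces, in radians, counter-clockwise from +x.
    All parameters and the falloff curve are illustrative assumptions.
    """
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    if distance >= max_distance:
        return (0.0, 0.0)  # too far away to be heard at all

    # Inverse-distance falloff, clamped so very close speakers stay at gain 1.
    attenuation = ref_distance / max(distance, ref_distance)

    # Angle of the speaker relative to the facing direction.
    # Positive bearing means the speaker is to the listener's left.
    bearing = math.atan2(dy, dx) - listener_heading

    # Pan in [-1, 1]: -1 is fully left, +1 is fully right.
    pan = -math.sin(bearing)

    # Constant-power panning keeps perceived loudness steady across the arc.
    theta = (pan + 1.0) * math.pi / 4.0
    return (attenuation * math.cos(theta), attenuation * math.sin(theta))

# Listener at the origin facing +y; one avatar 2 units to the right,
# another 4 units straight ahead.
right_side = stereo_gains((0.0, 0.0), math.pi / 2, (2.0, 0.0))
ahead = stereo_gains((0.0, 0.0), math.pi / 2, (0.0, 4.0))
```

With these assumptions, the avatar on the right is heard almost entirely in the right speaker, while the one straight ahead is quieter and balanced between both. Plain stereo panning of this kind cannot distinguish front from back, which is one reason the listener's own navigation matters.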

Avatar design: Binding the pair
Given finite CPU, polygon and bandwidth resources, we invested them first in face-based avatars. The body, with its hand gestures and body language, is secondary for human communication and can be added as resource limitations ease. Our goal is what we call “binding the pair”: binding the real person at the computer with their virtual avatar in cyberspace so that they experience this feeling of tele-presence, of really being there. You cannot believably bind a person with an inanimate object or a texture-mapped photograph that does not emote. We tried to achieve “life” and believability with avatars that have autonomous blinking and facial movements (e.g. “breathing”), that lip sync to their voices and that can display (under user control) a range of emotions.

We now have some early positive indications that this technique is working: users make “eye contact” with each other, turn towards the speaking avatar, and can feel uncomfortable when another avatar comes too close and “invades their personal space”. This last point was very encouraging considering our goal of “binding the pair”. If someone in real life comes too close to you, you feel discomfort along with a physical tightening of your stomach muscles. The same sensation occurs in the Traveler worlds, showing that users perceive at some level that they are really there with other people: avatars are perceived as beings, not as objects being manipulated by other users on their home computers.


This work was first created at the start-up Onlive! which initially implemented the Traveler software and communities (DiPaola and Collins were among the main architects). DiPaola has continued this research at Stanford (see presentations) and now here at the iVizLab. Traveler is now an open software effort (Creative Commons license) and we continue to investigate natural emulation techniques with the evolving Traveler community and software.

Technical Issues

The basic hypothesis in implementing Traveler was that the use of human voice is the most natural way to carry on shared conversation. The implementation of an effective multi-voice audio environment was the primary design target.
Traveler was intended to allow for a virtual multi-way conversation, with participants contributing randomly, spontaneously and in arbitrarily shifting combinations. Since voice communication is fundamentally interactive, it was considered essential to allow for interjections, overlapping commentary, encouraging responses and other natural elements of verbal communication. This required providing a mixed stream of audio on the down channel. To create this effect in a limited-bandwidth environment, Traveler provides each client with up to two audio channels on the downlink, chosen from all the available uplinked audio streams. Each client receives a different set of audio channels, selected by a number of heuristics that take into account proximity to other speakers and which of the other participants are speaking at any given moment. Since the downlink stream set is reevaluated every 60 milliseconds, the resulting voice environment appears to be perfectly fluid and arbitrarily complex.
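The per-client channel selection described above can be sketched as follows. The text does not spell out the heuristics, so this example assumes the simplest rule consistent with it: among participants who are currently speaking, pick the nearest ones, up to the two-channel limit. The Participant type and the select_downlink function are hypothetical names, not part of the Traveler software.

```python
import math
from dataclasses import dataclass

MAX_DOWNLINK_CHANNELS = 2      # "up to two audio channels on the downlink"
REEVALUATION_INTERVAL_MS = 60  # the selection is recomputed at this interval

@dataclass
class Participant:
    name: str
    x: float
    y: float
    speaking: bool

def select_downlink(listener, participants, max_channels=MAX_DOWNLINK_CHANNELS):
    """Pick which uplinked voice streams one listener receives.

    Assumed heuristic: keep only active speakers other than the listener,
    then prefer the ones closest to the listener.
    """
    candidates = [p for p in participants
                  if p.speaking and p.name != listener.name]
    candidates.sort(key=lambda p: math.hypot(p.x - listener.x,
                                             p.y - listener.y))
    return [p.name for p in candidates[:max_channels]]

me = Participant("Me", 0.0, 0.0, speaking=False)
room = [
    me,
    Participant("Ann", 1.0, 0.0, speaking=True),
    Participant("Bob", 5.0, 0.0, speaking=True),
    Participant("Cy", 2.0, 0.0, speaking=False),
    Participant("Di", 3.0, 0.0, speaking=True),
]
print(select_downlink(me, room))  # prints ['Ann', 'Di']
```

A server following this sketch would call select_downlink once per client every REEVALUATION_INTERVAL_MS, so the set of voices each person hears shifts as avatars move and as people start or stop talking.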

Figure: Multi-point, full duplex voice codec with additive bridging where the server can add incoming compressed streams from speakers A and B; f(A) + f(B) = f(A + B).
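The identity in the caption, f(A) + f(B) = f(A + B), holds for any codec whose encoding step is linear. The toy example below demonstrates only that property; it performs no real compression and is not the codec used in Traveler. The encode, decode and bridge functions are hypothetical, and the sum/difference transform was chosen purely because it is linear and exactly invertible on integers.

```python
def encode(samples):
    """Toy linear transform: each pair (s0, s1) becomes (s0 + s1, s0 - s1)."""
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    coded = []
    for i in range(0, len(samples), 2):
        s0, s1 = samples[i], samples[i + 1]
        coded.extend((s0 + s1, s0 - s1))
    return coded

def decode(coded):
    """Invert encode(): recover (s0, s1) from each (sum, difference) pair."""
    samples = []
    for i in range(0, len(coded), 2):
        total, diff = coded[i], coded[i + 1]
        # total + diff == 2 * s0 and total - diff == 2 * s1, so both are even.
        samples.extend(((total + diff) // 2, (total - diff) // 2))
    return samples

def bridge(coded_a, coded_b):
    """Server-side mix: add two encoded frames without decoding them."""
    return [a + b for a, b in zip(coded_a, coded_b)]

speaker_a = [1, 2, 3, 4]
speaker_b = [10, -20, 30, -40]
mixed_pcm = [a + b for a, b in zip(speaker_a, speaker_b)]

# Adding the encoded frames gives the encoding of the mixed signal.
assert bridge(encode(speaker_a), encode(speaker_b)) == encode(mixed_pcm)
# A client that decodes the bridged frame hears both speakers at once.
assert decode(bridge(encode(speaker_a), encode(speaker_b))) == mixed_pcm
```

The practical benefit of such a property is that the server can combine voices with simple additions on the encoded data, avoiding a decode, mix and re-encode cycle for every listener.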

The richness of the audio environment is further enhanced by localization of the audio data. The client software uses its knowledge of the relative positions of avatars to individually attenuate and stereo-locate the corresponding voice channels. This allows users to manage the influence of their voice on individuals and groups by approaching or retreating from other avatars. For example, a user listening to one speaker is vaguely aware that another group of users is speaking some distance away. As a member of the distant group approaches, his or her voice becomes increasingly loud and mixes with the voice of the original speaker. The direction and distance of the new speaker can be intuited from the attenuation and stereo cues.

Figure: With spatialized (3D) multi-point audio, avatar C hears others with distance attenuation and stereo positioning, hearing avatar A louder and more to the right than avatar B.
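The final client-side step, combining the received voice channels into one stereo signal, might look like the sketch below. It assumes each channel arrives as a list of integer PCM samples together with a left and right gain already derived from the avatars' relative positions; mix_to_stereo, the 16-bit clipping range and the gain values in the example are illustrative assumptions, not details taken from Traveler.

```python
def mix_to_stereo(channels, limit=32767):
    """Mix mono voice channels into a list of (left, right) sample pairs.

    channels is a list of (samples, left_gain, right_gain) tuples.
    Output samples are rounded and clipped to the signed 16-bit range.
    """
    if not channels:
        return []

    def clip(value):
        return max(-limit - 1, min(limit, int(round(value))))

    # Only mix as many samples as every channel can supply.
    length = min(len(samples) for samples, _, _ in channels)
    mixed = []
    for i in range(length):
        left = sum(samples[i] * lg for samples, lg, _ in channels)
        right = sum(samples[i] * rg for samples, _, rg in channels)
        mixed.append((clip(left), clip(right)))
    return mixed

# As in the figure: avatar A is louder and further right than avatar B.
voice_a = ([1000, 2000], 0.25, 0.75)
voice_b = ([400, -400], 0.25, 0.125)
print(mix_to_stereo([voice_a, voice_b]))  # prints [(350, 800), (400, 1450)]
```

Because each channel keeps its own pair of gains until this last step, a listener who walks toward avatar B only needs new gain values; the incoming audio itself is untouched.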

By seamlessly combining these various audio techniques, Traveler provides users with a broad range of natural social behaviors in the shared environment. Mixed audio allows users to interrupt and interject as well as defer or refuse to defer to new speakers. These behaviors are all managed with standard social conventions, as opposed to artificial techniques, such as HAM-radio-style queues. Spatialized mixing allows natural and fluid formation of groups as well as smooth transition from one group to another.

Download and Links
Papers and Links
Siggraph’03 Paper  PDF paper: “A Social Metaphor-based 3D Virtual Environment” from the Siggraph’03 education conference
Siggraph Presentation  Siggraph’99 talk: Authoring the Self: Identity and Role-Playing in Virtual Communities.
Stanford Presentation  Stanford Learning Lab ’00 talk: A Real-time Group Communication System using Immersive Natural Metaphors
Stanford Research Site  Chasing Alice – A 3D narrative art experience. DiPaola’s research at Stanford on Interactive Narrative.
Traveler Software Site  Site of Digital Space Traveler software and communities. Software is now open via Creative Commons license.