‘UTAU Voice Synth Anarchy’: Experiencing Gender through Digital Materiality

April Wei-West

PhD Researcher, Music
SOAS, University of London
668211@soas.ac.uk
 

On the discussion-based website Reddit, readers can find a forum, “UTAU: VOICE SYNTH ANARCHY,”[1] dedicated to the vocal synthesizer program UTAU, released in 2008. UTAU is free-to-download shareware that facilitates both the creation of synthesized vocal music and the production of a user’s own synthesized voicebank. This essay introduces two reflexive moments from my ethnography of creating a voicebank using my own voice and producing a song cover in UTAU. Complementing this, I draw from interviews with UTAU producers to explore how digital creativity and experiences produce affective dimensions of “life” in virtual communities. I characterize life as the presence of agency, and I explore how use of UTAU demonstrates this by affording particular gender subjectivities that are distributed across the physical body, the digital realm, and the synthesized voice.

UTAU was modelled after the arguably better-known Yamaha program, Vocaloid, released in 2004. Vocaloid is software that allows users to synthesize vocal melodies, including lyrics, from purchased voicebanks, which are marketed as virtual “singers.” The most famous of these is Hatsune Miku, who also has an associated animated visual character created to accompany the voicebank. UTAU and the more recently released Synthesizer V have been taken up by a large number of Vocaloid producers, and thus I treat “vocaloid” (lowercase) as a genre encompassing the creative practices and fan culture shared among these three software programs. Since vocaloid is distributed across various digital and material realities such as software, music uploads, social media, producers, fans, and so forth, Nick Prior (2021, 213) has aptly theorized vocaloid as “assemblage,” meaning it is made up of various heterogeneous parts which produce particular meaning when brought into contact with each other. However, where Prior exclusively refers to Miku as “she,” Matthew Thibeault and Koji Matsunobu (2020) note that vocaloid cannot be tied to one essential identity and thus suggest that vocaloid can be at once a “he,” “she,” “they,” and “it.” My ethnography of UTAU digital creativity draws attention to this multiplicity of actors in the construction of gender identities. Since gender in vocaloid has often been discussed in terms of Miku, I offer a different perspective – of the “he,” “they,” and “it” – by focusing on how vocaloid “producers” (a term applied in the vocaloid community to anyone who creates music using the software) experience gender through software use.

In existing research on Hatsune Miku, a virtual singer meant to give visual representation to synthesized voice(s) produced by software such as Vocaloid and UTAU, scholars have given significant attention to the implications of Miku’s holographic performance as simulacrum of the human body (Conner 2016; Prior 2021; Michaud 2022). A frequent criticism of Miku has been that “she” represents the performance of gender as determined by patriarchal social systems (Black 2012; Lam 2016; Sabo 2019). However, other scholars have acknowledged vocaloid as a multi-faceted, constantly shifting entity as a result of being used by a variety of musicians, including both amateur fans and professionals (McLeod 2016). This realization of Miku as constructed through manifold fan-made perceptions directs the discussion beyond Miku as product of constructed femininity, and into an instability in which Miku is at once synthesized voice, holographic image, software, fandom, and more. Indeed, one of my interlocutors frequently referred to Miku as both “she” and “it” interchangeably, stating that Miku can be “whatever the individual is willing to place on it.”[2]

The popularity of Miku as a virtual singer has led to the emergence of human singers who perform using virtual avatars. For example, utaite are singers who perform vocaloid covers with their human voices, and popular utaite such as Eve and Ado have gained mainstream success whilst remaining anonymous behind their animation-style character avatars. Similarly, VTubers (short for “Virtual YouTubers”) are online entertainers who present themselves through a fictional animated avatar, many of whom sing on livestream. Compared to Miku, these virtual singers present a more direct through-line between the human producer and their virtual embodiments, crafted in the digital realm. However, in the same way as Miku, these virtual singers rely on collective belief and emotional investment in the virtual space inhabited by these digital avatars. Anthropologist Tom Boellstorff (2008, 4), in his ethnography of the “virtual world” platform Second Life, explores how users become embodied subjects in online space. Boellstorff employs “techne” (2008, 237) to illustrate that the self is formed through the creativity of virtual engagements. Such methodology has foregrounded the real, tangible social effects of virtual space, thereby problematizing the assumption of “the real” as existing in the physical world (Boellstorff 2016, 387). This is especially true for Hatsune Miku, who is often marked as “unreal” in comparison to physical singers despite digital practices being shaped by the physical world, and vice versa (Duggan 2017; de Seta 2020). Virtual worlds, then, offer legitimate ways of forming human subjectivity. However, in this essay I take a posthuman approach to virtual singers and VTubers, as the human subject is embodied both through organic matter and digital materiality, extending the body as virtual prosthesis.

In this digital ethnography, I use UTAU and Discord, a livestream- and discussion-based social media platform, as my primary fieldsites. I developed an understanding of UTAU production processes through my own creative practice using UTAU, and through engagement with the online community on Discord. My approach to digital ethnography maintains a “focus on how people experience—and invest power and meaning in—communicative technologies” (Cooley, Meizel and Syed 2008, 91). This is not to neutralize technology as simply a set of tools that people use, but rather to emphasize the very culturally, socially, and politically situated ways in which technologies gain use and meaning. For example, Snape and Born (2022), in their ethnography of the Max music software program, trace agency starting at the materiality of the digital system, zooming out to wider institutional mediations of music. In doing so, the authors reveal the ways in which creative practice through Max is influenced by technological possibilities determined by algorithms, aesthetic conventions for the genres engaged through the software, and socio-economic factors such as the software distributors deliberately creating a monopoly within the digital audio production economy. Such methods reject any simplistic argument of technological voluntarism, in which technology assumes a neutral position, or technological determinism, which suggests that technology is wholly and irreversibly changing our lives (Taylor 2001, 26). Instead, noting the subtle and multiple ways in which aesthetic and social meaning is marked by both human and technological actors, they disrupt binaries between the real and virtual, online and offline, human and non-human which frequently emerge in discussions of digital media.

PERFORMING GENDER THROUGH THE VOICEBANK

Through my auto-ethnographic experience of creating a voicebank, I explore how gender is performed stylistically according to genre conventions, and how the software produces digital representations of a gender spectrum which open up the potential for non-binarized performances of voice. Following tutorials on the websites YouTube and UtaForum, I created a Consonant-Vowel voicebank in Japanese, a rite of passage of sorts for UTAU production. The language choice is in part due to Japan being the cultural epicenter of vocaloid, but also due to the ways in which the language is particularly phonologically conducive to concatenative vocal synthesis. To record my voicebank, I sang phonemes on a single pitch near the lower end of my vocal register, which I would normally sing with a warm vibrato due to my choral vocal training. However, I felt I was not singing with Japanese-sounding vowels and wanted my voice to be brighter, so I re-recorded my samples aiming for a breathier and sweeter vocal quality. To access this register I admittedly played into what I had imagined J-pop female singers to stereotypically sound like. I based this on popular renderings of Miku’s voice,[3] as well as early 2010s J-pop vocal performances by groups such as AKB48 which I associate with the “idol” genre. An idol is a type of singer in the Japanese popular music industry, identified through distinctive singing, dance, and fashion styles which contribute to a “kawaii” (“cute”) aesthetic presentation. The concept of “kawaii,” as Keith and Hughes (2016) show, is produced in line with normative expectations of youthful femininity, and female popular singers are trained to perform with a vocal style consisting of soft, high-pitched tone in the idol context specifically.

As Rafal Zaborowski (2023, 22) has noted, audiences relate to and invest emotionally (and financially) in J-pop idols through their “ordinary” vocal and dancing ability; thus, breathy phonation is key to a sense of youthful and untrained voice. However, upon importing the sound files of my sung phonemes into UTAU (Figure 1), I discovered that the resampler – the plugin that transforms the vocal sample into a synthesized sample – removes vibrato and individual vocal qualities, transforming the voice into a stereotypically robotic, digitally-altered one. This episode was only the beginning of a conscious process of materializing the voice, and subsequently gender, outside of the body with the manual manipulation of voice inside the UTAU software.

Figure 1. Screenshot of the “Voice Configurations” popup window in UTAU, showing the samples of phonemes that make up my voicebank, represented by both Roman script and Japanese hiragana.

Scholars such as Nina Eidsheim (2019) and Amanda Weidman (2021) have noted that the assumed unity between voice, body, and self has led to the conception of the voice as a marker of individual identity and patriarchal agency. This vocal ontology is evident in criticism of microphones for live vocal performance as inauthentic and emasculating in North American popular music (Ribac 2021, 56); it is further evident in discourse surrounding the emergence of the singer-songwriter trope, in which the voice marks the song with one’s unique self (Weidman 2021, 7). Vocal quality, then, can be a site of ossified assumptions about the physical body, such as youth and femininity in J-pop. But as Eidsheim (2019) has explained, this is not a process relating to essential biology, but rather social conventions for both audition and vocal production. If the voice is not merely a mediator of language, as voice studies scholar Cavarero (2005) suggests, then the digital voice marks one of the ways in which the voice, not as a discursive but a performative entity, materializes the gendered body.

The process of rematerializing the digital voice is reflected in UTAU’s “tuning” practices, which engage with both vocal quality and the cultural associations of such qualities. For example, the first step is to make the sung voice sound more “human” by adding micro-tonal transitions between notes (Figure 2). The other aspect of tuning is the adjustment of “flags,” which are parameters increased or decreased on a scale to bring out certain qualities in the voice, including “gender” and “breathiness.” As previously noted, the feminine voice in J-pop idol music is typified by high pitch; however, in UTAU, the fundamental sung pitch cannot be altered when tuning. Thus, the gender flag alters the brightness of the vowels, with the feminine end of the spectrum sounding comically squeaky, and the masculine morphed into a dark, rounded sound.

Although a binary definition of vocal gender is presented in UTAU, it is mapped upon a sliding scale, and notably, neither end of the scale is labelled as masculine or feminine. In fact, it was communal associations of vocal quality with gender that led to the gender flag – originally the “g flag” – being known as such. This is likely because the “gender factor” parameter in the related software Vocaloid adjusts the quality of the synthesized voice in the same manner, reinforcing the association between bright vocal quality and femininity. However, the tuning process enables flexible variations of vocal quality through the necessary adjustment of multiple parameters at once, meaning that characters such as Miku, with an officially ascribed gender, are not tied to an essential gendered vocal performance. Whereas before, when I was recording my samples, I recalled popular Miku songs that fed into my perception of the “kawaii” idol vocal stereotype, several tunings of Miku’s voice exist which challenge a vocal gender binary. For example, trends in vocaloid tuning have included raspy, almost pitch-less vocals,[4] as well as warm, sustained vocal quality[5] that reflects contemporary popular singers such as Ado. As I will explore in the following section, producers engage with the synthesized voice in relation to personal gender expression, which further affords gender non-conforming possibilities for the synthesized voice.

Figure 2. Screenshot of tuning in UTAU, in which original vocal samples are modified through elongation, by adding extra peaks and valleys in the pitch bends which join the notes together, and by adding vibrato on the す [su] note.

GENDER SUBJECTIVITIES THROUGH EXPERIENCING MATERIALITY

In this digital ethnography, I focus on the idea of synthesized vocal music as experience, in which the affective processes of creation take priority over the produced song. This is evidenced in the intermittent ways in which users engage with UTAU. One of my interlocutors, Kai, remarks: “What I do love about it [UTAU], is that you don’t have to keep up with it […]. It’s not something physical” (interview with the author, December 2022). Still, digital systems are composed of distributed materialities spanning beyond physical objects, encompassing code, electromagnetic waves, and sound itself in complex and multiple communicating layers. If, as Boellstorff (2008, 249) argues, cultures are “always already virtual” in their constructedness, UTAU should not be diminished as incorporeal software. We might then allow UTAU to take on a vibrancy that resonates as “a life,” recognised as such through its material potency. More than a life as anthropomorphized virtual singers, “life” in UTAU denotes an ontological vitality. What it means to be alive is problematized within the materialist philosophical tradition, which explores how human subjects are influenced by the material world, arguing for the dissolution of hierarchy between objects and humans in configurations of agency.

In critical organology, Eliot Bates (2012, 373–374) has adopted a similar approach, employing Actor–Network Theory to highlight how musical instruments can embody social agencies. In this approach, the musical instrument as a bounded object can be envisaged as an actor within a network alongside instrument players, makers, listeners, and other social actors. However, in digital music, various open-ended mediations form the synthesized voice, such as software, hardware, code, and so forth. Thus, scholars such as Georgina Born (2005) and Paul Théberge (2017) use assemblage to describe digital instruments, defined by the ad hoc assembly of human and non-human subjects to produce agential potential.

In UTAU, voicebanks are characterized as virtual singers who, although only enlivened with voice when programmed to sing, challenge what it means to be alive due to their digital materiality. As materialist philosopher Jane Bennett writes, “a life inhabits that uncanny nontime existing between the various moments of biographical or morphological time” (2010, 53). This is reflected in Kai’s voicebank, Karui Yami, whom he created in 2009 and has continuously updated to the present day. Kai described his relationship with Karui Yami as follows:

My UTAU is kind of like my son. […] I wanna see him grow and flourish. […] Because I have my own voice in him, it gives him a little more importance to me than just an OC [Original Character]. So it’s a little hard to completely separate your UTAU from yourself. (interview with the author, December 2022)

The life trajectory of Karui Yami shows its independence from his maker, existing through the nontimes and generating some resistance against Thibeault and Matsunobu’s suggestion that virtual singers are contingencies, only alive when in direct use by humans (2020). However, it remains reliant on the human agent, as a virtual prosthesis, a concept explored by posthuman feminist authors as an exploration of how agency manifests beyond the organic body (Hayles 1999).

My experience with UTAU was not immediately prosthetic, as once I began producing with my UTAU, I did not feel like it had a personality. I never heard myself in it, and I never anthropomorphized it; it felt horribly alien to me. But this was completely reversed at the last moment when I radically changed the timbre of the backing vocal by adjusting the g flag, writing in my fieldwork journal:

This whole time I was struggling to hear the VB [voicebank] as my own voice/my own, but as I set my resampler to g-10 [hyper-feminine] flag, I was washed over by a feeling of the uncanny, hearing that voice that really was no longer my own.    

I was caught off-guard by the metamorphosis that occurred when I entered vocal territory that I would not be able to produce with my “real” singing voice. The voicebank suddenly gained a vibrancy of its own and that unsettled my preconceived notions of the agency of virtual singers. However, this was also the synthetic process of mutation which acts as a mark of the posthuman, the voice extending into the Other and usurping any presupposition of vocal subjectivity as a given to humanness, as the virtual voice gained its own identity.

For two of my interlocutors who are transgender, UTAU was a way of taking on new modes of identification. One Discord user I communicated with was anxious about being misgendered when recording his own voice. For him, UTAU acted as something of a voice changer, allowing him to perform his desired vocal gender.[6] Similarly, for Kai, making his UTAU character sing in a lower vocal range was a process that prompted him to realize his transgender identity: “I was like ‘well, what if I tried being a guy like Sora?’ […] I ended up being like ‘wow, that feels great’” (interview with the author, December 2022). Unlike a voice changer, which would edit the voice live, crafting a voicebank is a continuous practice of creating a separate vocal entity with various virtual potentials. As a result, the UTAU voicebank has the ability to act upon the subject, enabling material change by way of my interlocutors’ transgender identity.

CONCLUSION

In this paper I have employed a posthuman conception of the synthesized voice in UTAU to show how gender subjectivities are produced. In my own production of vocal samples to create a voicebank, I navigated my own culturally produced biases of gendered vocal production and perception in J-pop idol music. This revealed that the vocal quality that I deliberately put on to imitate J-pop idols gains its female body not by way of any essential biology, but through the socially constructed expectations of idols’ appearance, personality, and voice. Thus, in line with posthuman thought, the voice does not automatically represent a priori self and agency as located in the human body. The materialization of gender outside of the body is reflected within UTAU digital creativity, with my transgender interlocutors’ experiences highlighting how subjective experiences of gender take place beyond the boundaries of the embodied self. Indeed, voicebanks in UTAU not only translate the human body into data but seem to take on a life of their own, as I felt about my own voicebank. This is not to suggest a complete material exchange in which I no longer had control over the voice, nor that there is no presence of my voice in it, but to draw attention to the fact that the voice in UTAU does not flow – and never has – neatly from body to world, and is instead spun out into various mediations. Through these mediations, both the technological practicalities of how vocal quality is configured as parameters and the cultural and aesthetic conventions of vocaloid music produce gender in UTAU. Though I caution against a hierarchy of human above non-human, since the synthesized voice has proven its agential potential, I ultimately find joy in UTAU because of its ability to cast producers as more human, by offering manifold, gender non-conforming possibilities for creative acts of self.


NOTES

[1] https://www.reddit.com/r/utau/

[2] Anonymous user, Discord direct message to author, November 29, 2022.

[3] For example, “Melt” produced by Ryo from Supercell; “Miku” produced by Anamanaguchi; “Aishite, Aishite, Aishite” produced by Kikuo.

[4] For example, “Gomenne Gomenne” produced by Kikuo.

[5] A cover of “Melt” with an updated version of the Miku voicebank presents a warmer vocal quality: Eji Warp. 2016. “Melt with Miku V3 (from Cillia).” YouTube, August 2, 2016. https://www.youtube.com/watch?v=vRmI3e5kJy0&ab_channel=EjiWarp

[6] Anonymous user, Discord direct message to author, December 10, 2022.

REFERENCES

Bates, Eliot. 2012. “The Social Life of Musical Instruments.” Ethnomusicology 56 (3): 363–95.

Bennett, Jane. 2010. Vibrant Matter: A Political Ecology of Things. Duke University Press.

Black, Daniel. 2012. “The Virtual Idol: Producing and Consuming Digital Femininity.” In Idols and Celebrity in Japanese Media Culture, edited by P. W. Galbraith and J. G. Karlin, 209–228. Palgrave Macmillan.

Boellstorff, Tom. 2008. Coming of Age in Second Life: An Anthropologist Explores the Virtually Human. Princeton University Press.

Boellstorff, Tom. 2016. “For Whom the Ontology Turns: Theorizing the Digital Real.” Current Anthropology 57 (4): 387–407.

Born, Georgina. 2005. “On Musical Mediation: Ontology, Technology and Creativity.” Twentieth-Century Music 2 (1): 7–36.

Cavarero, Adriana. 2005. For More than One Voice: Toward a Philosophy of Vocal Expression. Translated by Paul A. Kottman. Stanford University Press.

Conner, Thomas. 2016. “Hatsune Miku, 2.0Pac, and Beyond: Rewinding and Fast-Forwarding the Virtual Pop Star.” In The Oxford Handbook of Music and Virtuality, edited by Sheila Whiteley and Shara Rambarran, 129–146. Oxford University Press.

Cooley, Timothy J., Katherine Meizel, and Nasir Syed. 2008. “Virtual Fieldwork: Three Case Studies.” In Shadows in the Field: New Perspectives for Fieldwork in Ethnomusicology, edited by Gregory F. Barz and Timothy J. Cooley, 90–107. Oxford University Press.

Duggan, Mike. 2017. “Questioning ‘Digital Ethnography’ in an Era of Ubiquitous Computing.” Geography Compass 11 (5): 1–12.

Eidsheim, Nina Sun. 2019. The Race of Sound: Listening, Timbre, and Vocality in African American Music. Duke University Press.

Hayles, N. Katherine. 1999. How We Became Posthuman: Virtual Bodies in Cybernetics, Literature, and Informatics. University of Chicago Press.

Jackson, Louise, and Mike Dines. 2016. “Vocaloids and Japanese Virtual Vocal Performance: The Cultural Heritage and Technological Futures of Vocal Puppetry.” In The Oxford Handbook of Music and Virtuality, edited by Sheila Whiteley and Shara Rambarran, 101–110. Oxford University Press.

Keith, Sarah, and Diane Hughes. 2016. “Embodied Kawaii: Girls’ Voices in J-Pop.” Journal of Popular Music Studies 28: 474–487.

Lam, Ka Yan. 2016. “The Hatsune Miku Phenomenon: More Than a Virtual J-Pop Diva.” The Journal of Popular Culture 49 (5): 1107–1124.

McLeod, Ken. 2016. “Living in the Immaterial World: Holograms and Spirituality in Recent Popular Music.” Popular Music and Society 39: 501–515.

Michaud, Alyssa. 2022. “Locating Liveness in Holographic Performances: Technological Anxiety and Participatory Fandom at Vocaloid Concerts.” Popular Music 41 (1): 1–19.

Prior, Nick. 2021. “STS Confronts the Vocaloid: Assemblage Thinking with Hatsune Miku.” In Rethinking Music through Science and Technology Studies, edited by Antoine Hennion and Christophe Levaux, 213–226. Routledge.

Ribac, François. 2021. “Is DIY a Punk Invention?: Learning Processes, Recording Devices, and Social Knowledge.” In Rethinking Music through Science and Technology Studies, edited by Antoine Hennion and Christophe Levaux, 47–66. Routledge.

Sabo, Adriana. 2019. “Hatsune Miku: Whose Voice, Whose Body?” INSAM Journal of Contemporary Music, Art and Technology 1: 65–80.

Schlichter, Annette. 2011. “Do Voices Matter? Vocality, Materiality, Gender Performativity.” Body & Society 17 (1): 31–52.

de Seta, Gabriele. 2020. “Three Lies of Digital Ethnography.” Journal of Digital Social Research 2 (1): 77–97.

Snape, Joe, and Georgina Born. 2022. “Max, Music Software and the Mutual Mediation of Aesthetics and Digital Technologies.” In Music and Digital Media, edited by Georgina Born, 220–266. UCL Press.

Taylor, Timothy. 2001. Strange Sounds: Music, Technology and Culture. Routledge.

Théberge, Paul. 2017. “Musical Instruments as Assemblage.” In Musical Instruments in the 21st Century, edited by Till Bovermann, Alberto de Campo, Hauke Egermann, Sarah-Indriyati Hardjowirogo, and Stefan Weinzierl, 59–66. Springer.

Thibeault, Matthew D., and Koji Matsunobu. 2020. “Learning From Japanese Vocaloid Hatsune Miku.” In The Oxford Handbook of Social Media and Music Learning, edited by Janice L. Waldron, Stephanie Horsley, and Kari K. Veblen, 511–527. Oxford University Press.

Weidman, Amanda. 2021. Brought to Life by the Voice: Playback Singing and Cultural Politics in South India. University of California Press.

Zaborowski, Rafal. 2023. Music Generations in the Digital Age. Amsterdam University Press.