Collective Craft: How Artists Collaborate to Train AI-based Audio Synthesis Model for Music
DOI: https://doi.org/10.1145/3803784.3807533
C&C '26: Creativity and Cognition, London, United Kingdom, July 2026
Recent advances in generative AI have enabled new forms of audio synthesis for musical creation, yet model training remains underexplored as an artistic and collaborative practice. While prior work has focused on composition tools and live performance interfaces, how artists train their models has received less attention. This paper presents a qualitative study of collaborative training practices around audio synthesis models. We conducted semi-structured interviews with 13 artists across 8 musical projects relying on custom-trained models for musical performance or installations. Through thematic analysis, we show that artists approach training as a situated, craft-like practice rather than a purely technical task. Collectives develop shared languages and embodied practices to communicate and guide training, using iterative feedback and dataset sculpting to shape model behavior, with musical engagement as a primary method for steering models. We discuss implications for better supporting model training as a key site of interaction and collaboration.
ACM Reference Format:
Théo Jourdan, Jules Françoise, and Frédéric Bevilacqua. 2026. Collective Craft: How Artists Collaborate to Train AI-based Audio Synthesis Model for Music. In Creativity and Cognition (C&C '26), July 13–16, 2026, London, United Kingdom. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3803784.3807533
1 Introduction
Research on Artificial Intelligence (AI) for music encompasses a wide range of applications, from real-time interaction and gesture analysis to symbolic composition and audio generation. Early approaches relied on feature-based classification and regression models to map gestures to sound, supported by tools [8, 21] that framed their use as an interactive and creative process. With the rise of Deep Learning (DL), models increasingly operate directly on low-level musical data, enabling symbolic music generation and audio synthesis [5]. Unlike handcrafted approaches such as rule-based music generation systems [46], Neural Audio Synthesis (NAS) models do not rely on explicit compositional rules predefined by the artist, but instead learn musical structures directly from data. Over the past few years, a growing number of works have explored the use of these models for real-time music performance, investigating ways of controlling NAS through tangible or embodied instruments [39, 45], or visual interfaces [7, 49]. Nevertheless, the adoption of such techniques by artists remains confined to musical communities with strong ties to scientific research. This raises questions about the barriers preventing wider use and appropriation of these technologies.
To analyze these factors, previous research has examined ethical issues related to data collection [14, 43] or authorship and copyright [2, 13], yet the training of NAS models by artists or within artist-centered collectives remains largely underexplored. In this study, we consider the training phase as encompassing data curation, model optimization, and model evaluation. Training is often treated as a background technical step, minimally described in artistic and research projects, and is typically carried out through code-based workflows on GitHub repositories or notebook environments. Yet, from a creative perspective, we argue that training is a pivotal phase of artistic production and collaboration, often involving diverse actors – such as artists, engineers, researchers or instrument designers – and during which aesthetic intentions are negotiated and gradually embedded into the model. From this standpoint, training is not merely a technical prerequisite but a central site of collective creative decision-making. Accordingly, this work is guided by the following research questions:
- How do artists collectively make sense of opaque NAS training processes?
- How are artistic intentions communicated, negotiated, and inscribed into models during training?
To address these questions, this paper presents a qualitative study of collaborative training practices around NAS models. We conducted semi-structured interviews with 13 artists involved in 8 musical projects that relied on custom-trained models for performance or installation. Through thematic analysis, we identify how artists approach training not as a purely technical operation, but as a situated, craft-based, and collaborative practice. Our findings show that artists navigate epistemic tensions between machine learning (ML) assumptions and embodied musical knowledge by developing shared languages, embodied frameworks and musical engagement strategies. Rather than seeking large, all-purpose models, artists tend to prioritize controllability, playability, and expressive precision, which leads them toward iterative strategies of specialization. Thus, training is shaped through dataset sculpting, collective listening, improvisation, and performance, which function both as methods for understanding model behavior and as criteria for model steering. Finally, contrary to narratives that frame AI as democratizing music-making, our findings suggest that these processes remain complex and frequently require mediation through technical expertise. By foregrounding training as an artistic and social process, this work contributes a missing perspective to research on creative use of AI-based music systems.
The paper is structured as follows: Section 2 reviews related work on NAS models and their creative applications, and situates our contribution within research examining how AI/ML technologies are adopted in artistic practice. Section 3 outlines our methodological approach, introducing the interviewees and artistic projects involved in the study. Section 4 presents the results of our thematic analysis, structured around three main themes and seven subthemes. Finally, Section 5 discusses the implications of these findings for collaborative dynamics, roles and forms of expertise, and argues for the development of more interactive approaches to model training.
2 Background and Related Work
Research on AI for music making and performance spans a wide range of uses and applications, from movement-based interaction [10, 20] to symbolic composition [36, 50] and audio generation [6, 17]. In this article, we focus on ML-based sound synthesis techniques, in particular those based on neural network architectures, rather than issues of mapping or symbolic music generation. In this background section, we first review NAS techniques and discuss their use in musical performance. Then, we situate the present study within related research on qualitative AI-art inquiries.
2.1 Models for Neural Audio Synthesis
Advances in DL have enabled sound synthesis models capable of operating on complex, low-level audio representations. Unlike handcrafted or rule-based music generation systems [38, 41], these models learn directly from musical corpora rather than relying on explicitly defined compositional rules, allowing them to generalize across styles and generate novel musical material from data.
DL-based sound synthesis generally relies on approaches that generate either waveforms (e.g., autoregressive models such as WaveNet [47]) or time–frequency representations (e.g., spectrogram-based approaches such as RAVE [6]). We now outline the main neural synthesis methods used for musical performance. For a full technical review of neural audio generation, see Božic et al. [3].
Much work on NAS relies on generative models that map latent or stochastic representations to audio, enabling both high-quality synthesis and expressive interaction. Autoencoder-based approaches, such as Variational Autoencoders (VAEs) [30], structure generation around a learned latent space that supports smooth interpolation and controlled variation. Adversarial extensions such as Generative Adversarial Networks (GANs) [25] further improve perceptual realism by training a generator network in competition with a discriminator that distinguishes real from synthesized audio. Models such as RAVE [6], WaveGAN [15], GANSynth [17], and VQ-VAE [11] exemplify different trade-offs between temporal resolution, spectral structure, and controllability. In performance contexts, these systems are compelling because the latent space functions as an instrument, enabling performers to navigate, interpolate, and shape timbre in real time through gestures or control signals.
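To make the idea of navigating a latent space concrete, the following minimal sketch interpolates linearly between two latent vectors. It is an illustration under stated assumptions rather than the method of any cited model: the vectors `z_flute` and `z_voice` are hypothetical hand-written values, and the trained decoder that would turn each latent point into audio is deliberately left out.

```python
from typing import List


def lerp(z_a: List[float], z_b: List[float], t: float) -> List[float]:
    """Linearly interpolate between two latent vectors for t in [0, 1]."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same dimension")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a: List[float], z_b: List[float], steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced latent points from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical 4-dimensional latent codes, standing in for two encoded sounds.
z_flute = [0.0, 1.0, -2.0, 0.5]
z_voice = [1.0, -1.0, 2.0, 0.5]

# In a real system, each point on this path would be passed to a trained
# decoder to produce audio; that decoder is not part of this sketch.
path = interpolation_path(z_flute, z_voice, steps=5)
```

In a performance setting, the interpolation parameter `t` would typically be driven by a gesture or control signal rather than stepped through a fixed path.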
Autoregressive models based on recurrent neural networks (RNNs) [34], such as SampleRNN [32], generate audio sequentially, capturing both long-term structure and fine temporal detail. More recently, diffusion models [40], including DiffWave [31] and AFTER [12], have enabled high-fidelity synthesis through iterative denoising. Complementing these data-driven approaches, Differentiable Digital Signal Processing (DDSP) models [18, 26] combine neural networks with explicit synthesis modules, mapping musically meaningful features such as pitch or loudness to interpretable controls, and are particularly suited for real-time interaction and instrument modeling.
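The kind of explicit synthesis module that DDSP-style models rely on can be sketched as a small harmonic (additive) oscillator bank. This is a simplified illustration, not the implementation of any cited system: in a DDSP model a neural network would predict the amplitude and harmonic weights from features such as pitch and loudness, whereas here they are set by hand, and the function name and parameters are assumptions made for the example.

```python
import math
from typing import List


def harmonic_synth(
    f0_hz: List[float],
    amplitude: List[float],
    harmonic_weights: List[float],
    sample_rate: int = 16000,
) -> List[float]:
    """Render audio from per-sample pitch and amplitude controls.

    Each output sample is a weighted sum of sinusoids at integer multiples
    of the fundamental frequency. Weights are normalized to sum to 1.
    """
    if len(f0_hz) != len(amplitude):
        raise ValueError("f0_hz and amplitude must have the same length")
    total = sum(harmonic_weights)
    if total <= 0:
        raise ValueError("harmonic_weights must have a positive sum")
    weights = [w / total for w in harmonic_weights]
    nyquist = sample_rate / 2.0

    phase = 0.0  # running phase of the fundamental, in radians
    audio = []
    for f0, amp in zip(f0_hz, amplitude):
        # Accumulate phase so that pitch glides stay continuous.
        phase += 2.0 * math.pi * f0 / sample_rate
        sample = 0.0
        for k, w in enumerate(weights, start=1):
            if k * f0 < nyquist:  # skip harmonics that would alias
                sample += w * math.sin(k * phase)
        audio.append(amp * sample)
    return audio


# Hand-written controls: a 0.1 s glide from 220 Hz to 440 Hz at constant level.
n = 1600
f0 = [220.0 + 220.0 * i / (n - 1) for i in range(n)]
amp = [0.5] * n
audio = harmonic_synth(f0, amp, harmonic_weights=[1.0, 0.5, 0.25])
```

Because the controls are pitch, level, and harmonic balance, each input stays interpretable to a musician, which is the property the paragraph above attributes to DDSP models.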
2.2 Applications of Neural Audio Synthesis for Musical Performance
From an application standpoint, artists appropriate these techniques in diverse ways, reflecting different aesthetic goals and performance contexts. We identified two main paradigms, where NAS models are used either as a controlled synthesizer, potentially to build embodied instruments, or as (semi-)autonomous agents.
A first broad category may be described as AI used as a controlled synthesizer, in which neural models operate as sound engines driven by explicit parameters and visual representations [42, 48], external controllers [29], or gestural input [33]. In this configuration, AI systems replace or extend traditional synthesis methods while remaining largely subordinate to human control, with an emphasis on responsiveness, stability, and performative playability. Such approaches are frequently extended to create embodied or augmented instruments, in which sound generation, control strategies, and physical interaction are tightly coupled, strengthening the immediacy of the performer–instrument relationship. Representative examples include AI-terity, developed by Tahiroğlu et al. [45], a deformable musical instrument whose audio synthesis engine is based on GANSynth, allowing performers to shape sound through physical deformation, and Stacco, proposed by Privato et al. [39], a digital musical instrument specifically designed to provide intuitive access to the latent space of a RAVE model via the displacement of magnetic objects on a tangible interface.
By contrast, a second category encompasses AI/ML systems designed as autonomous or semi-autonomous agents, characterized by a higher degree of generative independence. In these settings, models produce sound material or musical structures with minimal or indirect human intervention, often through high-level conditioning or initialization rather than continuous control. In this vein, Erdem et al. [19] proposed an agent-based audiovisual live processing instrument that monitors muscle and motion data streamed from a Myo armband worn on the performer's forearm. These systems are commonly framed as creative partners or co-performers, shifting the performer's role toward curation, supervision, or dialogue with the algorithm. As a result, such practices foreground broader questions of authorship, agency, and evaluation within musical performance and composition [27].
The growing adoption of NAS has been facilitated by the availability of creative coding environments and artist-oriented tools. Compared to the large number of contributions focused on model development in this field, relatively few have been deployed within software environments. Max/MSP has become a central platform for deploying neural audio models in musical contexts, with several widely used implementations such as RAVE [7] and DDSP [24] that are also distributed as audio plug-ins or standalone applications, lowering the barrier to entry for composers and producers accustomed to digital audio workstations.
Although several artists are sometimes involved in the projects described, collaborative dynamics are rarely explored in depth, as publications typically prioritize outcomes and creative production over the underlying processes. As a result, little is known about how collaborators interact within these artistic projects, highlighting the need for a focused analysis of their collective practices.
2.3 AI Art Studies
Alongside technical developments in NAS, a growing body of work examines how AI/ML technologies are adopted, interpreted, and contested by artists and researchers working at the intersection of art, music, and AI. Rather than treating AI solely as a set of computational tools, this literature approaches AI as a cultural, aesthetic, and political phenomenon, foregrounding questions of meaning-making, artistic agency, and creative practice.
Several influential studies document the experiences of AI-artists by relying on first-person and reflexive methodologies to articulate long-term engagements with ML in artistic work. Fiebrink and Sonami [23] document parallel perspectives of an artist and a technology designer, reflecting on decades of ML-based music practice to identify what makes such systems valuable in composition and performance, as well as how their usefulness evolves over time. Their dialogical format reveals artists’ implicit strategies for navigating uncertainty, training constraints, and system behavior, informing the design of ML-based musical tools. Similarly, Caramiaux and Donnarumma [9] adopt a research-through-practice approach and analyze a long-term collaboration showing how ML systems shift from tools to active performance participants, highlighting epistemological implications for hybrid art–science practices.
Beyond individual case studies, recent work has sought to situate AI-based music and instrument research within broader cultural and political frameworks. Jourdan and Caramiaux [27] conducted an interview-based qualitative inquiry within the NIME community to explore how practitioners engage with ML, revealing ambivalent adoption patterns and resistance to technological determinism. Kala et al. [28] extend prior research by conducting interviews with artists, including, but not limited to, musicians, and examine various frictions encountered when working with creative AI tools such as limitations of tools and resources, collaborative constraints, and shifts in agency and authorship. Divakaran et al. [14] broadened the discussion outside the Euro-Western context by interviewing Indian artists and foregrounding the socio-cultural contexts shaping their engagement with AI.
Despite a rapidly expanding ecosystem of NAS models, a significant limitation persists in both practice and discourse: the training of models itself remains under-examined. In practice, trainable synthesis models are distributed through research-oriented GitHub repositories or notebook-based workflows. Consequently, relatively few tools exist to support artists in shaping, understanding, or intervening in this process. This situation stands in marked contrast to earlier research on interactive machine learning for music, where dedicated systems, most notably platforms such as Wekinator [21], successfully lowered the barrier to training and experimenting with classification and regression models. While ethical and curatorial questions surrounding dataset construction have received considerable attention, little work has explicitly examined what occurs during the act of training from the artist's perspective. We argue that a more detailed and in-depth understanding of the creative dynamics unfolding during the model training process would provide valuable insights for designing tools that more effectively support artists in this task. Finally, to the best of our knowledge, no study has specifically examined the dynamics of collaboration within artistic collectives using AI models, despite such collaboration being commonplace in practice.
3 Method
3.1 Recruitment of the participants and ethical statement
Participants were selected from collaborative artistic projects involving ML–based sound synthesis for performance or music production. Projects were required to include at least two collaborators and to rely on sound synthesis models trained specifically for the work. The identification of participants was conducted in two stages. First, we surveyed artistic productions published, exhibited, or performed within music technology contexts, including festivals and conferences such as MUTEK, NIME, S+T+Arts, Ars Electronica, and AIMC. In a second stage, to mitigate the risk of overlooking relevant projects, we distributed a public call for participation through 12 mailing lists dedicated to ML, music technology, and artistic research. This process resulted in 13 participants across 8 projects. For some projects, only one member of the group was interviewed due to scheduling constraints; in each such case, this person was the main contributor to the training part of the project. All participants received detailed information about the study and provided informed consent, with the option to withdraw at any time. Table 1 summarizes the main information for the study. It lists the interviewed participants as well as other collaborators involved in each project. Interviewed individuals are identified in the findings using acronyms derived from their first and last names. Detailed descriptions of each project are provided in the Appendix along with an overview of stakeholder roles and the background of each interviewee. Each project description is followed by a URL linking to a page that provides additional information and includes illustrations to help visualize the developments carried out across the various projects. The projects involve a range of autoregressive and DDSP-based models commonly used in practice, along with diverse interaction modalities, including voice, dance, musical instruments, and controllers.
| Pieces | Interviewees | Model used | Interaction Type | Collaboration |
|---|---|---|---|---|
| Bla Blavatar vs Jaap Blonk | Jonathan Chaim Reus (JCR), Victor Shepardson (VS) | RAVE, Text-to-voice | Voice | Jaap Blonk (Poet, Performer) |
| Digitizing Chinese Erhu | Wenqi Wu (WW), Hanyu Qu (HQ) | DDSP | Gesture | |
| Enacteur | Farzaneh Nouri (FN), Hugo Liorret (HL) | Autoregressive handmade | Controller, Instrument | |
| Prelude | Sarah Nabi (SN) | RAVE | Gesture, Controller | Marie Bruand (Dancer) |
| Motherbird | Jack Armitage (JA) | RAVE | Instrument, Controller | Jessica Shand (flutes), Manuel Cherep (controller) |
| Approximations | Molly Jones (MJ) | WaveNet | Instrument | Louis Pino (percs), Matti Pulkki (accordion) |
| Latent Terrain Synthesis | Shuoyang Jasper Zheng (SJZ) | Music-to-Latent, Stable Audio Open | Controller | Keigo Yoshida (controller) |
| DAIM™ | Hugo Scurto (HS), Axel Chemla Romeu Santos (ACRS), Kevin Raharolahy (KR) | RAVE | Controller | |
3.2 Interviews
We conducted one interview per artistic project, involving multiple members when possible, or a single participant with substantial involvement in model training when necessary. Whenever feasible, interviews took place in a focus group format to encourage interaction and collective reflection. All interviews were semi-structured, conducted remotely via video conferencing tools, and lasted approximately one hour, beginning with general questions about the project, its objectives, and the choice of ML models for sound synthesis. The interviews then focused on exploring three key phases of the artistic projects: data collection (how datasets were constructed and curated), model training (the training process, challenges, and evaluation criteria), and collaboration (how participants interacted and communicated during data collection and training). All interviews were audio-recorded, transcribed locally using Whisper, and manually proofread for accuracy. Once validated, the audio recordings were permanently deleted.
3.3 Data analysis
We conducted a thematic analysis following Braun and Clarke's framework [4]. Interview transcripts were read multiple times, then independently coded by all authors using an inductive approach, with codes derived directly from the data [37]. The codes were compared, discussed, and consolidated, before being grouped into broader themes. Through iterative refinement and collaborative discussion, this process led to the identification of three themes, each comprising two to three subthemes, presented in Section 4.
4 Findings
Our analysis revealed three interconnected themes that shed light on how artists collaboratively engage with the training of neural audio synthesis models. First, artists bridge the epistemic and aesthetic gaps between ML systems and their own musical practices by developing new modes of communication, both within collectives and with the models themselves. Second, artists develop know-how through iterative, craft-like processes of refining and sculpting datasets. Third, musical engagement – through improvisation, performance, and collective evaluation – serves as the primary method for understanding, evaluating, and guiding models, transforming training into a dynamic, embodied, and collective practice.
4.1 Collectives Develop their Own Modes of Communication
This theme explores how artists confront the gap between reductive ML assumptions and embodied, culturally situated musical knowledge, prompting new forms of communication within collectives and with the model.
4.1.1 Epistemic and Aesthetic Mismatch: AI's Reduction of Embodied Musical Knowledge. Participants consistently described a mismatch between the epistemic assumptions embedded in ML models and the forms of musical knowledge mobilized in artistic practice. Sound synthesis systems often privilege abstract, quantifiable parameters, marginalizing embodied, culturally situated forms of musical knowledge. As a result, artists often experience AI not as neutral tools, but as systems that actively reshape creative possibilities. As JCR explains, the vocal synthesis model they employed encodes musical features according to criteria that artists themselves do not consider relevant, resulting in aesthetic biases:
They have implicit modeling of speech features, and anything around voice synthesis almost always involves some kind of F0 estimation. Jaap's music is not about pitch, and I wanted to make sure I didn't bias the model. JCR
This highlights how engineering conventions can constrain artistic intentions from the outset of model design and training. This reduction becomes even more pronounced when dealing with non-Western or traditional musical practices, where expressivity often lies in subtle variations. As reported by WW, working with traditional Chinese music:
Chinese music pitch is not exact. We value raw natural sound. We had to be careful because models assume a very different tuning grammar. If pitch and timbre are too stable, the instrument sounds dead to a professional erhu player. Traditional Chinese music includes many subtle noises, variations, and environment-dependent resonances. These are often one-time experiences that don't translate well into model training. WW
Artists appear to adopt two distinct attitudes toward this epistemic and aesthetic mismatch between optimization-driven learning and artistic intention. The first corresponds to what ACRS describes as a convergent approach: guiding the model towards a result or aesthetic that aligns with the artist's expectations, while negotiating with the technical constraints of the system to seek a compromise. MJ articulates this perspective through her collaboration with two instrumentalists, where models were intended to approximate each performer's style:
When I trained the models on his data, it took much longer to converge than with Matty's data. I created a WaveNet model for each performer, trained on their own data, using Google Cloud. Some generated sounds were still a bit strange, but most sounded like the performer. MJ
Artists strive to build models that capture or support their personal musical practice, and they underline the difficulty of translating between epistemic frameworks, particularly when communicating across artistic and technical domains.
By contrast, other artists deliberately move away from this principle of convergence, favoring a divergent approach. Such attitudes can even seek to displace the artist from their habitual sonic identity, using AI to generate forms that exceed or diverge from their established musical vocabulary. ACRS emphasizes this approach by deliberately embracing failure and incorporating artifacts as part of the work itself:
I really distinguish between a convergent use, which is like: AI is a tool, there's this positivist discourse where it reconstructs parameters extremely well, it's absolutely amazing, it has to work that way. And then there's a divergent approach. In fact, the divergent approach even includes artifacts and playful uses, and you integrate those into the evaluation of the object. And in our case it's completely divergent, because in the end we train a model on DAIM, but if it doesn't produce DAIM, we don't care. ACRS
The training of sound synthesis models is fundamentally based on optimization algorithms designed to drive the model toward convergence on a specific solution. As JA notes, these systems become particularly valuable when approached as “tools for misinterpretation and weird extrapolation”, opening up alternative aesthetic possibilities beyond convergent optimization goals.
4.1.2 Establishing Communication within the Collective to Align Meaning, Interpretation, and Artistic Intent. As shown above, ML models’ representations of sound often clash with artists’ aesthetic practices. When AI enters artistic collectives, technical vocabulary proves inadequate for communicating artistic intent and interpretation of model behavior. As SJZ notes, “explaining technical concepts in accessible terms is hard” and requires “constantly adapting the language”. Similarly, SN describes a “barrier in terms of vocabulary”, to the point that “the language itself, verbal language, was not a support for communication.”
We found that artists employ diverse strategies to improve communication within their collectives. In the piece entitled “Prelude”, a duo consisting of a musician and a dancer controls an ML model together through two modes of interaction: a computer and a controller for the musician, and movement for the dancer. SN explained that they needed to “communicate through examples rather than verbal explanation”. She developed a visual representation based on 2D Uniform Manifold Approximation and Projection (UMAP) to make data and model behavior intelligible:
We almost developed our own way of communicating. It's not telepathy, but a mode of communication specific to us that gradually emerged. What happened was that I projected my interface with the UMAP, because she really didn't understand the latent space at all, what was happening, what the descriptors were. Having a visual support, where you can see things moving, curves evolving, allowed her to see the correlation between a gesture I make and what happens on the interface. SN
Understanding is not achieved through verbal communication alone, but through interactions between embodied exploration, visual and auditory feedback.
One project developed a shared language in a particularly explicit form through a dedicated notation system. In “Bla Blavatar vs Jaap Blonk”, the artists used a custom real-time voice synthesis instrument (Tungnaa) trained on recordings of Jaap Blonk's poetic performances. Because Blonk's vocal practice was highly codified, JCR explains that they needed to “develop a new notation system”, which was continuously refined throughout the project:
The first notation system I proposed was very abstract. [...] It was describing physiological forms of the vocal tract so it's like open voice, closed voice, upper palate resonance, forward palate resonance, back palate resonance, and then he would had some phonetic cue, saying what to perform, maybe like a “u” and double dot or something like that, and then this long string of weird characters. And he was like: “this is too open. This is too flexible”. So I said, “Ok Jaap, can you make a set of symbols, like no more than 20 symbols, that would capture 95% of what you do, the sounds that you make?” JCR
These examples show that artistic collectives actively invent new visual, embodied, and symbolic languages that make collaboration possible, allowing meaning, intent, and aesthetic judgment to circulate between humans in ways that standard technical vocabularies cannot support.
4.1.3 Co-Constructing New Frameworks and Notations to Communicate with AI. These modes of communication not only facilitate collaboration among humans but also enable communication with the ML models themselves. SN used their UMAP visualization to directly shape the model training process through spatial and visual interaction rather than automation and fixed metrics: “I started dissecting each latent space to understand what we needed, what we should keep, what we shouldn't keep. And for that, I used our UMAPs, exploring them manually. That's how I re-trained the model.” The artist does not merely inspect the model but actively constructs an interpretive framework that allows them to decide what parts of the model's internal representations are meaningful, undesirable, or worth preserving.
Similarly, in other collectives, the construction of shared frameworks for communicating with AI is inseparable from cultural mediation. WW highlights how collaboration with HQ was essential to aligning the model's training process with the aesthetic and cultural values of Chinese traditional music:
Hanyu has helped me a lot especially in terms of cultural knowledge and the development of Chinese traditional culture. Her insights into what is unique in Chinese traditional music, including music theory and aesthetics, helped me understand how culture is embedded in sound. So we were very careful in how we trained the model, especially around pitch and expression. WW
In this case, training the model becomes a site where cultural knowledge is actively negotiated and inscribed into the system.
The notation system described by JCR to transcribe the musical style of Jaap Blonk is also employed in the training process, and more specifically in the creation of scores that serve as the foundation for data collection and model refinement: “With this phonetic alphabet, the collaboration loop would be, I would write a score, Jaap would record it, and then we would retrain the model or fine tune the model. [...] This notation might make the model more controllable.” Controllability does not emerge from direct parameter tuning, but from the establishment of a common language across performers, notation, and model. In this sense, notation is less a static representation than an operational language: it structures how the model learns.
Together, these examples show that co-constructing languages, notations, and interpretive frameworks is a central strategy through which collectives make AI models musically meaningful, culturally situated, and responsive to artistic intent.
4.2 Artists Develop Situated Knowledge Enabling Appropriation of Model Training
This section examines the situated practices through which artists gain actionable expertise to shape AI training processes.
4.2.1 Training models: an iterative process guided by aesthetic goals. For all collectives, training a model is never a linear process in which data collection is completed first, followed by a single round of training that produces a model immediately fit for use. In practice, participants rarely know in advance what to expect from a model, and because training is undertaken within an artistic context, model quality is not determined by technical benchmarks but by aesthetic and artistic criteria. As VS illustrates: “Individual training runs are opaque and unpredictable, but by repeatedly training, evaluating, and adjusting parameters or datasets, we could gradually guide the system in a useful direction.” Evaluation criteria emerge through embodied experimentation, making training an iterative feedback loop between data collection, training, and evaluation:
It's better to quickly make some shitty data and train a model than spend months trying to make the perfect dataset and then train. Because the training process and interacting with the result teaches you how to make the dataset. You need the full feedback loop. JA
JA emphasizes that experimentation is more than just refining outcomes; it is a critical step in which the model reveals its potential uses to the practitioner, shaping the evolution of the training process. To guide this process, evaluation is often collective. Members share examples, conduct in-person testing sessions, and discuss results together, as illustrated by MJ: “During training, I periodically pulled down the model, generated audio, listened, and sent examples to Matty and Louis. [...] I visited them around January 2023 with early sketches. They tried them, gave feedback (“this works, this doesn't”), and sometimes sent recordings. I revised based on that.” These collective feedback practices reinforce the idea that training is not an individual technical task, but a shared, conversational, and exploratory practice through which both the model and the artistic intentions are progressively shaped. Stopping this iterative process is itself an artistic decision, one that is difficult to quantify or formalize. In practice, pragmatic constraints such as time limitations imposed by residencies or production contexts often bring experimentation to an end. As SN explains, the interruption of experimentation was primarily dictated by time pressure: “That stopping point was mainly determined by time constraints, unfortunately, because we only had three months”.
The only exception to this iterative logic appears in the DAIM project, where the artistic approach often deliberately consists of using one-shot trained models regardless of the outputs they produce. Here, failure, artifacts, and malfunction are not signs of inadequacy but central components of the artistic evaluation itself, as ACRS explains: “We make a lot of sense of listening to the waste, to what is considered not to work”.
4.2.2 Sculpting the dataset. Artists engage in what can be described as a meticulous practice of dataset sculpting. Rather than treating the dataset as a fixed input prepared once and for all, collectives describe an ongoing process of manual sorting and refinement, work they consistently considered tedious and time-consuming. As JCR notes, “the limitation is the amount of labor required to clean up the datasets and notate and like go back and fix annotation errors and stuff like that. That takes a long time. and a lot of energy”. Similarly, SJZ describes this process as “trial-and-error” in which artists “spend a lot of time testing different configurations”. Concretely, this can consist of removing peripheral sounds like “ambient sound or other sound sources” (JA), discarding “sounds that were too quiet” (MJ), or pursuing a reduction and clarification of musical information, as articulated by WW: “I refined my dataset and recorded one technique per pitch range, trying to keep each sound as consistent as possible. I wanted to minimize the number of concepts the model had to learn. After retraining the model with this cleaner dataset, I was more satisfied”. Here, dataset sculpting operates as a form of conceptual compression, where artistic decisions about what matters musically are directly translated into the structure of the training data. In other cases, artists reshape the corpus by introducing specific sound qualities, as VS describes: “I suggested reducing long, repetitive utterances and adding more phonetic variation. The process evolved continuously.”
Crucially, this manual and empirical approach is not framed as a flaw to be eliminated, but as a form of creative craftsmanship. This hands-on process of dataset sculpting is clearly illustrated by SN's account:
I kept removing data little by little. It was very tedious and very time-consuming. And at the same time, it was enjoyable, I think, because there's still that craft-like, hands-on aspect to it. I think it was both frustrating and also quite pleasant to keep that manual side of things, to be able to say, “oh right, I matter in this process too.” It's not just the model learning everything on its own
SN's description emphasizes the affective ambivalence of the process, simultaneously “tedious” and “time-consuming”, yet also “enjoyable” and “pleasant”, suggesting that value lies not only in the resulting model but in the sustained interaction with the material itself. Importantly, this perspective underlines how such hands-on labor functions as a way of asserting artistic agency over an otherwise automated pipeline. By insisting that “I matter in this process too”, SN reframes training as a dialogic relationship between the artist and the model, resisting narratives in which learning is entirely delegated to the algorithm. JA's account further clarifies how artistic agency is exercised at the level of dataset design: “when you create the dataset, you establish which patterns you're interested in. Some patterns won't be semantic, but if you wanted to, you could make a flute dataset where you only play the note C. So the agency is treating it as pattern recognition: as an artist, I want to do X, Y, Z, what data patterns bring about that possibility space?”.
Although model training may appear to be a highly automated process with limited opportunities for direct artist intervention, our findings show that dataset sculpting serves as a crucial way for artists to reassert agency within this automation. By focusing attention on the selection, pruning, balancing, and transformation of training data, artists intervene upstream in the learning pipeline, where their actions have decisive influence over the model's behavior. Seen through this lens, dataset sculpting constitutes a key site of appropriation and a means of maintaining authorship, transforming training from a background operation into a performative and expressive component of the creative process.
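As an illustration only, one of the pruning steps mentioned by participants, discarding “sounds that were too quiet”, could be sketched as follows. This is not the procedure used by any of the interviewed collectives: the function names, the toy corpus, and the -40 dB threshold are assumptions introduced for the example, and in practice such a threshold would itself be tuned by ear.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of a clip in dB relative to full scale (1.0)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def sculpt(clips, min_level_db=-40.0):
    """Split named clips into (kept, discarded) according to loudness.

    `clips` maps a clip name to a list of float samples in [-1.0, 1.0].
    The default threshold is an arbitrary, illustrative value.
    """
    kept, discarded = {}, {}
    for name, samples in clips.items():
        target = kept if rms_dbfs(samples) >= min_level_db else discarded
        target[name] = samples
    return kept, discarded


# Hypothetical toy corpus: one clearly audible tone and one near-silent clip.
SAMPLE_RATE = 8000
clips = {
    "sustained_tone": [
        0.5 * math.sin(2 * math.pi * 110 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)
    ],
    "room_tone": [
        0.001 * math.sin(2 * math.pi * 50 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)
    ],
}
kept, discarded = sculpt(clips)
```

Returning the discarded clips rather than deleting them keeps the step reversible, which matches the participants' accounts of removing data “little by little” and listening to the result before going further.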
4.3 Musical Engagement as a Situated Method for Understanding and Guiding AI Models
This theme examines musical engagement as a way to (1) build a practical understanding of model properties and (2) evaluate and guide the training process. In this sense, training and evaluation are not external technical phases, but are embedded within rehearsals, improvisation, discussion, and performance.
4.3.1 Building a Practical Understanding of Model Properties. The sound synthesis models used by artists are often complex DL systems whose behaviors and limitations are not immediately transparent. Regardless of whether artists initially have technical expertise in ML, all participants develop a practical understanding of model properties through use. FN describes this learning process as inseparable from collective musical practice, where abstract ML concepts become intelligible:
Because, as I said, it was a new thing for me and I was trying to isolate all this very complex, you know, personal processes. Like, for example, training is a very complex process that one needs to understand. And for me, it was during this feedback or this musical discussion. [...] What I understood specifically was this idea of overfitting and underfitting, but in a very tangible sense. FN
Here, notions such as overfitting and underfitting are no longer treated as technical concepts, but as audible and experiential phenomena that contribute to building a tacit, practical form of ML knowledge. WW reports a similar form of learning grounded in listening and experimentation, where model behavior is understood through its sonic outcomes: “Overtraining can also degrade sound quality. If you train on C3–C5 and play C2, the output becomes noisy.” Training ranges, extrapolation limits, and generalization capacities are learned as material constraints that shape musical possibilities.
Similarly, artists experiment with the models’ ability to produce sounds that are more or less faithful to the training material, as well as with their capacity to generate diverse or unexpected outputs. For example, HS articulates a trade-off between variability and sound reproduction: “There wasn't much variability, which I actually found funnier. The model worked well for reconstruction, but exploring the latent space was actually less interesting for us”. SN describes how this understanding of the models’ affordances emerged gradually during rehearsals, guiding them to use various specialized models crafted with different datasets:
We had an initial phase where we trained on a large dataset that, in our view, contained a lot of diversity. But we realized by playing with that we couldn't actually control that diversity, it became a kind of large cacophony. In a second phase, we separated three complementary sound environments and trained a specific model for each one, in order to guarantee a quality of expression and a degree of expressive freedom for the performance. That's a constraint that emerged gradually over the course of training the models. SN
This realization reflects an experiential discovery that a single model could not meaningfully capture heterogeneous sound categories while maintaining effective control. Rather than being a design decision made in advance, model specialization emerged as a practical response to the limitations of RAVE models, revealed through use.
WW describes a complementary strategy to regain control over the system's expressivity. Rather than modifying the training corpus or architecture, the artist focused on shaping the interaction layer, “filtering motion signals, stabilizing gesture-to-sound mappings, and introducing different performance modes” to adapt the behavior of the instrument in real time. One such mode “quantized pitch to a scale but still allows glissando between notes”, enabling the performer to move between “tremolo, glissando, or more stable tones” by adjusting parameters and switching modes. As HQ notes, this “rule-based approach relies heavily on knowledge of instrumental technique”, illustrating how domain-specific performance knowledge can be embedded into control structures to make neural synthesis systems both expressive and playable.
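To make the idea of a mode that quantizes pitch to a scale while still allowing glissando more concrete, the sketch below shows one possible reading of it. It is an assumption for illustration, not WW's implementation: pitch is taken to be a continuous MIDI note number, and the names `scale_notes`, `quantize_with_glide`, and the `hold` parameter are hypothetical. Near a scale note the output is held on that note (a stable tone), and in the middle of each interval it glides linearly to the next note, so the mapping stays continuous.

```python
import bisect

MAJOR = (0, 2, 4, 5, 7, 9, 11)  # pitch classes of a major scale


def scale_notes(root=60, pitch_classes=MAJOR, octaves=2):
    """Return ascending MIDI notes of the scale, including the top octave."""
    notes = [root + 12 * o + pc for o in range(octaves) for pc in pitch_classes]
    return notes + [root + 12 * octaves]


def quantize_with_glide(pitch, notes, hold=0.3):
    """Map a continuous pitch onto scale plateaus joined by linear glides.

    `hold` is the fraction (0 to 0.5) of each interval, on either side of a
    scale note, within which the output snaps to that note.
    """
    if not 0.0 <= hold <= 0.5:
        raise ValueError("hold must lie between 0 and 0.5")
    if pitch <= notes[0]:
        return float(notes[0])
    if pitch >= notes[-1]:
        return float(notes[-1])
    i = bisect.bisect_right(notes, pitch) - 1
    low, high = notes[i], notes[i + 1]
    t = (pitch - low) / (high - low)  # position inside the interval, 0..1
    if t <= hold:
        return float(low)
    if t >= 1.0 - hold:
        return float(high)
    # Middle of the interval: glide linearly from the lower to the upper note.
    return low + (high - low) * (t - hold) / (1.0 - 2.0 * hold)


notes = scale_notes()
stable = quantize_with_glide(60.2, notes)   # held on the lower scale note
gliding = quantize_with_glide(61.0, notes)  # halfway through the glide
```

Setting `hold` to 0.5 gives hard quantization and setting it to 0 gives an unquantized glissando, so a single parameter could stand in for the switch between “more stable tones” and freer pitch movement.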
In sum, artists develop a practical, situated understanding of models’ properties and behaviors. This knowledge emerges through aesthetic judgment and musical engagement, as they navigate competing priorities: synthesis quality, sonic diversity, and model controllability.
4.3.2 Musical Practice as a Method to Evaluate Model Behavior and Guide the Training Process. For artists, musical practice serves as a method to evaluate ML models using situated, aesthetic, and embodied criteria, which, in turn, actively guide the training process. By playing with, listening to, and responding to the system, artists establish what counts as a successful or problematic behavior for them, and translate these judgments into concrete training decisions.
A recurrent configuration is that a performer within the collective explores the system through improvisation, while other collaborators attend to this interaction and interpret it as evaluative feedback. FN describes how feedback from her collaborator's felt experience of playing with the system directly informed retraining choices:
“it was mostly, we would do something, and I would ask, okay, did you feel like you were improvising? Was it following you too much? Was it not following you too much? And then these, I think the first version of Enacteur developed from all these feedback that I got, and 90% of them were from Hugo.” FN
Here, criteria such as feeling followed, excessive reactivity, or freedom to improvise function as tacit evaluation metrics. In other collectives, this evaluative process is more explicitly collective and relational. In the DAIM project, training is explored through what the artists describe as jamming with the model: three collaborators gathered around a single computer, experimenting live and reacting to each other's responses. Rather than pursuing fixed objectives, musical interaction serves to elicit reactions that function as shared indicators of value:
“In general, we're three people around one computer. It's very jam-like, but in front of each other. And because of that constraint, we tend to experiment live, which makes things even more chaotic [...] But since we're very oriented toward music production, we only have a single screen. So we sit next to each other, and usually, if Axel or Hugo laughs, that means I'm probably heading in the right direction” KR
In this setting, affective cues (laughter, surprise, shared excitement) act as informal but salient evaluative signals. Musical practice becomes a way of collectively calibrating what the system affords and what directions are worth pursuing. Training is less guided by abstract performance metrics than by the model's capacity to generate responses that resonate with the group's shared sensibilities. Evaluation is enacted socially, through musical interaction, and training decisions follow from these moments of collective recognition.
Performance itself can become the primary site of evaluation and guidance. JCR described a workflow in which dataset construction, model training, and performance are folded into a single continuous practice:
“I make a score, we perform it live, we record it live. That way all this work that's going a dataset becomes a performance. Whatever the latest model is that's what we perform with [...] It has also been really interesting in terms of influencing how I compose the pieces, because I need to compose them in a way that they have an arc to them. They have a story that they're telling. So I can't just randomly generate lines of notation. As I really think about how this tells a story to the audience, how it gives us sonic possibilities that make sense in a sort of meta-compositional structure. And also to communicate to the audience what dataset making looks like or sounds like” JCR
Here, performance conditions introduce narrative, temporal, and aesthetic constraints that become evaluative frameworks for both the model and the training process. The presence of an audience foregrounds questions of coherence, expressivity, and meaning, which shape decisions about what data is collected and how the system is refined. Rather than converging toward a finalized or optimized model, training is guided by the ongoing assessment of what the system can meaningfully support in performance. Evaluation and training are inseparable from artistic presentation.
Across these cases, musical practice operates less as a means of mastering the model than as a method for defining, testing, and refining criteria to evaluate it. Improvisation, jamming, and performance serve as embodied evaluation procedures through which artists assess how the system listens, responds, and transforms sounds.
5 Discussion
5.1 Collective Practices for Regaining Agency over AI Pipelines
The findings presented in the previous section and organized around three themes ultimately describe three key moments of collaboration within artist collectives. First, collaboration often crystallizes around the construction of shared modes of communication, in order to overcome the limitations of technical language and the formalism implicitly embedded in ML models. This effort becomes particularly salient given the automated nature of sound generation in DL systems: by moving from rule-based synthesis, where compositional rules are explicitly defined by the artist, to NAS, these rules are implicitly delegated to the model, resulting in a voluntary loss of direct control over sound production. Re-establishing forms of communication rooted in artistic judgment thus emerges as a necessary first step for artists to reclaim agency over the system.
The second key collaborative moment concerns the development of technical know-how, where the gradual refinement of datasets was frequently described by participants as craft-like. These practices often involve asymmetries of expertise, where hands-on technical work is carried out by one or a few individuals more expert in using AI models, leading to the distribution of roles within the collective. While technical knowledge may be concentrated, artistic decision-making remains collaborative, guiding how technical practices are oriented and evaluated. From this perspective, existing divisions of labor commonly found in non-AI artistic production are often reproduced rather than dissolved. Contrary to narratives that frame AI as democratizing music-making, our findings suggest that these processes remain complex and frequently require mediation through technical expertise.
The third collaborative moment unfolds through situated and embodied musical practice. Collective playing, improvisation, and performance function as primary means for evaluating model behavior and for steering training within the iterative feedback loop. These moments constitute a critical site where the collective reasserts control, using musical interaction to orient the model toward specific artistic outcomes and to negotiate its expressive boundaries.
These collective strategies for regaining agency over AI systems resonate with the critical analysis of generative AI music proposed by Morreale et al. [35]. The authors draw on Stiegler's concept of grammatisation [44] – a process that transforms continuities (e.g. gestures, sounds) into discrete elements (e.g. notation, writing, recording) – to argue that contemporary generative AI models enact reductive and normalising processes. Our work can be read as a situated response to these issues. By foregrounding training as a collaborative and interpretive practice, we show how artists actively negotiate and reconfigure these processes from within. In doing so, they develop new forms of grammatisation and associated know-how, through which musical knowledge is selectively formalized, rearticulated, and inscribed into the model. These practices align with Morreale et al.’s call to rethink generative AI beyond replication and normalisation, positioning training as a critical site where alternative epistemologies and aesthetic values can emerge and be actively shaped.
5.2 Supporting Interactive, Iterative Model Steering
Our findings point to the need to reconsider how model training is designed and supported within artistic practice. Rather than seeking large, all-purpose models, artists tend to prioritize controllability, playability, and expressive precision, which leads them toward iterative strategies of specialization. Training unfolds as a progressive process of dataset refinement and model adjustment, closer to a form of sculpting than to a single act of optimization. Despite its centrality to creative work, this craft-like engagement with models remains poorly supported, as training workflows are still largely restricted to individual, code-based environments and expert-oriented infrastructures that offer little support for collaboration, musical evaluation, or interpretability. This limitation is closely tied to a broader disconnect between technical representations of models and artistic ways of understanding sound. Training interfaces typically foreground parameters, losses, or embeddings that are legible to engineers but difficult for musicians to relate to their aesthetic intentions. Addressing this gap calls for a shift away from solitary, code-centered workflows toward shared rehearsal-oriented training spaces, where listening, discussion, experimentation, and retraining are tightly intertwined. Hereafter, we outline several directions for exploring the design of tools intended to better support artists’ creativity.
Because standard ML representations and vocabularies often fail to capture artistic intention, collectives actively invent their own communication systems to align meaning and interpretation. Creativity support tools should support custom grammars through labels, symbols or visual markers attached to datasets, latent spaces, or training timelines. Musical representations and notations should be able to support negotiation of meaning rather than enforcing standardized technical language.
Artists also develop their understanding of model affordances through iterative cycles of training, listening, and collective experimentation. Training interfaces could be developed to support rapid, low-cost iteration through fast retraining or fine-tuning, immediate auditory feedback, and reversible actions, following principles of Interactive Machine Learning [16, 22]. This perspective invites reconsideration of more frugal or lightweight models that may be less optimized for realism but far more interactive, acknowledging that perceptual fidelity is not always the primary artistic goal.
Finally, evaluation in these contexts is grounded in musical practice itself. Improvisation, rehearsal, and performance function as evaluative methods, relying on affective and embodied responses rather than formal technical metrics. Thus, training tools should ideally integrate live performance and rehearsal modes, allowing training and playing to coexist. Interfaces could capture non-technical feedback by tagging moments during improvisation that could be used to retrain models. More broadly, this implies replacing offline evaluation dashboards with performative evaluation spaces, where model assessment and refinement are continuous, situated, and inseparable from musical engagement.
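As a hedged sketch of what capturing such non-technical feedback might look like, the example below turns moments tagged during a recorded improvisation into audio segments that could be added to a retraining dataset. The `Tag` structure, the function name, and the window lengths are all hypothetical choices made for the illustration and are not drawn from any tool described by participants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    time_s: float  # position in the rehearsal recording, in seconds
    keep: bool     # True for "this works", False for "this doesn't"


def segments_for_retraining(tags, duration_s, before_s=4.0, after_s=2.0):
    """Return merged (start, end) windows around positively tagged moments.

    The window extends further back than forward because a reaction usually
    follows the sound that caused it. Both lengths are illustrative defaults.
    """
    windows = sorted(
        (max(0.0, tag.time_s - before_s), min(duration_s, tag.time_s + after_s))
        for tag in tags
        if tag.keep
    )
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # Overlapping windows are fused into a single segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical tags from a 60-second improvisation.
tags = [Tag(10.0, True), Tag(13.0, True), Tag(40.0, False), Tag(58.0, True)]
segments = segments_for_retraining(tags, duration_s=60.0)
```

Negative tags are ignored here, but they could equally be used to mark material for removal, in line with the dataset sculpting practices reported in Section 4.2.2.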
6 Limitations
In this study, we interviewed 13 artists working in musical performance, composition, and sound art, most of whom are connected to scientific communities focused on AI for music. This reflects the marginal adoption of these tools within music communities, but also underlines a bias in our recruitment strategy, which targeted academic publications, research-driven festivals, and open calls primarily reaching scientific networks. For some projects, only a single member of the collective was interviewed; in such cases, we took care to interview participants who were the most directly involved in the model training process. Our sample overrepresents autoencoder-based sound synthesis models, particularly RAVE. This broader adoption can be attributed to RAVE's explicit design for real-time use and the availability of supporting tools such as Max/MSP integrations and DAW plugins. Additionally, while some projects address non-Western musical practices, our sample is predominantly Western. Further research is needed to include perspectives from the Global South and a broader range of cultural identities.
7 Conclusion
This paper has examined how artists collaboratively engage with the training of neural audio synthesis models, reframing it as a situated, craft-based, and collective practice rather than a purely technical task. Through interviews, we showed how training becomes a site where aesthetic intentions, embodied musical knowledge, and technical constraints are negotiated through shared languages, dataset sculpting, and musical engagement. By foregrounding training as a moment of creative agency and collaboration, our findings challenge prevailing tool-centric accounts of AI-based music systems and point toward the need for more interactive, collaborative, and musically grounded approaches to model training. While our study is rooted in AI-based music, it also points to broader implications for AI research. Artistic approaches to model training, characterized by rapid, iterative experimentation and willingness to work with imperfect data and intermediate outputs, offer alternative perspectives that may complement conventional training methodologies. These insights invite exploration into how artists’ practices can inform more human-centered approaches to AI development.
Acknowledgments
This work was supported by a French government grant managed by the Agence Nationale de la Recherche as part of the France 2030 program, reference ANR-22-EXEN-0004 (PEPR eNSEMBLE / PC3 MATCHING). We want to express our sincere gratitude to the artists and researchers interviewed for their time and their thoughts.
References
- Jack Armitage, Victor Shepardson, and Thor Magnusson. 2024. Tölvera: Composing With Basal Agencies. In Proceedings of the International Conference on New Interfaces for Musical Expression, S M Astrid Bin and Courtney N. Reed (Eds.). Utrecht, Netherlands, Article 42, 10 pages. https://doi.org/10.5281/zenodo.13904854
- Federico Bomba and Antonella De Angeli. 2025. Agency and authorship in AI art: Transformational practices for epistemic troubles. International Journal of Human-Computer Studies 205 (2025), 103652. https://doi.org/10.1016/j.ijhcs.2025.103652
- Matej Božić and Marko Horvat. 2024. A Survey of Deep Learning Audio Generation Methods. arxiv:2406.00146 [cs.SD] https://arxiv.org/abs/2406.00146
- Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
- Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. 2019. Deep Learning Techniques for Music Generation – A Survey. arxiv:1709.01620 [cs.SD] https://arxiv.org/abs/1709.01620
- Antoine Caillon and Philippe Esling. 2021. RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arxiv:2111.05011 [cs.LG] https://arxiv.org/abs/2111.05011
- Antoine Caillon and Philippe Esling. 2022. Streamable Neural Audio Synthesis With Non-Causal Convolutions. arxiv:2204.07064 [cs.SD] https://arxiv.org/abs/2204.07064
- Baptiste Caramiaux, Alessandro Altavilla, Jules Françoise, and Frédéric Bevilacqua. 2022. Gestural Sound Toolkit: Reflections on an Interactive Design Project. In NIME: Proceedings of the International Conference on New Interfaces for Musical Expression. Auckland, New Zealand. https://hal.science/hal-03800322
- Baptiste Caramiaux and Marco Donnarumma. 2020. Artificial Intelligence in Music and Performance: A Subjective Art-Research Inquiry. arxiv:2007.15843 [cs.HC] https://arxiv.org/abs/2007.15843
- Baptiste Caramiaux and Atau Tanaka. 2013. Machine Learning of Musical Gestures. 513–518.
- Ondrej Cifka, Alexey Ozerov, Umut Simsekli, and Gael Richard. 2021. Self-Supervised VQ-VAE for One-Shot Music Style Transfer. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. https://doi.org/10.1109/icassp39728.2021.9414235
- Nils Demerlé, Philippe Esling, Guillaume Doras, and David Genova. 2024. Combining audio control and style transfer using latent diffusion. arxiv:2408.00196 [cs.SD] https://arxiv.org/abs/2408.00196
- Junwei Deng, Xirui Jiang, Shiyuan Zhang, Shichang Zhang, Himabindu Lakkaraju, Ruijiang Gao, Chris Donahue, and Jiaqi W. Ma. 2025. Computational Copyright: Towards A Royalty Model for Music Generative AI. arxiv:2312.06646 [cs.AI] https://arxiv.org/abs/2312.06646
- Ajay Divakaran, Aparna Sridhar, and Ramya Srinivasan. 2023. Broadening AI Ethics Narratives: An Indic Art View. In 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’23). ACM, 2–11. https://doi.org/10.1145/3593013.3593971
- Chris Donahue, Julian McAuley, and Miller Puckette. 2019. Adversarial Audio Synthesis. arxiv:1802.04208 [cs.SD] https://arxiv.org/abs/1802.04208
- John J. Dudley and Per Ola Kristensson. 2018. A Review of User Interface Design for Interactive Machine Learning. ACM Transactions on Interactive Intelligent Systems 8, 2 (June 2018), 8:1–8:37. https://doi.org/10.1145/3185517
- Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. 2019. GANSynth: Adversarial Neural Audio Synthesis. arxiv:1902.08710 [cs.SD] https://arxiv.org/abs/1902.08710
- Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, and Adam Roberts. 2020. DDSP: Differentiable Digital Signal Processing. arxiv:2001.04643 [cs.LG] https://arxiv.org/abs/2001.04643
- Cagri Erdem and Benedikte Wallace. 2022. CAVI: A Coadaptive Audiovisual Instrument–Composition. https://doi.org/10.21428/92fbeb44.803c24dd
- Sidney Fels and Geoffrey Hinton. 1995. Glove-TalkII: an adaptive gesture-to-formant interface. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Denver, Colorado, USA) (CHI ’95). ACM Press/Addison-Wesley Publishing Co., USA, 456–463. https://doi.org/10.1145/223904.223966
- Rebecca Fiebrink and Perry Cook. 2010. The Wekinator: A System for Real-time, Interactive Machine Learning in Music. Proceedings of The Eleventh International Society for Music Information Retrieval Conference (ISMIR 2010) (01 2010).
- Rebecca Fiebrink, Perry R. Cook, and Dan Trueman. 2011. Human model evaluation in interactive supervised learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Vancouver, BC, Canada) (CHI ’11). Association for Computing Machinery, New York, NY, USA, 147–156. https://doi.org/10.1145/1978942.1978965
- Rebecca Fiebrink and Laetitia Sonami. 2020. Reflections on Eight Years of Instrument Creation with Machine Learning. In Proceedings of the International Conference on New Interfaces for Musical Expression, Romain Michon and Franziska Schroeder (Eds.). Birmingham City University, Birmingham, UK, 237–242. https://doi.org/10.5281/zenodo.4813334
- Francesco Ganis, Erik Frej Knudsen, Søren V. K. Lyster, Robin Otterbein, David Südholt, and Cumhur Erkut. 2021. Real-time Timbre Transfer and Sound Synthesis using DDSP. arxiv:2103.07220 [cs.SD] https://arxiv.org/abs/2103.07220
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. arxiv:1406.2661 [stat.ML] https://arxiv.org/abs/1406.2661
- Ben Hayes, Jordie Shier, György Fazekas, Andrew McPherson, and Charalampos Saitis. 2024. A review of differentiable digital signal processing for music and speech synthesis. Frontiers in Signal Processing 3 (2024). https://doi.org/10.3389/frsip.2023.1284100
- Théo Jourdan and Baptiste Caramiaux. 2023. Culture and Politics of Machine Learning in NIME: A Preliminary Qualitative Inquiry. In New Interfaces for Musical Expression (NIME). Mexico, Mexico. https://hal.science/hal-04075438
- Anna-Kaisa Kaila, André Holzapfel, and Petra Jääskeläinen. 2024. Gardening frictions in creative AI: Emerging art practices and their design implications. In 15th International Conference on Computational Creativity, Jun 17-Jun 21 2024, Jönköping, Sweden.
- Chris Kiefer and Andrea Martelloni. 2025. Towards an ecosystem of instruments of tunable machine learning. (9 2025). https://sussex.figshare.com/articles/conference_contribution/Towards_an_ecosystem_of_instruments_of_tunable_machine_learning/31123192
- Diederik P. Kingma and Max Welling. 2019. An Introduction to Variational Autoencoders. Foundations and Trends® in Machine Learning 12, 4 (Nov. 2019), 307–392. https://doi.org/10.1561/2200000056
- Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. arxiv:2009.09761 [eess.AS] https://arxiv.org/abs/2009.09761
- Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. arxiv:1612.07837 [cs.SD] https://arxiv.org/abs/1612.07837
- Joseph Meyer, Nick Bryan-Kinns, Sarah Fdili Alaoui, Mick Grierson, and Rebecca Fiebrink. 2025. Interactive Movement-to-Audio with Pre-Trained Neural Networks. In Proceedings of the 2025 Conference on Creativity and Cognition (C&C ’25). Association for Computing Machinery, New York, NY, USA, 491–493. https://doi.org/10.1145/3698061.3734415
- Ibomoiye Domor Mienye, Theo G. Swart, and George Obaido. 2024. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Information 15, 9 (2024). https://doi.org/10.3390/info15090517
- Fabio Morreale, Marco A. Martinez-Ramirez, Raul Masu, WeiHsiang Liao, and Yuki Mitsufuji. 2025. Reductive, Exclusionary, Normalising: The Limits of Generative AI Music. Transactions of the International Society for Music Information Retrieval (Sep 2025). https://doi.org/10.5334/tismir.256
- Ali Nikrang and Susanne Kiesenhofer. 2025. Human–AI Co-Creation in Contemporary Composition: Interaction and Artistic Strategies with Ricercar. In Proceedings of the Conference on Animation and Interactive Art (Expanded ’25). Association for Computing Machinery, New York, NY, USA, 65–73. https://doi.org/10.1145/3749893.3749961
- Lorelli S. Nowell, Jill M. Norris, Deborah E. White, and Nancy J. Moules. 2017. Thematic Analysis: Striving to Meet the Trustworthiness Criteria. International Journal of Qualitative Methods 16, 1 (2017), 1609406917733847. https://doi.org/10.1177/1609406917733847
- Ender Özcan and Türker Erçal. 2008. A Genetic Algorithm for Generating Improvised Music. In Artificial Evolution, Nicolas Monmarché, El-Ghazali Talbi, Pierre Collet, Marc Schoenauer, and Evelyne Lutton (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 266–277.
- Nicola Privato, Victor Shepardson, Giacomo Lepri, and Thor Magnusson. 2024. Stacco: Exploring the Embodied Perception of Latent Representations in Neural Synthesis. In Proceedings of the International Conference on New Interfaces for Musical Expression, S M Astrid Bin and Courtney N. Reed (Eds.). Utrecht, Netherlands, Article 62, 8 pages. https://doi.org/10.5281/zenodo.13904899
- Flavio Schneider. 2023. ArchiSound: Audio Generation with Diffusion. arXiv:2301.13267 [cs.SD] https://arxiv.org/abs/2301.13267
- Ilana Shapiro and Mark Huber. 2021. Markov Chains for Computer Music Generation. Journal of Humanistic Mathematics 11 (July 2021), 167–195. https://doi.org/10.5642/jhummath.202102.08
- Victor Shepardson, Halla Steinunn Stefánsdóttir, and Thor Magnusson. [n. d.]. Evolving the Living Looper: Artistic Research, Online Learning and Tentacle Pendula. https://api.semanticscholar.org/CorpusID:280314396
- Luke Stark and Kate Crawford. 2019. The Work of Art in the Age of Artificial Intelligence: What Artists Can Teach Us About the Ethics of Data Practice. Surveillance & Society 17 (Sept. 2019), 442–455. https://doi.org/10.24908/ss.v17i3/4.10821
- Bernard Stiegler. 1998. Technics and Time. Stanford University Press, Stanford, Calif.
- Koray Tahiroğlu, Miranda Kastemaa, and Oskar Koli. 2020. AI-terity: Non-Rigid Musical Instrument with Artificial Intelligence Applied to Real-Time Audio Synthesis. In Proceedings of the International Conference on New Interfaces for Musical Expression. Zenodo, 337–342. https://doi.org/10.5281/zenodo.4813402
- Charlotte Truchet and Gérard Assayag. 2011. Constraint programming in music. ISTE-Wiley.
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499 [cs.SD] https://arxiv.org/abs/1609.03499
- Gabriel Vigliensoni and Rebecca Fiebrink. 2023. Steering latent audio models through interactive machine learning. https://doi.org/10.5281/zenodo.8087978
- Gabriel Vigliensoni, Louis McCallum, and Rebecca Fiebrink. 2020. Creating latent spaces for modern music genre rhythms using minimal training data. (2020).
- Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. 2017. MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation. arXiv:1703.10847 [cs.SD] https://arxiv.org/abs/1703.10847
- Shuoyang Jasper Zheng, Keigo Yoshida, Nico García-Peguinho, Jiatong Liu, Dan Hearn, Anna Xambó Sedó, and Nick Bryan-Kinns. 2026. Latent Terrain: Adapting Neural Audio Autoencoders as Design Materials in NIME. In Proceedings of the International Conference on New Interfaces for Musical Expression. London, UK.
A Project information
A.1 Bla Blavatar vs Jaap Blonk
This piece is an experimental collaboration between artist Jonathan Reus and sound poet Jaap Blonk exploring voice dataset making and voice improvisation, focusing on situational absurdity, improvisation and bespoke scoring systems. In an uncanny duet, Reus and Blonk call and respond to one another in dynamic crescendos. Blonk's real voice plays against, with, within, and without a computer model of his own voice. The model, a bespoke real-time performance system called Tungnaa developed in collaboration with Victor Shepardson, collapses transients and fragments of vocalization into relentless sound forms. Bla Blavatar vs Jaap Blonk emphasizes liveness, and takes iterative growth as the basis for an autoregressive musical cycle, with remnants of all previous performances carrying forward into the next. The dataset grows, the next model is different, the next score is generated: a process of organic development in a perpetually unfinished vocal performance. Jaap Blonk was not directly involved in the technical development. His involvement was indirect, in that his repertoire of sounds and his ability to perform the scores produced shaped much of how the scores and notation systems developed. Jonathan Reus and Victor Shepardson developed the software together. Reus's work was split between data work and software development, whereas Shepardson's was split between software development and model training. Reus has a background in electronic music composition and fundamental AI, and he started to focus on voice and AI in 2018. Shepardson holds a PhD in new musical instruments and the phenomenology of AI and is a musician who performs experimental music. 3
A.2 Digitizing Chinese Erhu
This piece uses a gesture-controlled digital Erhu system that merges traditional Chinese instrumental techniques with contemporary ML and interactive technologies. By leveraging the Erhu's expressive techniques, the artists develop a dual-hand spatial interaction framework using real-time gesture tracking. Hand movement data is mapped to sound synthesis parameters to control pitch, timbre, and dynamics, while a differentiable digital signal processing (DDSP) model, trained on a custom Erhu dataset, transforms basic waveforms into an authentic timbre that remains faithful to the instrument's nuanced articulations. The system bridges traditional musical aesthetics with digital interactivity, emulating Erhu bowing dynamics and expressive techniques through embodied interaction. Wenqi Wu developed the model and interactive interface and also performed the piece, while Hanyu Qu contributed cultural and instrumental expertise during the early stages of the project, including initial development and prototype testing. 4
A.3 Énacteur
Énacteur is an AI-driven improviser. It is programmed to actively participate in electroacoustic improvisational settings, fulfilling real-time analysis, decision-making, sound generation, and spatialization tasks. The artistic motivation is to expand the possibilities of electroacoustic composition via collaborative methods and to enrich sonic textures and compositional structures by recognizing the aesthetic affordances of human-machine interaction. Énacteur can listen to audio signals, extract audio descriptors, make a compositional decision according to the descriptors, and generate sound in real time without the necessity of human intervention. The synthesis techniques employed by Énacteur include waveform generation, frequency modulation synthesis, and granular synthesis. In addition, Énacteur uses the data from machine listening to cross-synthesize and the extracted data from analysis to synthesize sounds. For the performance, Hugo Lioret contributed computer music, using computer-generated sounds and processed field recordings as material. Énacteur was then trained with 622 samples before the rehearsal. Hugo's sounds in this performance covered a wide range of variations, including dynamic, spectral and timbral fluctuations throughout the improvisation 5. The project (2021-2023) had three main stakeholders: Farzaneh Nouri, the developer and artistic author of Énacteur; Hugo Lioret, the main performing collaborator, whose creative input and feedback informed the system's early development and evaluation; and Bjarni Gunnarsson, whose mentorship during Nouri's studies at the Institute of Sonology had a strong influence on the project's trajectory. Farzaneh Nouri has been working with machine learning since 2020, primarily in the context of live performance and free improvisation. Hugo Lioret has a background in music composition and electroacoustic improvisation.
As he had no prior experience with AI, he chose to preserve his exchanges with Nouri to focus on an “ideal-type” listener—one unfamiliar with the system and thus engaging through symbolic and aesthetic listening alone. By withholding information and privileging listening during improvisation and training sessions, he fostered a fundamentally acousmatic, programmatic approach aimed at foregrounding musical poetics.
A.4 Prelude
Prelude is a dance and music performance in which the dancer Marie Bruand can be seen generating music through her movements. Immersed in a contemporary soundscape, we witness how the dancer tames this musical body, thus rediscovering the close bond between dance and music. The artists implemented their own motion-sound interactive system in Max/MSP. They used R-IoT IMU motion sensors composed of accelerometers and gyroscopes with the MuBu library and the Gestural toolkit for real-time motion capture and analysis. For deep audio generation, they relied on the RAVE model, which enables fast and high-quality audio waveform synthesis in real time. The piece unfolds a metaphorical "liberation" of the dancer's body. Connected at the beginning of the piece with fake cables, the dancer progressively embraces a new "musical body". The piece stages diverse qualities of embodied exploration of sound spaces as she navigates them through her movements under the guidance of the musician. The performance is structured into three parts, one for each exploration method, in the following order (I2, I1, I3). Although each part is structured by the choice of a specific interaction method and audio spaces, it contains a varying degree of improvisation for both the dancer and the musician, who interact through and with the system 6. Sarah Nabi led the technical development of the tools, as well as the design and training of the models. Marie Bruand contributed to the model training through collaborative experimentation. The performance of the piece was carried out by both stakeholders.
A.5 Motherbird
Motherbird for flutes, electronics, and artificial life simulation reimagines the centuries-old flute-as-bird archetype in a 21st-century context in which anthropogenic climate change has drastically altered the soundscapes of the natural world. An early instantiation of the Tölvera system [1] is a mode of score creation which challenges entangling human performers and musical instruments with artificial life simulations. The piece positions the flute as one organism within an indeterminate global ecosystem in which changing sonic textures mediate the flocking behaviors of birds, or Boids, in real time. In MOTHERBIRD, reality is front-and-center, urging listeners to engage critically in becoming-with non-human and more-than-human worlds. The work represents the first iteration of an international collaborative effort toward a new piece for augmented flute by Jessica Shand, live electronics, and artificial life simulation, with visuals by Lingdong Huang, live programming by Jack Armitage, and electronics by Manuel Cherep 7. Shand, Huang, Armitage, and Cherep contributed equally to the project, both technically and musically. The interviewee, Jack Armitage, has a background in composition and music production as well as creative AI for musical performance and instrument design.
A.6 Approximations
Approximations for accordion, percussion, and neural networks is a collection of five short movements featuring the sounds of two audio-generating neural networks (WaveNets) trained on the sounds of the two performers, Louis Pino and Matti Pulkki. The neural networks (named The Accordionator and The Great Percussionist In The Sky by GPT-3) are AIs who have only ever had one sensory input: the sound of their assigned performer. They're quirky robots trying to imitate the performers who are, in turn, trying to interact with and imitate the robots. Movement i, “from the fog of randomness,” mimics the initial stage of training a neural network. At the beginning of the training process, the network generates statistical noise. Over hundreds of thousands of training steps, the network learns to generate sounds that more and more resemble the instruments. Movement ii, “making friends with the robots,” takes us from rumbling machine noise to a robot dance party. The robots and performers become friends as they create a beat together, closing with a courteous, courtly dance. Movement iii, “call + response,” gives the performers a chance to imitate the robots who have been trained to imitate them. Movement iv, “mutual listening,” is a speculative representation of the inner/spiritual experience of the networks. These AIs aren't conscious, though. Or are they? The title of movement v, “educated guessing,” refers to the way networks generate audio. A network looks at a chunk of audio samples and guesses what samples should come next 8. In terms of roles in the project, Molly Jones was the composer and selected and trained the models, while Louis Pino and Matti Pulkki were the performers and the data contributors. Molly Jones has a background in music composition and improvisation, and completed a PhD focused on machine learning for creative audio applications.
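The "educated guessing" described for movement v can be illustrated with a minimal autoregressive loop. This is a sketch under stated assumptions, not the WaveNet models used in the piece: `toy_next_sample_probs` is a hypothetical stand-in for a trained network, audio is reduced to 8 quantization levels, and the context size is arbitrary.

```python
import random
from typing import List, Optional, Sequence


def toy_next_sample_probs(context: Sequence[int], levels: int) -> List[float]:
    """Hypothetical stand-in for a trained network (assumption, not WaveNet).

    Returns a probability for each quantized amplitude level, favouring
    levels close to the most recent sample in the context.
    """
    last = context[-1]
    weights = [1.0 / (1 + abs(level - last)) for level in range(levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(
    seed: Sequence[int],
    n_steps: int,
    context_size: int = 4,
    levels: int = 8,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Autoregressive generation: look at a chunk of samples, guess the next."""
    if not seed:
        raise ValueError("seed must contain at least one sample")
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    samples = list(seed)
    for _ in range(n_steps):
        context = samples[-context_size:]          # the "chunk" being looked at
        probs = toy_next_sample_probs(context, levels)
        guess = rng.choices(range(levels), weights=probs, k=1)[0]
        samples.append(guess)                      # the guess joins the context
    return samples


if __name__ == "__main__":
    print(generate([3, 3, 3, 3], n_steps=12))
```

Each guessed sample is appended to the sequence and becomes part of the context for the next guess, which is what makes the process autoregressive.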
A.7 Latent Terrain Synthesis
Latent terrain is a tool to build corpus-based sound spaces/maps/materials to steer neural audio autoencoders/codecs (such as RAVE, Stable Audio Open (codec), Music2Latent). A terrain is a surface map for the autoencoder's latent space: it takes coordinates in a control space as inputs and produces continuous real-time latent trajectories that can be used for sound synthesis. Latent terrain aims to open up the creative possibilities of latent space navigation, allowing one to adapt the latent space of an autoencoder to easy-to-navigate interfaces (such as gestural controllers, stylus and tablets, XY-pads, and more), explore it like walking on a terrain surface, and build new musical instruments that compose and interact with AI audio generators 9. The artists Shuoyang Jasper Zheng and Keigo Yoshida collaborated on a performance called Repressive Terrain [51], in which the audience's time-varying electroencephalography (EEG) data is sonified into a real-time synthesised soundscape. It exploits the interactive sound space adaptation aspect of the Latent Terrain toolkit to create a real-time evolving soundscape. The ideation and conceptualisation of each work was done by the artist-researcher Yoshida after their first meeting, whereas Zheng developed Latent Terrain and acted as a facilitator who briefs and explains the use of the tool, signposts relevant resources and documentation, and provides technical support throughout the process. Zheng has a background in technical AI/ML development and has been building AI music tools for two years. He also has a background as an electronic music producer.
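The idea of a terrain as a map from control coordinates to latent vectors can be sketched as follows. This is an illustrative assumption only and not the Latent Terrain toolkit's implementation: the mapping here is a simple inverse-distance weighting over a few hypothetical anchor points, each paired with a made-up latent vector, and the names `terrain` and `trajectory` are invented for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def terrain(
    point: Point,
    anchors: Sequence[Point],
    latents: Sequence[Sequence[float]],
    power: float = 2.0,
) -> List[float]:
    """Map a 2D control coordinate to a latent vector.

    Hypothetical mapping (assumption): inverse-distance weighting of the
    latent vectors attached to anchor points in the control space.
    """
    if not anchors or len(anchors) != len(latents):
        raise ValueError("anchors and latents must be non-empty and equal length")
    weights = []
    for (ax, ay), z in zip(anchors, latents):
        d2 = (point[0] - ax) ** 2 + (point[1] - ay) ** 2
        if d2 == 0.0:
            return list(z)  # exactly on an anchor: return its latent vector
        weights.append(d2 ** (-power / 2))
    total = sum(weights)
    dim = len(latents[0])
    return [sum(w * z[i] for w, z in zip(weights, latents)) / total
            for i in range(dim)]


def trajectory(
    path: Sequence[Point],
    anchors: Sequence[Point],
    latents: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Turn a path in control space (e.g. XY-pad positions) into latent vectors."""
    return [terrain(p, anchors, latents) for p in path]


if __name__ == "__main__":
    anchors = [(0.0, 0.0), (1.0, 0.0)]
    latents = [[0.0, 0.0], [2.0, 4.0]]
    for z in trajectory([(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)], anchors, latents):
        print(z)
```

In an actual instrument, each latent vector along the trajectory would be passed to the autoencoder's decoder to synthesize audio; that step is omitted here.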
A.8 DAIM™
DAIM™ is a transdisciplinary music project initiated in 2016 by artist-researchers Axel Chemla–Romeu-Santos, Kevin Raharolahy and Hugo Scurto, blending electronic music and sound art practices through post-internet memeification and maximalist aesthetics. Their music productions draw from beat-based, hyperpop and experimental music, but also from video games, doomscrolling and advertising. Their live performances were featured at Ars Electronica 24h Rave (2021) and ACIDS Workshop Gamma (2023), and mix codes of clubbing and improvised music by immersing audiences in musical cut-ups of miscellaneous genres and memetic audio/visuals, reminiscent of Internet aesthetics. Their recent work focused on interconnecting AI music with sound art practice by steering AI music tools outside their engineered aesthetics. To this end, they practiced with both prompt-based AI audio generators from the industry and custom deep generative models for real-time interaction. 10 Model training in the project is primarily carried out by Axel Chemla–Romeu-Santos. The exploration of these models, as well as the performance, is more collaborative. Hugo Scurto and Axel Chemla–Romeu-Santos are both artists and researchers with a technical background in AI for music, while Kevin Raharolahy is a designer.
Footnotes
2 https://github.com/openai/whisper
3 https://jonathanreus.com/portfolio/bla-blavatar-vs-jaap-blonk/
4 https://nime2025.org/proceedings/222.html
5 https://sonology.org/wp-content/uploads/2025/04/Enacteur_Composing-with-artificial-improvisers-.pdf
6 https://ircam-ismm.github.io/embodied-latent-exploration/
7 https://www.jessicashand.com/portfolio/motherbird-4gall
8 https://tapirlab.music.utoronto.ca/original-works/
9 https://jasper-zheng.github.io/nn_terrain/
10 https://aimc2024.pubpub.org/pub/d9jivdh6/release/1
This work is licensed under a Creative Commons Attribution 4.0 International License.
C&C '26, London, United Kingdom
© 2026 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-2583-8/26/07.