From Deepfakes to Deep Tones: Crafting Ethical Soundscapes with AI

Abstract waves of green, yellow, and blue. Photo by Logan Voss (Unsplash), edited/rendered by gpt-image-1.

The mathematics of sound manipulation has always fascinated me: how a simple Fourier transform can decompose a human voice into frequencies, and how those frequencies, once understood, can be reconstructed into something entirely artificial yet eerily authentic. The stakes of that capability became starkly apparent this October, when OpenAI paused AI-generated depictions of Martin Luther King Jr. on its Sora platform after public outcry over disrespectful videos. The incident exposed a fundamental tension: our algorithms have learned to mimic voices and faces with such precision that we must now decide which voices deserve digital immortality, and under whose authority.
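
To make that decomposition concrete, here is a minimal sketch of analysis and resynthesis with the Fourier transform. It uses a synthetic three-harmonic tone as a stand-in for recorded speech, but the same round trip applies to any waveform:

```python
import numpy as np

# Toy "voice-like" signal: a fundamental plus two harmonics. A real voice
# would be loaded from audio; a synthetic tone keeps the sketch self-contained.
sr = 16_000                                   # sample rate in Hz
t = np.arange(sr) / sr                        # one second of samples
voice = sum(a * np.sin(2 * np.pi * f * t)
            for f, a in [(220, 1.0), (440, 0.5), (660, 0.25)])

# Decompose: the real FFT maps the waveform onto its constituent frequencies.
spectrum = np.fft.rfft(voice)

# Any edit to `spectrum` (boosting harmonics, shifting formants) changes the
# reconstructed "voice"; here we reconstruct it unchanged.
reconstructed = np.fft.irfft(spectrum, n=len(voice))

# Analysis and resynthesis are two sides of the same transform:
# the round trip is numerically near-perfect.
print(np.max(np.abs(voice - reconstructed)))   # on the order of 1e-12
```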

The Architecture of Deception

The MLK deepfake wasn't just another technical demonstration gone wrong—it revealed how quickly synthetic media can weaponize history against itself. Within days of Sora's late September launch, users created videos showing King making monkey noises during his "I Have a Dream" speech, stealing from grocery stores, and fleeing police—racist stereotypes animated by the same neural networks that could preserve his legacy. Dr. Bernice A. King, CEO of The King Center and MLK's youngest daughter, called the content "dishonoring, deplorable, and disrespectful," pleading with people to stop sending her AI-generated videos of her father.

What troubles me isn't the capability itself but our industry's tendency to build first and consider consequences later. The technology behind these synthetic reproductions relies on deep learning systems trained on vast datasets, learning not just phonemes and cadences but the subtle emotional inflections that make a voice uniquely human. Modern voice synthesis has evolved dramatically—from WaveNet's dilated convolutions modeling raw waveforms to zero-shot speaker adaptation requiring mere seconds of audio. We've achieved remarkable data efficiency, but we've failed to build equivalent efficiency in ethical consideration.
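
As a rough illustration of the first half of that trajectory, here is a minimal sketch in PyTorch of a WaveNet-style stack of dilated convolutions. The channel and layer counts are arbitrary; the point is that the receptive field doubles with every layer while the parameter count grows only linearly:

```python
import torch
import torch.nn as nn

class DilatedStack(nn.Module):
    """Stack of causal 1-D convolutions with exponentially growing dilation."""

    def __init__(self, channels: int = 32, layers: int = 8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            # Left-pad by the dilation so the convolution stays causal
            # (each output sample sees only the past) and keeps its length.
            x = torch.relu(conv(nn.functional.pad(x, (conv.dilation[0], 0))))
        return x

# After 8 layers the receptive field is 1 + (2^0 + 2^1 + ... + 2^7) = 256
# samples, reached with only 8 small convolutions.
stack = DilatedStack()
out = stack(torch.randn(1, 32, 16_000))   # one "second" of 32-channel features
print(out.shape)                          # torch.Size([1, 32, 16000])
```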

OpenAI's response—pausing the feature after public outcry and introducing an opt-out policy for estates—demonstrates reactive rather than proactive ethics. The technical architecture was already built, the models trained, the capability deployed. Only when confronted with the specific case of a civil rights icon did the ethical implications become unavoidable. This pattern repeats across our industry: we engineer solutions to technical problems without first mapping the human terrain those solutions will inhabit.

Harmonics of Collaboration

Yet the same technology that enables troubling deepfakes also powers legitimate creative collaboration. Consider how artists like Holly Herndon are using AI not to replace human creativity but to extend it. For her 2019 album "PROTO," Herndon created Spawn—an AI collaborator trained exclusively on consenting voices, including her own and a 14-member ensemble. Each contributor was credited and compensated. The resulting work explores what she calls "Collective Intelligence" rather than extractive AI practices.

Herndon's approach offers a blueprint for ethical integration. She didn't use AI to impersonate others but to expand her own creative palette—generating new vocal textures, experimenting with percussive patterns, prototyping arrangements that blend human and algorithmic expression. The distinction matters. When an artist uses AI to manipulate their own voice or generate collaborative elements for their composition, they maintain authorship and agency. The technology becomes an instrument rather than an impersonator.

This philosophy extends beyond experimental electronic music. Grimes launched Elf.Tech in 2023, creating an AI voice model trained solely on her own vocals. The platform allows other artists to transform their performances using her voice—but only with her explicit consent, approval of each song, and a mandatory 50% royalty split. Over a thousand tracks have been created through this framework. It treats AI as a sophisticated synthesizer—capable of generating complex outputs but always under human direction and for human purposes.
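
Stripped to its essentials, a consent-first voice platform is bookkeeping: no generation or release without an approved license, and revenue split by agreement. The sketch below is hypothetical; the field names and the 50/50 default are illustrative, not Elf.Tech's actual data model or terms:

```python
from dataclasses import dataclass

@dataclass
class VoiceLicense:
    voice_owner: str        # whose voice model is being used
    performer: str          # who is performing through that voice
    track_title: str
    owner_approved: bool    # explicit, per-track consent from the voice owner
    royalty_split: float    # share owed to the voice owner, e.g. 0.5

def payout(lic: VoiceLicense, revenue: float) -> tuple[float, float]:
    """Split track revenue between the voice owner and the performer."""
    if not lic.owner_approved:
        raise PermissionError("Voice owner has not approved this track.")
    owner_share = revenue * lic.royalty_split
    return owner_share, revenue - owner_share

lic = VoiceLicense("voice_owner", "performer_42", "Demo Track", True, 0.5)
print(payout(lic, 1000.0))   # (500.0, 500.0)
```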

The music industry, with its long history of sampling debates and rights management, has developed instincts for these questions that other fields might learn from. Artists aren't using AI to deceive listeners about authorship but to collaborate across boundaries that geography or circumstance might otherwise impose.

The Recording Academy established formal boundaries on June 16, 2023, when it announced AI protocols for the 66th Grammy Awards: "Only human creators are eligible to be submitted for consideration for, nominated for, or win a GRAMMY Award." The guidelines acknowledge that music containing AI elements can be eligible, but the human authorship component must be "meaningful and more than de minimis." Only the human-authored portions can receive recognition.

These frameworks emphasize what I call "consent architecture"—building systems where permission isn't an afterthought but a fundamental design constraint. UNESCO released complementary guidance on September 7, 2023, titled "Guidance for generative AI in education and research," establishing principles around transparency, human-centered design, and data privacy protection that extend across creative fields.

Consent architecture means more than checking legal boxes; it requires imagining how bad actors will use our tools and building safeguards directly into the code. The challenge for technologists is translating these philosophical principles into executable systems. How do we build platforms that can distinguish between ethical and unethical use cases when those distinctions often depend on context?

Engineering Ethical Constraints

The answer lies not in perfect algorithmic judgment but in designing systems with built-in friction for potentially harmful uses. Consider authentication layers that verify not just identity but intent. Before generating synthetic media of any public figure, systems could require multiple forms of verification: legal documentation of rights, purpose statements subject to review, even time delays that allow for public scrutiny before release.
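
A minimal sketch of such a gate might look like the following. The field names, the 72-hour delay, and the approval flow are hypothetical design choices, not any platform's actual policy:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SynthesisRequest:
    subject: str                     # the public figure being depicted
    rights_document_id: str | None   # estate or licensing paperwork on file
    purpose_statement: str
    purpose_approved: bool           # a human has reviewed the stated purpose
    submitted_at: datetime

REVIEW_DELAY = timedelta(hours=72)   # cooling-off window before release

def may_generate(req: SynthesisRequest, now: datetime) -> bool:
    """Allow generation only with rights on file, an approved purpose, and time for scrutiny."""
    return (
        req.rights_document_id is not None
        and req.purpose_approved
        and now - req.submitted_at >= REVIEW_DELAY
    )

req = SynthesisRequest(
    subject="Public Figure",
    rights_document_id="estate-2025-0042",
    purpose_statement="Documentary archival restoration",
    purpose_approved=True,
    submitted_at=datetime(2025, 10, 1, tzinfo=timezone.utc),
)
print(may_generate(req, datetime(2025, 10, 5, tzinfo=timezone.utc)))   # True
```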

Recent developments in content authentication offer promising tools. The Coalition for Content Provenance and Authenticity released C2PA 2.1 in October 2024, integrating digital watermarking directly into media files rather than relying solely on metadata that can be easily stripped. These watermarks survive compression and editing, creating persistent provenance trails. But technical solutions alone cannot solve ethical problems—they must be paired with cultural commitment to verification and accountability.

We need systems that can mark AI-generated content at the data level—watermarks that survive across platforms, metadata that travels with files, authentication chains that verify provenance. But we also need to cultivate public understanding of these systems. The most elegant cryptographic solution fails if users don't understand why verification matters.
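
One way to picture such an authentication chain is as a hash chain over a file's editing history, where every step commits to everything that came before it, so a stripped or reordered history no longer verifies. This is purely illustrative and is not the C2PA manifest format:

```python
import hashlib
import json

def step_hash(prev_hash: str, action: str, payload: bytes) -> str:
    """Hash one provenance step together with the hash of the previous step."""
    h = hashlib.sha256()
    h.update(prev_hash.encode())
    h.update(action.encode())
    h.update(payload)
    return h.hexdigest()

def build_chain(original: bytes, actions: list[tuple[str, bytes]]) -> list[dict]:
    chain = [{"action": "created", "hash": step_hash("", "created", original)}]
    for action, payload in actions:
        chain.append({"action": action,
                      "hash": step_hash(chain[-1]["hash"], action, payload)})
    return chain

# Each entry's hash depends on the full history, so removing the synthesis
# step (or quietly inserting one later) changes every subsequent hash.
chain = build_chain(b"raw-audio-bytes", [
    ("ai_voice_synthesis", b"model=hypothetical-vocoder"),
    ("compression", b"codec=opus"),
])
print(json.dumps(chain, indent=2))
```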

The Frequency of Trust

Trust in media has always operated on specific frequencies: we learned to recognize the subtle distortions of early recording technology, the compression artifacts of MP3s, the particular grain of film versus digital video. Each generation develops literacy in the media of its time. But AI-generated content operates at a frequency below our perception threshold, producing copies that are often indistinguishable from their originals.

The math is unforgiving: modern neural vocoders convert spectral features to waveforms with loss functions tuned to the frequencies that matter most for the human voice (approximately 300 to 4000 Hz), and many model each output sample with a Gaussian or Laplacian distribution, capturing the natural variations of human speech with disturbing accuracy. When you can mathematically reconstruct authenticity, authenticity itself becomes a negotiable concept.
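
As an illustration of what "Laplacian" means in that context, here is a minimal sketch of the per-sample negative log-likelihood such a vocoder would minimize during training; the arrays below are random placeholders rather than real model predictions:

```python
import numpy as np

def laplace_nll(x: np.ndarray, mu: np.ndarray, b: np.ndarray) -> float:
    """Mean negative log-likelihood of samples x under Laplace(mu, b)."""
    return float(np.mean(np.log(2 * b) + np.abs(x - mu) / b))

# The network would predict a location (mu) and scale (b) for every audio
# sample; training pushes the real samples toward high probability.
x = np.random.uniform(-1.0, 1.0, size=16_000)   # one second of "audio"
mu = np.zeros_like(x)                           # predicted sample locations
b = np.full_like(x, 0.3)                        # predicted sample scales
print(laplace_nll(x, mu, b))
```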

This presents both a technical and cultural challenge. Deepfake detection tools remain limited—the U.S. Government Accountability Office noted in March 2024 that current technologies "have limited effectiveness in real-world scenarios." Audio deepfake detectors show "poor generalization ability" when confronted with new synthesis techniques. We're in an arms race where the forgeries evolve faster than our detection capabilities.

Composing the Future

The path forward requires what I think of as compositional thinking—understanding that we're not just building individual tools but orchestrating entire systems of creation, distribution, and consumption. Each component must harmonize with the others while maintaining its distinct voice.

This means developing AI tools that enhance rather than replace human creativity. It means building business models that compensate original creators when their work trains AI systems. The Human Artistry Campaign, launched in March 2023 by over 40 music industry organizations, established core principles: technology as tool, not replacement; copyright protection for human creators only; transparency in AI-generated content; and fair compensation when artists' work is used for training.

Most importantly, it means remembering that technology serves humanity, not the inverse. Holly Herndon puts it plainly: "I think the headline is that we have now reached a point where we feel entitled to sample human beings and their likeness without anyone questioning that." We're now questioning it—but the questions arrive late, after the technology has already been deployed, after the harm has already been demonstrated.

The Resonance of Responsibility

As someone who approaches code as a creative medium, I believe we have a responsibility to write algorithms that respect the dignity of the human voice, both literal and metaphorical. The MLK deepfake incident wasn't just about one unauthorized use of technology; it was about our industry's ongoing failure to consider the full resonance of our creations.

Moving forward, we need frameworks that are both technically robust and ethically grounded. This means involving not just engineers and executives but historians, ethicists, artists, and the communities most likely to be affected by synthetic media. It means building systems that default to consent rather than assumption, transparency rather than obscurity, human agency rather than algorithmic determinism.

The mathematics that allow us to synthesize any voice also give us the capability to verify authenticity, to track provenance, to ensure consent. The same neural networks that can deceive can also be trained to authenticate. The U.S. Copyright Office clarified in March 2023 that copyright protection requires human authorship—works produced entirely by AI cannot be copyrighted. This legal framework aligns with the technical and ethical reality: human creativity remains irreplaceable.

We're composing the future one algorithm at a time. Each function we write, each model we train, each system we deploy adds another note to this vast symphonic work. The question isn't whether our code will shape culture—it already does. The question is whether we'll write music worth listening to, voices worth trusting, and systems that amplify the best rather than the worst of human expression. The technical capability exists. What we need now is the wisdom to use it well.


Models used: gpt-4.1, claude-opus-4-1-20250805, claude-sonnet-4-20250514, gpt-image-1
