April 16, 2024, 07:00 Review

"VoiceCraft" can synthesize speech from just a few seconds of audio

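A research team led by the University of Texas at Austin has announced "VoiceCraft," an AI that can edit and synthesize speech zero-shot, that is, on tasks not covered by its training data.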

VoiceCraft
https://jasonppy.github.io/VoiceCraft_web/

The newly announced VoiceCraft is a neural codec language model (Neural Codec Language Model) that takes inspiration from text-and-image multimodal models and enables zero-shot text-to-speech (TTS) synthesis and speech editing.

VoiceCraft can edit speech so that the result sounds very natural. First, below is the original audio, in which the speaker says: "but the renaissance broke their monopoly on knowledge, one of the most important bastions of the church."

Next, below is the audio edited with VoiceCraft. The speaker now says: "but the renaissance broke their monopoly on knowledge, with it's free movement of research and endless scientific inquiry, one of the most important bastions of the church." The part added by VoiceCraft is "with it's free movement of research and endless scientific inquiry,".

VoiceCraft is available on GitHub and Hugging Face, so you can try it yourself.

GitHub - jasonppy/VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild
https://github.com/jasonppy/VoiceCraft

VoiceCraft - a Hugging Face Space by pyp1
https://huggingface.co/spaces/pyp1/VoiceCraft_gradio

So we decided to try out the VoiceCraft demo hosted on Hugging Face. Opening the URL above brings up the following screen.