xVASynth 2 - F4VA Synth 2.1.0

xVASynth is an AI tool for generating high-quality voice acting lines using voices from video games. The app supports hundreds of voices, across dozens of games, and provides pitch, duration, and energy control at per-letter granularity.

List of voices available for xVASynth, from both myself and the community: Google doc link
You can submit models at the following link, if you train them with xVATrainer: Google forms link

Quick intro

xVASynth is an AI-based app for creating new voice lines using neural speech synthesis. The app loads models individually trained on character voice data from games. It gives users control over details such as the pitch and duration of individual letters, which in turn gives control over emotion and emphasis. To see it in action, watch these short intro/tutorial videos, narrated by various supported voices:

Supported games
Skyrim (SKVASynth)
Fallout 4 (F4VASynth) <-- you are here
Oblivion (OBVASynth)
Fallout New Vegas (NVVASynth)
Morrowind (MWVASynth)
Fallout 3 (F3VASynth)
Starfield (SFVASynth) soon™
Fallout 76 (F76VASynth)
Cyberpunk 2077 (CPVASynth)
Civilization (CIVVASynth)
Mass Effect (MEVASynth)
The Witcher (WVASynth)
Humankind (HKVASynth)
Dragon Age (DAVASynth)
Overwatch (OWVASynth)
and other games/series currently without a Nexus page (Final Fantasy, Borderlands, Bioshock, GTA 4, GTA 5, GTA:SA, Resident Evil, Red Dead Redemption 2, Command and Conquer, and others)

Discord: https://discord.gg/nv7c6E2TzV
Patreon: https://www.patreon.com/xvasynth
Twitter: @dan_ruta

Preface: The tool does not re-distribute any game assets, nor does it interact with them in any way. Game assets are used only during voice training as a reference, to guide the algorithm towards a point where it can create voices that sound similar enough to the examples. Think of it as an automated digital impersonator. Regardless, avoid using the tool in an offensive/explicit manner. Where you can, make it obvious in descriptions that the voice samples are generated and are not from real human voice actors. Any issues you cause with this are on you.

Introduction

xVASynth (or F4VASynth, for Fallout 4 voices) is an AI app that generates voice acting lines using specific voices from video games. It can do text-to-speech (TTS) from text input, or speech-to-speech (S2S) from audio input. The app uses FastPitch [1,2] models, which give users artistic control over pitch, duration, and energy values for every letter in the audio. They also allow generating audio with explicitly defined pronunciation via ARPAbet [3] notation.

The use of neural speech synthesis leads to natural-sounding voices, which is very difficult to achieve with more traditional methods that concatenate existing recordings. It also means new vocabulary can be generated, outside of what the voice actors have already read out.

Speech to speech

The app can also do speech-to-speech, rather than text-to-speech. In this mode, you provide a reference dialogue line, and the app tries to infer all the pitch/energy/duration values from the audio, for each text character. You can provide the exact text transcript of the reference audio in the input textarea, or you can leave it blank to have the app try to infer the text as well. You can provide a reference audio line by recording with your microphone (by clicking the icon), or by dragging and dropping an audio file onto the icon. You must first select an INPUT voice model, which must sound as similar as possible to the reference audio, and it must be a v2 model.

ARPAbet pronunciation

You can specify exact pronunciation for words by using ARPAbet notation between { } brackets in the input, or by managing words in your own (or other people's) dictionaries. Included is CMUdict with 135k words with American-English pronunciations.
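To make the { } notation concrete, here is a minimal Python sketch that splits an input line into plain text and ARPAbet spans. It is only an illustration of how such input is structured, not the app's own parser; the function name split_arpabet is made up for this example, and the phonemes shown are the CMUdict entry for "hello".

```python
import re

# Matches one {...} span; the braces mark explicit ARPAbet pronunciation.
ARPABET_SPAN = re.compile(r"\{([^{}]*)\}")


def split_arpabet(text):
    """Split text into ("text", str) and ("arpabet", [phonemes]) segments.

    Hypothetical helper for illustration only; not part of xVASynth.
    """
    segments = []
    pos = 0
    for match in ARPABET_SPAN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        phonemes = match.group(1).split()
        if phonemes:
            segments.append(("arpabet", phonemes))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    line = "Well, {HH AH0 L OW1} there, traveller."
    for kind, value in split_arpabet(line):
        print(kind, value)
```

Each phoneme is separated by a space, and vowels carry a stress digit (0, 1 or 2), which is why the span reads "HH AH0 L OW1" rather than a single word.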

Batch Mode

For larger projects, where you need to synthesize a large number of lines, you can use the Batch synthesis mode instead. You can use either a .txt file or a .csv file to batch generate hundreds or even thousands of lines in one go, with parallelization. The pitch/duration/energy editor is sometimes needed to get a line sounding just right, but sometimes it isn't, and batch mode is a good way to get an initial pass on lines. Using the GPU is highly recommended for this, as it greatly increases the number of lines that can be generated in parallel (limited by VRAM). You should also check the various settings, such as multi-threading, to get the best possible speed out of this for your system.
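If your lines live in a script or spreadsheet, a small Python helper can generate the .csv file for you. The sketch below is a rough example under stated assumptions: the column names (game_id, voice_id, text, out_path), the voice ID and the output folder are all hypothetical, so check the batch panel in the app for the columns it actually expects before using it.

```python
import csv
import io

# ASSUMPTION: these column names are illustrative only. Confirm the real
# column names against the instructions shown in the app's batch panel.
COLUMNS = ["game_id", "voice_id", "text", "out_path"]


def build_batch_csv(lines, game_id, voice_id, out_dir):
    """Return CSV text with one row per line of dialogue."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for index, text in enumerate(lines, start=1):
        writer.writerow({
            "game_id": game_id,
            "voice_id": voice_id,
            "text": text,
            # Zero-padded names keep the output files in script order.
            "out_path": f"{out_dir}/line_{index:04d}.wav",
        })
    return buffer.getvalue()


if __name__ == "__main__":
    dialogue = ["Hello, there.", "War never changes."]
    # "example_voice" and "batch_out" are placeholders, not real IDs or paths.
    print(build_batch_csv(dialogue, "fallout4", "example_voice", "batch_out"))
```

The csv module takes care of quoting, so lines containing commas or quotation marks stay in a single column.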

3D Voice embeddings visualizer

The 3D voice embeddings visualizer is an interactive panel where you can explore all the voices in the app, as seen by an AI representation learning model, projected down to 3D. There are no axes, and this serves purely as a visualization, to enable voice discovery. You can colour the points by game or gender, and you can enable/disable specific games/voices. If a voice is installed, you can load it by clicking its point and then the "Load" button.

App installation

You may need to install Microsoft Visual C++ Redistributable if you don't already have it. To install the app, download it and extract it anywhere you'd like (it does not need to be in any game directory). You can optionally download the WaveGlow models (and place the files in ./resources/app/models), if you'd like more options for the vocoder used, but the bespoke HiFi-GAN vocoders included with each voice are almost always the highest quality vocoders, and by far the quickest. Launch the app by double-clicking the xVASynth.exe file. If you have any issues, try running it as admin, but be mindful that Electron on Windows has some issues with drag+drop events when running as Admin.

Important: Make sure you click "Allow" if Windows asks you for permission to run the Python server. I use a local HTTP server to enable communication between the Python code (for the AI models) and the JavaScript code (for the Electron front-end). If there are any issues, check the server.log/app.log files (located next to xVASynth.exe) - there should be an error at the end, which I'll need to see to help with issues.
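When reporting a problem, the useful part of the logs is usually the last few lines. As a convenience, the following Python sketch prints the tail of server.log and app.log from the folder containing xVASynth.exe. The function names and the install path in the example are made up for illustration; only the two log file names come from the text above.

```python
from collections import deque
from pathlib import Path

LOG_NAMES = ("server.log", "app.log")


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file, or [] if it is missing."""
    path = Path(path)
    if not path.is_file():
        return []
    # errors="replace" avoids crashing on unexpected bytes in the log.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def collect_log_tails(app_dir, count=20):
    """Map each log file name to its last `count` lines."""
    return {name: tail_lines(Path(app_dir) / name, count) for name in LOG_NAMES}


if __name__ == "__main__":
    # Placeholder path: point this at the folder containing xVASynth.exe.
    for name, lines in collect_log_tails("C:/xVASynth").items():
        print(f"--- {name} ---")
        print("\n".join(lines) if lines else "(not found or empty)")
```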

Voice installation

The recommended way to install voices is through the Nexus API integration. However, if you don't have a Nexus Premium membership, or you'd prefer manual installation, you need to download the individual .zip files from the game-specific Nexus pages (such as this one) and extract the voice files into the app directory, at this location: <.exe location>/resources/app/models/<game> where <game> is the game ID. The voice .zip files already contain the required directory structure, so all you need to do is drag+drop the extracted "resources" folder from the .zip files into the folder where the xVASynth.exe file is (replacing files if prompted).
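If you have many voice .zip files to install by hand, the same drag+drop step can be scripted. This is a minimal Python sketch, assuming each archive has the top-level "resources" folder described above; the function name install_voice_zip is hypothetical, and existing files with the same names are overwritten, just as when you choose to replace files manually.

```python
import zipfile
from pathlib import Path


def install_voice_zip(zip_path, install_dir):
    """Extract a voice archive into the folder containing xVASynth.exe.

    Returns the list of extracted file paths. Hypothetical helper, not
    part of the app.
    """
    install_dir = Path(install_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # Ignore directory entries; only files matter for the layout check.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        if not all(n.startswith("resources/") for n in names):
            raise ValueError("archive does not have the expected 'resources/' layout")
        archive.extractall(install_dir)
    return [install_dir / n for n in names]
```

For example, install_voice_zip("example_voice.zip", "C:/xVASynth") would place the files under C:/xVASynth/resources/app/models/<game>/ (both the archive name and the install path here are placeholders).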

To confirm, when installing voices, you should see 4 files (a .json, a .pt, a .hg.pt, and a .wav file) all named as the voice you're downloading, in <your xVASynth install directory>/resources/app/models/<game>/ (where <game> is fallout4, for models on this page).
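The four-file check can also be done with a few lines of Python. This sketch takes the models folder for a game and a voice name, and reports which of the expected files are missing. The function name and the example path are assumptions for illustration; the four file extensions are the ones listed above.

```python
from pathlib import Path

# The four files expected for each installed voice.
REQUIRED_SUFFIXES = (".json", ".pt", ".hg.pt", ".wav")


def missing_voice_files(models_dir, voice_name):
    """Return the expected file names for a voice that are not present."""
    models_dir = Path(models_dir)
    return [
        voice_name + suffix
        for suffix in REQUIRED_SUFFIXES
        if not (models_dir / (voice_name + suffix)).is_file()
    ]


if __name__ == "__main__":
    # Placeholder path and voice name: adjust both to your own install.
    missing = missing_voice_files(
        "C:/xVASynth/resources/app/models/fallout4", "example_voice"
    )
    print("All files present" if not missing else f"Missing: {missing}")
```

An empty result means all four files are in place for that voice.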

Important: If you move the app files to a different directory, you MUST update the model paths in the settings, because these folder paths get initialized with the full path (starting from the drive letter) - basically, just make sure the app is looking in the new place where your models are, rather than the old folder. The app also allows you to set a different folder to store your voice models in, rather than nested in your app installation directory. The easier thing to do long-term would be to find somewhere not in your app installation folder to store your models, and set the app file paths to point there.