A few months ago I uploaded a document proposing a new English UTAU recording script with detailed specifications. As an attempt to replicate a unit-selection based speech synthesizer in UTAU, the new standard was named Arpasing. Thanks to a few users who recorded the very first Arpasing voicebanks despite the lack of clear instructions, we're now able to explore this uncharted land further with Moresampler 0.8.0's built-in support for Arpasing oto generation. Here I'm launching another tool, this time for actually creating USTs with Arpasing.
Please keep in mind that Arpasing is an experiment; we won't know whether it works well until more effort goes into revising the tools and voicebanks.
Before getting into details, I find it helpful to first explain the concept of "unit-selection". The "standard" way of speech synthesis prior to 1995 was to concatenate samples from a fixed inventory of diphone recordings, with each diphone occurring exactly once in the database. This is the same as UTAU's approach, if multi-pitch voicebanks are not considered. Circa 1995 a few papers on unit-selection, an extended form of concatenative synthesis, were presented by a research group at CMU, and it quickly became the new industry standard. Unit-selection differs from the previous approach in that the database allows "redundancy", or duplication of the same diphone. Instead of looking for a one-to-one mapping between diphones and speech samples, the "best" samples are selected at run time from a range of alternatives, according to the phonetic and semantic context. While concatenative approaches are nowadays largely replaced by machine learning approaches for speech synthesis, unit-selection remains popular in the context of singing.
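To make the idea concrete, here's a toy sketch of unit-selection in Python. It is not how Arpasing Assistant or any real synthesizer is implemented; the `Unit` fields, both cost functions and the tiny inventory are assumptions made up for illustration. The sketch uses the common textbook formulation in which a "target cost" measures how well a candidate's recorded context matches the wanted context, a "join cost" measures how smoothly two neighbouring candidates connect, and a dynamic-programming (Viterbi) search picks the cheapest sequence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    # All fields are hypothetical features chosen for this illustration.
    diphone: str     # e.g. "k ae"
    prev_phone: str  # phoneme that preceded the diphone in the recording
    next_phone: str  # phoneme that followed the diphone in the recording
    pitch: float     # average pitch of the sample in Hz

def target_cost(unit, prev_phone, next_phone):
    """Number of neighbouring phonemes that differ from the wanted context."""
    return (unit.prev_phone != prev_phone) + (unit.next_phone != next_phone)

def join_cost(left, right):
    """Toy smoothness measure: penalise pitch jumps between adjacent units."""
    return abs(left.pitch - right.pitch) / 100.0

def select_units(targets, inventory):
    """Return the cheapest unit sequence for the given targets.

    targets:   list of (diphone, prev_phone, next_phone) tuples
    inventory: dict mapping a diphone to its list of candidate Units
    """
    if not targets:
        return []
    # best[i][j] = (cost of the best path ending at candidate j, backpointer)
    best = []
    for i, (diphone, prev_p, next_p) in enumerate(targets):
        row = []
        for unit in inventory[diphone]:
            tc = target_cost(unit, prev_p, next_p)
            if i == 0:
                row.append((tc, None))
            else:
                prev_candidates = inventory[targets[i - 1][0]]
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev_unit, unit), k)
                    for k, prev_unit in enumerate(prev_candidates)
                )
                row.append((cost + tc, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(inventory[targets[i][0]][j])
        j = best[i][j][1]
    return path[::-1]

# A made-up inventory with two alternatives per diphone.
inventory = {
    "k ae": [Unit("k ae", "s", "t", 220.0), Unit("k ae", "-", "t", 200.0)],
    "ae t": [Unit("ae t", "k", "-", 205.0), Unit("ae t", "b", "s", 200.0)],
}
# The middle of the word "cat": "-" stands for silence.
targets = [("k ae", "-", "t"), ("ae t", "k", "-")]
chosen = select_units(targets, inventory)
```

With a classic diphone database each list in `inventory` would hold exactly one entry and there would be nothing to choose; the redundancy is what gives the search something to optimise.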
Influenced by unit-selection, two ideas are central to Arpasing:
- The recording script is designed in an algorithmic manner that covers as many phonetic contexts as possible;
- There usually exists one or more alternatives to a diphone unit in an Arpasing voicebank.
The new tool, Arpasing Assistant, is an UTAU plugin for converting English words into Arpabet phonemes and selecting units from the voicebank. In many senses it resembles a speech synthesizer frontend, but for UTAU.
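As a rough illustration of the first half of that job, the sketch below turns words into Arpabet phonemes and then into a diphone sequence. It is not the plugin's actual code: the two-entry `TOY_DICT` is hand-written for this example, the function names are hypothetical, and the space-separated alias format with "-" for silence is an assumption about how units are labelled.

```python
# Hand-written toy dictionary (stress markers omitted); a real frontend
# would use a full pronunciation dictionary plus rules for unknown words.
TOY_DICT = {
    "pure": ["p", "y", "uh", "r"],
    "imagination": ["ih", "m", "ae", "jh", "ah", "n", "ey", "sh", "ah", "n"],
}

def words_to_phonemes(words):
    """Look up each word and concatenate the phonemes.

    Raises KeyError for words missing from the toy dictionary.
    """
    phonemes = []
    for word in words:
        phonemes.extend(TOY_DICT[word.lower()])
    return phonemes

def phonemes_to_diphones(phonemes, silence="-"):
    """Pair each phoneme with its successor, padding both ends with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a} {b}" for a, b in zip(padded, padded[1:])]

diphones = phonemes_to_diphones(words_to_phonemes(["Pure"]))
print(diphones)
```

Each entry in the resulting list is a diphone for which the voicebank may offer several alternative samples, and picking among them is the unit-selection half of the job.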
Here's a video I created to show you how these tools work together: (you may want to choose "HD 720P")
You've probably noticed that the plugin is not perfect, and in some cases far from it. In general, making a song still requires a lot of manual tweaking, which slows audio production down to roughly 2 days/song, though the result usually sounds a lot better after tweaking. The plugin's imperfections can be attributed to:
- UTAU's API provides very limited information for unit-selection and calculating cross-fading parameters, since it wasn't designed for such a purpose in the first place;
- The oto.ini used in this video was generated entirely by Moresampler and contains some errors;
- This is a zeroth-generation Arpasing voicebank, made before an Arpasing UTAUloid even exists, and we can't expect everything to be perfect at the beginning.
Another drawback is that unit-selection makes it difficult to share USTs across voicebanks, since the voicebanks could be based on different recording scripts. In addition, the duration and envelope parameters are largely speaker-dependent. Thus, the proper way to share an Arpasing UST is to send it as it was before being processed by Arpasing Assistant, although this doesn't save much effort on tuning.
It would be great if someone could write up a summary of tips on how to manually tweak USTs for Arpasing voicebanks and even better if there's a summary also on how to record.
By the way, Arpasing Assistant also runs in standalone mode as a phoneme dictionary and it's pretty robust on borrowed words:
Arpasing Assistant, along with the index.csv file to be placed under the voicebank directory, can be downloaded from here.
Finally, here's the full version of the demo song, Pure Imagination by Klad Arpasing.