Category: Corpus Design

Help me with revising the experimental Arpasing recording script

Hello folks,

After 300 CPU-hours of intense computation, a new recording script for Arpasing has been generated.

Unlike the lists you've seen before, this one consists of actually meaningful words and phrases selected from a small subset of public-domain books from Project Gutenberg. It differs from CMU Arctic (which is also based on Gutenberg) in that the new Arpasing script is designed for singing synthesis. Here's a glimpse of the first few lines:

1 was Matthew's consolatory rejoinder
2 he stood before the wide opening
3 we shall begin our researches here
4 with his usual craving on him
5 but this little line of dancing men
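
For readers curious how lines like these can be picked out of a large text corpus, here is a minimal sketch of greedy coverage-driven selection, a common approach in corpus design. It is not necessarily the procedure used to generate this script. The function names, the toy candidates and their hand-written ARPAbet-style transcriptions are all assumptions made for illustration.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in one utterance."""
    return set(zip(phones, phones[1:]))


def greedy_select(candidates, max_lines):
    """Repeatedly pick the candidate that adds the most unseen diphones.

    `candidates` maps a text line to its phone sequence (hypothetical input
    format). Stops when no candidate adds anything new or the quota is met.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_lines:
        best = max(remaining, key=lambda s: len(diphones(remaining[s]) - covered))
        gain = diphones(remaining[best]) - covered
        if not gain:
            break  # nothing left contributes new diphones
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered


# Toy candidates; transcriptions are hand-written and illustrative only.
candidates = {
    "he stood": ["hh", "iy", "s", "t", "uh", "d"],
    "stood": ["s", "t", "uh", "d"],
    "dancing men": ["d", "ae", "n", "s", "ih", "ng", "m", "eh", "n"],
}

chosen, covered = greedy_select(candidates, max_lines=10)
print(chosen)
print(len(covered), "distinct diphones covered")
```

In this toy run "stood" is never selected, because every diphone it contains is already supplied by "he stood". On a real corpus the same loop runs over a very large number of candidate phrases, which is one reason this kind of search gets expensive.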

Arpasing Standard+Script+Plugin Update

This is an update to Arpasing standard (0.2.0), recording script (0.2.0) and Arpasing Assistant plugin (0.2).

Since the initial launch a few months ago I've received a lot of feedback on various aspects of the Arpasing framework. This update is a major renovation of the whole system in response to that feedback. The major changes are:

Arpasing standard:

  • Triphone/quadphone units are removed.
  • "ax" (schwa) is added to the phone set.

Arpasing recording script:

  • The N-gram coverage script is redesigned.
  • Extension script "isolated vowels" is added.

Arpasing Assistant:

  • Bug fixes.
  • An option is added to force the note duration to be uniform (explained in detail in readme.txt).

Arpasing toolkit (version 0.2) can be downloaded from here.

I'm also going to update Moresampler very soon.


Introducing Arpasing for English UTAUloids

A few months ago I uploaded a document proposing a new English UTAU recording script with detailed specifications. As an attempt to basically replicate a unit-selection based speech synthesizer in UTAU, the new standard was named Arpasing. Thanks to a few users who recorded the very first Arpasing voicebanks in spite of the lack of clear instructions, we're now able to further explore the uncharted land with Moresampler 0.8.0's built-in support for Arpasing oto generation. Here I'm launching another tool, this time for actually creating USTs with Arpasing.

Please keep in mind that Arpasing is an experiment, and we won't know whether it works well until more effort goes into revising the tools & voicebanks.


Initial Results & Facts on Arpasing

Shortly after the release of the Arpasing proposal, Adlez recorded a test voicebank and sent it back to me in almost no time. I'd like to thank Uchuu and BagHeadChan for their contributions as well. Their data really helped me work out a roughly working solution to voicebank labeling, so I could start actually making songs with Arpasing voicebanks.

Here are two short samples I've created with Adlez's voicebank:

Unravel (English ver, lyrics by Lucy)

("Oh won't you tell me, please just tell me, explain how this should work. I fear who could it be, that lives inside of me. My conscience cracking, mind reacting, surrounded by the world. But here you're smiling bright, completely blind to life.")

Tokio Funka (English ver, lyrics by Jayn)

("In a hazy town with deception all around, there's a dusty cloud that drifts downward to the ground.")


Proposal of Arpasing, English UTAU Reclist & Specs

At the end of this two-month initial design phase, I proudly present Arpasing, a scientifically-designed English recording script and naming standard for UTAU.

Immediately after Moresampler 0.7.0's release, people were asking me if the oto generation feature would ever support English. I said yes, but later found that the method for Japanese wasn't directly applicable to English, because of the myriad of different recording schemes, each with its own phonetic notation. Writing wrappers for each of them seemed quite a troublesome business, so I began to explore whether it's possible to unify the existing solutions into one.

I also tried to tailor the corpus design methods from academic publications to this very specific problem. The result may appear a bit bizarre and intimidating at first glance, but it is far more concise than previous recording schemes. It achieves 96% diphone coverage in 120 tri-syllable utterances, and 42% triphone coverage in another 100 utterances.
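
To make a figure like "96% diphone coverage" concrete, here is a minimal sketch of one way such a number could be computed. It assumes coverage means the share of all ordered phone pairs that appear at least once in the script. The actual definition behind the figures above may differ, for example by excluding pairs that never occur in English or by counting silence boundaries. The function name and the three-phone toy set are hypothetical.

```python
from itertools import product


def diphone_coverage(utterances, phone_set):
    """Fraction of all ordered phone pairs seen in at least one utterance.

    Assumption: the target set is every ordered pair drawn from `phone_set`,
    including a phone followed by itself.
    """
    targets = set(product(phone_set, repeat=2))
    seen = set()
    for phones in utterances:
        seen.update(zip(phones, phones[1:]))  # adjacent pairs in this utterance
    return len(seen & targets) / len(targets)


# Toy data: a 3-phone set gives 9 possible diphones.
phone_set = ["k", "ae", "t"]
utterances = [["k", "ae", "t"], ["t", "ae", "k"]]

print(f"{diphone_coverage(utterances, phone_set):.0%}")
```

The two toy utterances contain four of the nine possible pairs. Triphone coverage can be measured the same way with `repeat=3` and triples of adjacent phones, though the target set grows much faster, which is consistent with the far lower triphone figure.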