I'm not sure whether the speech API on AWS allows tonal input, but even if it does, I see problems with this suggestion:

  1. When you run a speech block with a new configuration for the first time, it takes a while before the speech is played. That's because the block fetches the speech audio on the fly, then caches it for later. This could pose a problem for anything where timing is critical.
    We could mitigate this by fetching the speech before the block is run, but I can't imagine the number of unused speech requests that would generate…
  2. How would we handle multi-syllable words that have different notes on different syllables? Do we:
    a. try to spell them out by sound:
    sing [day] at note (78 v) for (1) beats :: extension
    sing [zee] at note (75 v) for (1) beats :: extension // not *see, since the <s> in <Daisy> is voiced (spoken like Z)
    sing [day] at note (71 v) for (1) beats :: extension
    sing [zee] at note (66 v) for (1) beats :: extension // ditto
    sing [give] at note (68 v) for (0.333) beats :: extension
    sing [me] at note (70 v) for (0.333) beats :: extension
    sing [your] at note (71 v) for (0.333) beats :: extension
    sing [en] at note (68 v) for (0.666) beats :: extension // en [ˈɛn] is close enough to the first syllable of answer [ˈænsɚ]
    sing [sir] at note (71 v) for (0.333) beats :: extension // so is [sˈɜː]
    sing [do] at note (66 v) for (2) beats :: extension
    b. use hyphens to mark that a word is split into syllables (this also means each block needs a “lookahead” to its neighbours to know the whole word)
    sing [dai-] at note (78 v) for (1) beats :: extension
    sing [-sy] at note (75 v) for (1) beats :: extension
    sing [dai-] at note (71 v) for (1) beats :: extension
    sing [-sy] at note (66 v) for (1) beats :: extension
    sing [give] at note (68 v) for (0.333) beats :: extension
    sing [me] at note (70 v) for (0.333) beats :: extension
    sing [your] at note (71 v) for (0.333) beats :: extension
    sing [an-] at note (68 v) for (0.666) beats :: extension
    sing [-swer] at note (71 v) for (0.333) beats :: extension
    sing [do] at note (66 v) for (2) beats :: extension
    c. defenestrate this English spelling nonsense and use something like X-SAMPA or the IPA for the lyrics
    sing [d'eI] at note (78 v) for (1) beats :: extension
    sing [zi] at note (75 v) for (1) beats :: extension
    sing [d'eI] at note (71 v) for (1) beats :: extension
    sing [zi] at note (66 v) for (1) beats :: extension
    sing [g'Iv] at note (68 v) for (0.333) beats :: extension
    sing [m'i:] at note (70 v) for (0.333) beats :: extension
    sing [j'U@] at note (71 v) for (0.333) beats :: extension
    sing [\{n] at note (68 v) for (0.666) beats :: extension
    sing [s3] at note (71 v) for (0.333) beats :: extension
    sing [d'u:] at note (66 v) for (2) beats :: extension
  3. How would we handle edge cases like “singing” a sentence or two over a short period of time?
    sing [A nutshell is the outer shell of a nut.] at note (127 v) for (0.001) beats :: extension
    Or singing for a very long time?
    sing [aaaaaa] at note (0 v) for (99999) beats :: extension
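
To make options 2b and 3 a bit more concrete, here is a rough Python sketch, not how Scratch or the text-to-speech extension actually works. The `SingBlock` class and the helper names are made up for illustration, and the 60 bpm default tempo is an assumption. It shows the "lookahead" needed to rebuild a word from hyphenated blocks, and the arithmetic behind the two edge cases.

```python
from dataclasses import dataclass


@dataclass
class SingBlock:
    """Hypothetical stand-in for one `sing [...] at note (...) for (...) beats` block."""
    lyric: str    # text in the block, e.g. "dai-" or "-sy"
    note: int     # MIDI note number, 0..127
    beats: float  # duration in beats


def group_syllables(blocks):
    """Join hyphen-marked fragments into (word, [blocks]) pairs.

    A trailing "-" means the word continues in the next block, so the
    full word is only known after looking ahead past the current block.
    """
    words = []
    fragments, members = [], []
    for block in blocks:
        fragments.append(block.lyric.strip("-"))
        members.append(block)
        if not block.lyric.endswith("-"):
            words.append(("".join(fragments), members))
            fragments, members = [], []
    if members:  # a dangling "dai-" with nothing after it
        words.append(("".join(fragments), members))
    return words


def note_to_hz(note):
    """Equal-tempered frequency of a MIDI note (note 69 = A4 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def beats_to_seconds(beats, tempo_bpm=60.0):
    """Duration in seconds; the 60 bpm default is an assumption."""
    return beats * 60.0 / tempo_bpm


song = [
    SingBlock("dai-", 78, 1), SingBlock("-sy", 75, 1),
    SingBlock("give", 68, 0.333),
    SingBlock("an-", 68, 0.666), SingBlock("-swer", 71, 0.333),
]
for word, parts in group_syllables(song):
    print(word, [p.note for p in parts])
```

Under that 60 bpm assumption, `beats_to_seconds(0.001)` is one millisecond for a whole sentence and `beats_to_seconds(99999)` is roughly 27.8 hours. `note_to_hz(0)` comes out at about 8.18 Hz, which is below the usual ~20 Hz limit of hearing, and `note_to_hz(127)` at about 12.5 kHz. Any real implementation would need to clamp or reject values like these.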