I'm not sure if the speech API on AWS allows tonal input, but even if it does, I see problems with this suggestion:

When you run a speech block with a new configuration for the first time, it takes a while before the speech is played. That's because the block fetches the speech audio on the fly, then caches it for later. This could pose a problem for anything where timing is critical.
We could mitigate this by fetching the speech before the block runs, but I can't imagine the number of unused speech requests that would come from this…
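
To make the trade-off concrete, here's a rough Python sketch. It is not how the actual extension works: `SpeechCache`, `fake_fetch` and the idea that the audio depends on both the text and the note are all my own assumptions, and the fake fetch just stands in for the network request.

```python
import itertools


class SpeechCache:
    """Toy cache: fetches audio on first use, reuses it afterwards."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical callable: (text, note) -> audio bytes
        self.store = {}
        self.requests = 0       # how many fetches were actually made

    def get(self, text, note):
        key = (text, note)
        if key not in self.store:
            # Cache miss: this is the slow "first run" of a block.
            self.requests += 1
            self.store[key] = self.fetch(text, note)
        return self.store[key]

    def prefetch(self, texts, notes):
        # Fetch every combination up front so later calls are instant.
        for text, note in itertools.product(texts, notes):
            self.get(text, note)


def fake_fetch(text, note):
    # Stand-in for the real request; returns dummy bytes.
    return f"{text}@{note}".encode()


cache = SpeechCache(fake_fetch)
cache.prefetch(["day", "zee"], range(60, 72))  # 2 syllables x 12 notes
audio = cache.get("day", 66)                    # already cached, no new request
print(cache.requests, "requests made for 1 sound actually played")
```

Prefetching removes the delay, but here it costs 24 requests to play a single sound, which is the waste I'm worried about.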

How would we handle multi-syllable words that have different tones on different syllables? Do we:
a. try to sound them out:

sing [day] at note (78 v) for (1) beats :: extension
sing [zee] at note (75 v) for (1) beats :: extension // not *see, since the <s> in <Daisy> is voiced (spoken like Z)
sing [day] at note (71 v) for (1) beats :: extension
sing [zee] at note (66 v) for (1) beats :: extension // ditto
sing [give] at note (68 v) for (0.333) beats :: extension
sing [me] at note (70 v) for (0.333) beats :: extension
sing [your] at note (71 v) for (0.333) beats :: extension
sing [en] at note (68 v) for (0.666) beats :: extension // en [ˈɛn] is close enough to the first syllable of answer [ˈænsɚ]
sing [sir] at note (71 v) for (0.333) beats :: extension // so is [sˈɜː]
sing [do] at note (66 v) for (2) beats :: extension

b. use hyphens to mark that the word is split into syllables (this also means there's a “lookahead” between blocks):

sing [dai-] at note (78 v) for (1) beats :: extension
sing [-sy] at note (75 v) for (1) beats :: extension
sing [dai-] at note (71 v) for (1) beats :: extension
sing [-sy] at note (66 v) for (1) beats :: extension
sing [give] at note (68 v) for (0.333) beats :: extension
sing [me] at note (70 v) for (0.333) beats :: extension
sing [your] at note (71 v) for (0.333) beats :: extension
sing [an-] at note (68 v) for (0.666) beats :: extension
sing [-swer] at note (71 v) for (0.333) beats :: extension
sing [do] at note (66 v) for (2) beats :: extension

c. defenestrate this English spelling nonsense and use something like X-SAMPA or the IPA for the lyrics:

sing [d'eI] at note (78 v) for (1) beats :: extension
sing [zi] at note (75 v) for (1) beats :: extension
sing [d'eI] at note (71 v) for (1) beats :: extension
sing [zi] at note (66 v) for (1) beats :: extension
sing [g'Iv] at note (68 v) for (0.333) beats :: extension
sing [m'i:] at note (70 v) for (0.333) beats :: extension
sing [j'U@] at note (71 v) for (0.333) beats :: extension
sing [\{n] at note (68 v) for (0.666) beats :: extension
sing [s3] at note (71 v) for (0.333) beats :: extension
sing [d'u:] at note (66 v) for (2) beats :: extension
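
For option b, the “lookahead” could look roughly like the Python sketch below. It's only an illustration under my own assumptions: `group_syllables` is a made-up helper, and I'm treating a trailing hyphen as "this word continues in the next block".

```python
def group_syllables(blocks):
    """Merge consecutive hyphen-marked syllables into whole words.

    `blocks` is a list of (lyric, note, beats) tuples, one per sing block.
    Returns a list of (word, [(note, beats), ...]) pairs.
    """
    words = []
    current = ""
    notes = []
    for lyric, note, beats in blocks:
        current += lyric.strip("-")      # drop the joining hyphens
        notes.append((note, beats))
        if not lyric.endswith("-"):      # no trailing hyphen: word is complete
            words.append((current, notes))
            current, notes = "", []
    if current:                          # dangling "dai-" with nothing after it
        words.append((current, notes))
    return words


song = [
    ("dai-", 78, 1), ("-sy", 75, 1),
    ("give", 68, 0.333),
    ("an-", 68, 0.666), ("-swer", 71, 0.333),
]
for word, notes in group_syllables(song):
    print(word, notes)
```

The catch is that nothing can be requested until the last syllable of a word has been seen, so the first `sing [dai-]` block can't start on its own.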

How would we handle edge cases like “singing” a sentence or two over a short period of time?

sing [A nutshell is the outer shell of a nut.] at note (127 v) for (0.001) beats :: extension

Or singing for a very long time?

sing [aaaaaa] at note (0 v) for (99999) beats :: extension
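
To put numbers on those two cases, here's a quick Python sketch of the arithmetic. I'm assuming a tempo of 60 bpm (which I believe is Scratch's default), and `beats_to_seconds` is just a helper I made up for this post.

```python
def beats_to_seconds(beats, tempo_bpm=60.0):
    """Convert a duration in beats to seconds at the given tempo."""
    return beats * (60.0 / tempo_bpm)


sentence = "A nutshell is the outer shell of a nut."
short = beats_to_seconds(0.001)   # the "whole sentence" case
long = beats_to_seconds(99999)    # the "aaaaaa" case

print(f"{len(sentence.split())} words in {short * 1000:.1f} ms")
print(f"one note held for {long / 3600:.1f} hours")
```

At that tempo the first block would have to fit nine words into about a millisecond, and the second would hold one note for almost 28 hours.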