Deep Autotuner: A data-driven approach to natural-sounding pitch correction for singing voice in karaoke performances

Sanna Wager (Ph.D. candidate, Indiana University SICE), George Tzanetakis (University of Victoria, Department of Computer Science; Smule Inc.), Cheng-i Wang (Smule, Inc.), Lijiang Guo (Ph.D. candidate, Indiana University SICE), Aswin Sivaraman (Ph.D. student, Indiana University SICE), Minje Kim (Indiana University SICE)


Keeping the blue notes in the Blues

This page contains audio examples for our paper on pitch correction of solo voice in a karaoke setting. The project attracted the attention of news outlets in the UK, as described in the SICE news release. The proposed approach does not need a musical score: it predicts the amount of correction for each note from the relationship between the spectral contents of the vocal and accompaniment tracks. The pitch shift in cents suggested by the model can then be applied to make the voice sound in tune with the accompaniment. This approach differs from commercially used automatic pitch correction systems, which shift notes in the vocal track so that they are centered on the notes of a user-defined score, or map them to the closest of the twelve equal-tempered scale degrees. The proposed data-driven approach tries to preserve the nuanced variations of sung pitch while estimating only the amount of unintended pitch shift. This is particularly important in genres like Blues, where singers often center their pitch away from the equal-tempered scale degrees.
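To make the units concrete, here is a minimal Python sketch of how a shift in cents relates to a frequency ratio, and of the "snap to the nearest equal-tempered pitch" behavior that the paragraph above contrasts with our approach. The function names and the 440 Hz reference are assumptions made for illustration; this is not the code of any commercial system, nor of our model.

```python
import math

A4_HZ = 440.0  # reference tuning; an assumption, not taken from the paper


def cents_to_ratio(cents: float) -> float:
    """Frequency ratio for a pitch shift in cents (1200 cents = 1 octave)."""
    return 2.0 ** (cents / 1200.0)


def deviation_from_equal_temperament(freq_hz: float) -> float:
    """Signed distance in cents from the nearest equal-tempered pitch."""
    semitones_from_a4 = 12.0 * math.log2(freq_hz / A4_HZ)
    return 100.0 * (semitones_from_a4 - round(semitones_from_a4))


def snap_to_equal_temperament(freq_hz: float) -> float:
    """Score-free 'closest scale degree' correction, shown for contrast only."""
    correction_cents = -deviation_from_equal_temperament(freq_hz)
    return freq_hz * cents_to_ratio(correction_cents)


# A note sung 30 cents sharp of A4 is pulled all the way back to 440 Hz,
# whether or not the singer intended the deviation.
sharp_a4 = A4_HZ * cents_to_ratio(30.0)
snapped = snap_to_equal_temperament(sharp_a4)
```

Snapping of this kind removes every deviation from the grid, intended or not, which is the behavior the data-driven approach tries to avoid.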

To train the model, we needed a large number of out-of-tune performances paired with their in-tune counterparts. We used 4702 in-tune performances from Smule, Inc. and shifted every note randomly by up to a semitone (100 cents). The program then learned to predict these shifts. On this page, we show results on validation examples: performances that the program was never trained on. However, these validation examples were also de-tuned using our random shift algorithm. For this reason, we also show results on real-world performances that we did not process in any way.
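The de-tuning step can be sketched roughly as follows. This is a simplified illustration, not our actual data pipeline: the uniform distribution, the sign convention for the target, and the names `detune_notes` and `MAX_SHIFT_CENTS` are assumptions. The text above only states that each note is shifted randomly by up to 100 cents.

```python
import numpy as np

MAX_SHIFT_CENTS = 100.0  # one semitone, as stated in the text


def detune_notes(num_notes: int, rng: np.random.Generator):
    """Draw one random shift per note and the correction that undoes it.

    Assumption: shifts are uniform in [-100, 100] cents. The page does not
    specify the distribution.
    """
    applied = rng.uniform(-MAX_SHIFT_CENTS, MAX_SHIFT_CENTS, size=num_notes)
    target = -applied  # shift that would restore the in-tune note
    return applied, target


rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
applied, target = detune_notes(5, rng)
```

Because the shift applied to each note is known, every de-tuned training example comes with an exact ground-truth correction at no labeling cost.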

Fig 1. Comparison of the manual pitch shifts (in cents) against the raw pitch information from the audio track. The network-predicted shifts are compared against the ground truth shifts on a validation performance (top). Every straight line segment represents a single scalar prediction per note, and a note spans several time-frequency frames. The raw pitch comparison shows that the network-predicted (autotuned) track is much closer in raw pitch to the original in-tune recording than the programmatically de-tuned input is (bottom).
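The flat line segments in the figure follow from the one-value-per-note design: a single predicted shift is held constant over all frames of its note. A minimal sketch of that expansion, with hypothetical shift values and frame counts chosen only for illustration:

```python
import numpy as np


def per_note_to_per_frame(note_shifts_cents, note_frame_counts):
    """Repeat each note's scalar shift across the frames that note spans."""
    shifts = np.asarray(note_shifts_cents, dtype=float)
    return np.repeat(shifts, note_frame_counts)


# Three hypothetical notes lasting 3, 2 and 4 frames.
frame_shifts = per_note_to_per_frame([12.0, -40.0, 5.0], [3, 2, 4])
```

Plotted over time, `frame_shifts` gives one horizontal segment per note, as in the top panel.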

Audio examples

Validation set

This first set of examples shows the model's predictions on a sample of our validation set of 500 performances. The de-tuned audio was synthesized from in-tune audio, making it possible for us to measure the accuracy of the predictions.
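Since the applied shifts are known for these examples, prediction accuracy can be measured directly in cents. The sketch below uses the mean absolute error as one plausible measure; the choice of metric and the function name are assumptions, and the numbers are made up for illustration rather than taken from our results.

```python
import numpy as np


def mean_absolute_error_cents(predicted, ground_truth) -> float:
    """Average absolute difference, in cents, between per-note shifts."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    if predicted.shape != ground_truth.shape:
        raise ValueError("predicted and ground_truth must have the same shape")
    return float(np.mean(np.abs(predicted - ground_truth)))


# Hypothetical per-note shifts in cents (illustrative values only).
error = mean_absolute_error_cents([10.0, -35.0, 60.0], [15.0, -30.0, 50.0])
```

No such measurement is possible for unprocessed real-world recordings, because no ground-truth shift exists for them.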

The audio examples are in groups of three. The first is the original, in-tune performance. The second is the de-tuned version that is input to the program. The third is the output of the program after applying the predicted corrections. The artifacts in the shifted audio have two causes: we used a simple phase vocoder for pitch shifting, and the notes were not always parsed correctly. For example, a single note might be split into two sections that are de-tuned independently, creating an unnatural discontinuity.

Original (in-tune)
Detuned (input)
Corrected (output)

Real-world audio examples

This second set of examples shows the model's predictions on real (not synthesized) performances. We are currently developing a way to evaluate the quality of the predictions in this situation. For now, listeners can subjectively decide which version they prefer. The program was never trained on the arrangements used in these performances: the arrangements are used in the validation set, with different performances.

We were quite happy with the results on the first three performances. The shifted versions sound accurate overall. The program sometimes offers subtle adjustments, such as making the pitch slightly sharper, which creates a more brilliant sound.

Original (in-tune)
Corrected (output)

In the following five performances, some sections sound better than the original, some sound as good, and some sound worse. Without a musical score, it is difficult to avoid occasional errors where the program predicts a completely wrong pitch. These errors make the overall performance sound inaccurate even though many notes sound better. A positive aspect of the program that can be heard in some examples is that it accommodates harmonization and improvisation. The results here are somewhat less reliable than the validation results, which shows that the synthesized training data differs enough from real-world data that the program is not always prepared for natural singing. For example, during training the program only sees shifts of up to a semitone.

The predictions for these final two performances are not very good!