MuseCoco: Generating Symbolic Music from Text

Abstract

Due to the inherent ease of textual input for user engagement, it is natural to generate music from text. In this paper, we introduce MuseCoco (Music Composition Copilot), a system meticulously designed to compose symbolic music from text descriptions. It operates by utilizing musical attributes as a bridge and dividing the process into text-to-attribute understanding stage and attribute-to-music generation stage, which bestows three key advantages: First, it eliminates the need for paired text-to-music data, as text-attribute pairs for the text-to-attribute understanding stage can be automatically synthesized as many as needed, and attributes required for the attribute-to-music generation stage can be directly extracted from music data, thus alleviating the labor-intense process of human annotation. Second, thanks to the explicit attribute design, the system excels in offering precise control over musical attributes, which ensures a high degree of accuracy in shaping the musical output according to user's intentions. Third, it can offer an additional attribute-conditioned control option beyond textual input, enhancing its versatility and usability. Our experimental results demonstrate that MuseCoco significantly outperforms our top-performing baseline model, GPT-4, on musicality, controllability, and overall score, by 45.5%, 35.2%, and 47.0%, respectively. There is also a notable enhancement of approximately 20% in objective control accuracy. Additionally, we have developed a large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality. In practical applications, MuseCoco can serve as a user-friendly tool for musicians, enabling them to effortlessly generate music by simply providing text descriptions, and offering a substantial enhancement in efficiency compared to manually composing music from scratch.

Figure 1: The two-stage framework of MuseCoco. Text-to-attribute understanding extracts diverse musical attributes, based on which symbolic music is generated through the attribute-to-music generation stage.

Please note: this work focuses on generating symbolic music, which do not contain timbre information. We use MuseScore (https://musescore.org/en) to export mp3 files for reference, musicians are encouraged to leverage the .mid files for further timbre improvement.

Comparison with Baselines

Group 1

Text Descriptions: The combination of moderate speed, balanced rhythm, and major key, along with its 4/4 time signature, makes this song a prime representation of the classical style. The song structure is made up of about 15 bars. The music is influenced by Bach. The music is given its sound through piano. The use of a specific pitch range of 3 octaves creates a cohesive and unified sound throughout the musical piece. The track lasts for about 40 seconds.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated sample from BART demonstrates inadequate control over attributes such as time signature, duration, and octaves. Furthermore, the sample's musicality suffers due to a lack of harmonies, resulting in an incomplete composition.

GPT-4: The generated music from GPT-4 is controlled well. However, it tends to sound monotonous since the rhythm patterns lack of variations, which is common in most generated music from GPT-4. Furthermore, some counterpoints in the composition, specifically in the 4th and 10th bars, lack well-designed integration, detracting from the overall quality of the piece.

MuseCoco: The generated music from MuseCoco is controlled precisely. Besides, it has more complex harmonies and more diverse rhythm patterns for a better listening experience.



Group 2

Text Descriptions: 2/4 is the meter of the music and should feature grand piano. The music is composed of a dense and intricate arrangement of notes, resulting in a captivating and memorable experience with the choice of minor key. The song embodies the characteristics of classical style. 15 seconds is a relatively long time for a song, but the dynamic and diverse pitch range spanning 4 octaves makes the music an interesting and engaging listening experience.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music suffers from inaccuracies in its control of time signature, duration, octaves, and note density. Furthermore, the melody line lacks variation, resulting in a monotonous composition with repetitive rhythm patterns and harmonic progressions.

GPT-4: The generated music lacks precision in controlling duration, octaves, and note density. As a result, the composition suffers from a monotonous melody characterized by simplistic rhythms and a lack of harmonies.

MuseCoco: The generated music demonstrates precise control, particularly notable in the dense arrangement of notes. This intricate composition style contributes to its memorability and adds an interesting and captivating quality to the music.


Group 3

Text Descriptions: The moderate tempo and conventional classical roots of this song, along with its 16 ~ 30 second running time and use of major key, creates a rich and dynamic sonic palette with a limited pitch range of 4 octaves that allows for a greater emphasis on the nuances of tone and phrasing. The music is in 3/4. The use of piano is vital to the music. The music follows in Chopin's footsteps.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music deviates from the attributes specified in the text description, such as time signature, duration, and octaves. Furthermore, the style of the music does not resemble that of Chopin. Additionally, the abnormal usage of rests within each voice part raises doubts about the composition's coherence and musicality.

GPT-4: The generated music falls short in multiple aspects. It does not span 4 octaves as specified, and it fails to capture the essence of Chopin's style. The melody lacks variation, resulting from limited rhythm patterns and harmonies. Additionally, the inclusion of two identical tracks introduces redundancy and detracts from the overall composition.

MuseCoco: The generated music demonstrates precise control. Specifically, the 5th and 6th bars skillfully capture the distinct essence and emotive qualities associated with Chopin's style, effectively conveying the desired musical expression.


Group 4

Text Descriptions:The music is slow and contemplative, with a pitch range of 3 octaves adding a distinctive character and emphasizing its emotional depth. The song's length is around about 14 bars, and it is not dance-friendly due to its style. The minor key adds a unique flavor, and grand piano, bass, voice, flute and drum should be included in the music. The time signature of the music is 3/4. The song is in an intense tempo with a moderate pace of about 30-seconds. The music is a quintessential example of religious genre.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music is monophonic with the piano and lacks the bass, voice, flute and drum. And it has the wrong time signature of 6/8.

GPT-4: The generated music is in the incorrect pitch range (1 octaves) and tempo (very slow). And its melody is monotonous and repeated.

MuseCoco: The generated music from MuseCoco is controlled precisely. The moderato chanting presents a peaceful religious style.


Group 5

Text Descriptions:The song's fast-paced rhythm, unique pitch range of 4 octaves, and use of grand piano create a memorable and energetic listening experience, while the 4/4 and the minor key give it an emotional quality that is similar to Brahms's style. This song is a true representation of the classical genre. The running time of the song is 16 ~ 30 seconds.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music lasts too long with a monophonic melody and does not resemble the style of Brahms.

GPT-4: The tempo of generated music from GPT-4 is too slow and the pitch ranges less in one octave, which are not corresponding to the text. And the melody is too simple to relate to Brahms.

MuseCoco: The generated music demonstrates precise control. Specifically, it is of the 4/4 time signature, in minor and with a fast tempo. The development of the multi-layed piano shows the Brahms' style.


Group 6

Text Descriptions:The moderate tempo and conventional classical roots of this song, along with its about 20 second running time and use of major key, creates a rich and dynamic sonic palette with a limited pitch range of 3 octaves that allows for a greater emphasis on the nuances of tone and phrasing. The concert is a true celebration of Schubert's legacy. The stage is set with piano, and the beat of this song is extremely strong.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music is too long to be corresponding with the text. And it consists of discrete notes and a boring melody, which can not express the classical style of Schubert.

GPT-4: The pitch range of generated music is limited in only one octave. It simple melody lacks the classical style from Schubert.

MuseCoco: The generated music is controlled perfectly. It utilizes the piano with a moderate tempo in major, which presents a relaxed style of the 3/4 meter like the Schubert's minuetto.

Generation Diversity

We test the generation diversity with the same text conditions.
Sample 1

Text Description:

Music is representative of the typical pop sound and spans 13 ~ 16 bars, this is a song that has a bright feeling from the beginning to the end. This song has a very fast and lively rhythm. The use of a specific pitch range of 3 octaves creates a cohesive and unified sound throughout the musical piece.

Generated Samples:



Explanations:

Those generated music samples exhibit the characteristics of typical popular music. Each sample lasts approximately 16 bars and has distinct tempos of 114 BPM, 102 BPM, and 108 BPM respectively. The melodies span approximately 4 octaves, contributing to the overall range of the piece. Throughout the composition, a major chord progression is employed, creating a consistent and uplifting atmosphere from start to finish. This tonal choice effectively captures the essence of a beautiful, bright day accompanied by a clear blue sky, evoking a sense of positivity and joy.


Sample 2

Text Description:

3/4 is the time signature of the music. This song is unmistakably classical in style. This music is low-tempo. The use of piano is vital to the music.

Generated Samples:



Explanations:

Those generated music follow a 3/4 meter with a slow tempo of around 60-70 BPM. The use of piano, harmonic progression and use of counterpoints features its classical styles.


Sample 3

Text Description:

The flute creates a lively and upbeat atmosphere in this fast-paced song with just the right tempo. The time signature of this song is not usual. The music is enriched by oboe, trumpet, flute and tuba.

Generated Samples:



Explanations:

In the generated samples, oboe, trumpet, flute, and tuba are utilized, with the flute playing a prominent role. The music features an uplifting melody, harmonies derived from major chords and their extensions, and a lively rhythm. Notably, the music deviates from the common-practice tradition by employing the unusual 2/2 time signature.


Sample 4

Text Description:

The song's upbeat melody and swift pace, paired with its 3 octave pitch range and 3/4 time signature, offers a diverse and dynamic listening experience. Strings, drum, clarinet and flute are utilized in the musical performance. The song's length is around about 14 bars.

Generated Samples:



Explanations:

In addition to the suggested instrumentation, the generated samples feature additional instruments such as the harp, enhancing the expressiveness of the music. These samples adhere to the given specifications, spanning 4 octaves in range, utilizing a 3/4 time signature, maintaining a fast tempo, and consisting of 14 bars. These musical choices contribute to the overall character and energy of the compositions, creating an engaging and dynamic listening experience.


Sample 5

Text Description:

This song has a steady and moderate rhythm. The music is given its sound through saxophone, drum and trombone. You can hear about 14 bars in this song. 2/4 is the time signature of the music.

Generated Samples:



Explanations:

Controlled by Specific Attributes

The model can also take the specific attribute values rather than text descriptions as the conditions to generate the corresponding music.
Attribute Value Generated Samples

Rhythm Intensity: moderate

Bar: 16

Time Signature: 4/4

Tempo: moderato

Time: 30-45s

Instrument: violin

Rhythm Intensity: intense

Tempo: slow

Time: >60s

Artist Style: Bach

Instrument: piano drum bass

Tempo: moderato

Description Length

This section demonstrates the model ability to handle differenet description length.
Short Descriptions

Text Descriptions: This song with a moderate beat is characterized by quiet and has a duration of about 14 bars.

Text Descriptions: This 31 ~ 45-second song embodies the essence of vocal music.

Text Descriptions: The song's sound is steeped in the conventions of religious style, and its pitch range is within 5 octaves. The tempo of this song is moderate.


Long Descriptions

Text Descriptions: The music in this song is defined by its distinctive character, which is emphasized by its pitch range of 6 octaves, adding emotional depth. Furthermore, the major key adds a unique flavor to the music, while its powerful and driving beat contributes to its overall impact. The arrangement of the song omits the use of oboe and organ, and it is written in 4/4, which gives it a balanced beat. The music evokes a strong sense of gusto and lasts for 16 ~ 30 seconds, with a duration of 9 ~ 12 bars. The musical performance employs grand piano and strings.

Text Descriptions: The compact pitch range of 4 octaves results in a focused and impactful musical performance, while the choice of major key adds to the captivating and memorable experience. With a duration of about 40 seconds, the rapid tempo of this song combined with the distinct sound of grand piano, drum, guitar and bass creates its unique musical identity. The meter of the music follows 4/4, and it is played at a moderate tempo, further enhancing its overall effect. The song's style is defined by its religious influences, culminating in a compelling musical composition. The music consists of about 15 bars.

Comparison with MusicLM

While MusicLM primarily focuses on generating musical audio from text descriptions, a distinct contrast to our work which involves generating symbolic music from similar textual cues, we find value in presenting comparative results with MusicLM. This allows us to effectively highlight the strengths and limitations of both approaches.

The samples of MusicLM are generated with this link: https://aitestkitchen.withgoogle.com/experiments/music-lm

Comparison 1

Text Descriptions: 3/4 is the time signature of the music. This song is unmistakably classical in style. This music is low-tempo. The use of piano is vital to the music.

MusicLM MuseCoco

Comments: The musicality of the samples from MusicLM is close to live performance. However, it features 4/4 in time signature instead of 3/4 described in texts. The genre is more like jazz rather than classical.


Comparison 2

Text Descriptions: The song's upbeat melody and swift pace, paired with its 3 octave pitch range and 3/4 time signature, offers a diverse and dynamic listening experience. Strings, drum, clarinet and flute are utilized in the musical performance. The song's length is around about 14 bars.

MusicLM MuseCoco

Comments: The sample produced by MusicLM demonstrates commendable musicality, featuring a discernible yet somewhat subtle melody line. Notably, the time signature of the generated sample (4/4) differs from the provided description (3/4). Furthermore, it is noteworthy that the generated samples lack the presence of strings and flutes, elements explicitly outlined in the provided text.


Comparison 3

Text Descriptions: 2/4 is the meter of the music and should feature grand piano. The music is composed of a dense and intricate arrangement of notes, resulting in a captivating and memorable experience with the choice of minor key. The song embodies the characteristics of classical style. 15 seconds is a relatively long time for a song, but the dynamic and diverse pitch range spanning 4 octaves makes the music an interesting and engaging listening experience.

MusicLM MuseCoco

Comments: Both samples exhibit commendable control, but the MusicLM sample leaning more toward a major key tonality (should be minor in descriptiions). Additionally, the MusicLM-generated music resonates with a live performance quality, characterized by noticeable rubato, whereas the MuseCoco sample adheres to a more structured composition style.


Comparison 4

Text Descriptions: Music is representative of the typical pop sound and spans 13 ~ 16 bars, this is a song that has a bright feeling from the beginning to the end. This song has a very fast and lively rhythm. The use of a specific pitch range of 3 octaves creates a cohesive and unified sound throughout the musical piece.

MusicLM MuseCoco

Comments: Both samples are controlled well and have a good musicality performance.


Conclusion: The end-to-end methods encounter challenges: 1) They need lots of paired data, which is hard and costly to gather. 2) These methods compress control details into a single vector, making precise control tough. We tested controlling MusicLM (an end-to-end model) with our dataset and found it struggles with fine details like tempo and meter. Unlike end-to-end methods, our work emphasizes the composition process. It empowers users to finely control musical attributes during music creation, aligning with the iterative and expressive nature of composition. This dynamic collaboration between human creativity and AI assistance enriches the environment for music composition.