Publications

Publications by category in reverse chronological order. Generated by jekyll-scholar.

2025

  1. CrossMuSim: A cross-modal framework for music similarity retrieval with LLM-powered text description sourcing and mining
    Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, and Simon Lui
    May 2025
    arXiv:2503.23128 [cs] Summary: This paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs’ comprehensive music knowledge to generate contextually rich descriptions.

2024

  1. End-to-End Automatic Singing Skill Evaluation Using Cross-Attention and Data Augmentation for Solo Singing and Singing With Accompaniment
    Yaolong Ju, Chun Yat Wu, Betty Cortiñas Lorenzo, Jing Yang, Jiajun Deng, Fan Fan, and Simon Lui
    In Proceedings of the 25th International Society for Music Information Retrieval Conference, Nov 2024
  2. Multi-View Midivae: Fusing Track- and Bar-View Representations for Long Multi-Track Symbolic Music Generation
    Zhiwei Lin, Jun Chen, Boshi Tang, Binzhu Sha, Jing Yang, Yaolong Ju, Fan Fan, Shiyin Kang, Zhiyong Wu, and Helen Meng
    In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024
    Summary: Objective and subjective experimental results demonstrate that, compared to the baseline, Multi-view MidiVAE exhibits significant improvements in modeling long multi-track symbolic music.
  3. Cycle Frequency-Harmonic-Time Transformer for Note-Level Singing Voice Transcription
    Yulun Wu, Yaolong Ju, Simon Lui, Jing Yang, Fan Fan, and Xuhao Du
    In 2024 IEEE International Conference on Multimedia and Expo (ICME), Jul 2024
    Summary: A novel 3D Cycle Frequency-Harmonic-Time Transformer (CFT) is proposed to explicitly capture the harmonic series of singing voices, where a tokenization scheme is defined that captures harmonics across multiple octaves, then the harmonic features are aggregated into the frequency-harmonic-time representations via a cyclic architecture.
  4. Efficient adapter tuning for joint singing voice beat and downbeat tracking with self-supervised learning features
    Jiajun Deng, Yaolong Ju, Jing Yang, Simon Lui, and Xunying Liu
    In Proceedings of the 25th International Society for Music Information Retrieval Conference, Nov 2024
    Summary: A novel temporal convolutional network-based beat-tracking approach featuring self-supervised learning representations and adapter tuning is proposed to track the beat and downbeat of singing voices jointly.

2023

  1. Improving Automatic Singing Skill Evaluation with Timbral Features, Attention, and Singing Voice Separation
    Yaolong Ju, Chunyang Xu, Yichen Guo, Jinhu Li, and Simon Lui
    In 2023 IEEE International Conference on Multimedia and Expo (ICME), Jul 2023
    Summary: This paper proposes a more general ASSE model that applies to both solo singing and singing with accompaniment; it employs an existing singing voice separation tool for accompaniment removal and compares ASSE models trained with and without accompaniment.

2022

  1. VocEmb4SVS: Improving Singing Voice Separation with Vocal Embeddings
    Chenyi Li, Yi Li, Xuhao Du, Yaolong Ju, Shichao Hu, and Zhiyong Wu
    In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Nov 2022
    Summary: VocEmb4SVS, an SVS framework that utilizes vocal embeddings of the singer as auxiliary knowledge for separation conditioning, is proposed and achieves state-of-the-art performance on the MUSDB18 dataset.
  2. AnimeTAB: A new guitar tablature dataset of anime and game music
    Yuecheng Zhou, Yaolong Ju, and Lingyun Xie
    Oct 2022
    arXiv:2210.03027 [cs] Summary: This paper presents AnimeTAB, a fingerstyle guitar tablature dataset in MusicXML format, which provides more high-quality guitar tablature for both researchers and guitar players. An accompanying analysis toolkit, TABprocessor, is included to further facilitate its use.

2021

  1. Addressing ambiguity in supervised machine learning: A case study on automatic chord labelling
    Yaolong Ju
    McGill University, Oct 2021

2020

  1. Automatic Chord Labelling: A Figured Bass Approach
    Yaolong Ju, Sylvain Margot, Cory McKay, and Ichiro Fujinaga
    In Proceedings of the 7th International Conference on Digital Libraries for Musicology, Oct 2020
    Summary: This paper proposes a series of four rule-based algorithms that automatically generate chord labels for homorhythmic Baroque chorales based on both figured bass annotations and the musical surface, which are applied to the existing Bach Chorales Figured Bass dataset.
  2. Automatic Figured Bass Annotation Using the New Bach Chorales Figured Bass Dataset
    Yaolong Ju, Sylvain Margot, Cory McKay, Luke Dahn, and Ichiro Fujinaga
    In Proceedings of the 21st International Society for Music Information Retrieval Conference, Oct 2020
  3. Figured Bass Encodings for Bach Chorales in Various Symbolic Formats: A Case Study
    Yaolong Ju, Sylvain Margot, Cory McKay, and Ichiro Fujinaga
    In Proceedings of the Music Encoding Conference, Oct 2020
  4. Data Quality Matters: Iterative Corrections on a Corpus of Mendelssohn String Quartets and Implications for MIR Analysis
    Jacob Degroot-Maggetti, Timothy Reuse, Laurent Feisthauer, Samuel Howes, Yaolong Ju, Suzaka Kokubu, Sylvain Margot, Néstor Nápoles López, and Finn Upham
    In International Society for Music Information Retrieval Conference (ISMIR 2020), Oct 2020

2019

  1. An Interactive Workflow for Generating Chord Labels for Homorhythmic Music in Symbolic Formats
    Yaolong Ju, Samuel Howes, Cory McKay, Nathaniel Condit-Schultz, Jorge Calvo-Zaragoza, and Ichiro Fujinaga
    In Proceedings of the 20th International Society for Music Information Retrieval Conference, Oct 2019

2018

  1. A Flexible Approach to Automated Harmonic Analysis: Multiple Annotations of Chorales by Bach and Prætorius
    Nathaniel Condit-Schultz, Yaolong Ju, and Ichiro Fujinaga
    In Proceedings of the 19th International Society for Music Information Retrieval Conference, Oct 2018

2017

  1. Non-chord Tone Identification Using Deep Neural Networks
    Yaolong Ju, Nathaniel Condit-Schultz, Claire Arthur, and Ichiro Fujinaga
    In Proceedings of the 4th International Workshop on Digital Libraries for Musicology - DLfM ’17, Oct 2017
    Summary: The results suggest that DNNs offer an innovative and promising approach to tackling the problem of non-chord tone identification, as well as harmonic analysis.

2012

  1. K-means initial clustering center optimal algorithm based on Kruskal
    Lan Huang, Shixian Du, Yu Zhang, Yaolong Ju, and Zhuo Li
    J. Inf. Comput. Sci., Oct 2012