Extending LLMs for Acoustic Music Understanding — Models, Benchmarks, and Multimodal Instruction Tuning
Speaker: Yinghao Ma (Queen Mary University of London)
Time: 2025-05-15, 16:00–17:30
Venue: TBD
Abstract: Large Language Models (LLMs) have transformed learning and generation in text and vision, yet acoustic music, an inherently multimodal and expressive domain, remains underexplored. In this talk, I present recent progress in leveraging large-scale pre-trained models and instruction tuning for music understanding. I introduce MERT, a self-supervised acoustic music model with over 10k monthly downloads, and MARBLE, a universal benchmark for evaluating music audio representations. I also present MusiLingo, a system that aligns pre-trained models across modalities to support music captioning and question answering. To address the gap in evaluating instruction-following capabilities, I propose CMI-Bench, the first benchmark designed to test models' ability to understand and follow complex music-related instructions across audio, text, and symbolic domains. I conclude by discussing open challenges in the responsible deployment of generative music AI.
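Since the abstract notes that MERT is distributed as a widely downloaded pre-trained model, a minimal sketch of extracting music representations with it may help situate the talk. This assumes the checkpoint published on the Hugging Face Hub as m-a-p/MERT-v1-95M and the generic transformers AutoModel interface; the file name and pooling choice are illustrative, so consult the model card for the exact usage.

```python
# Minimal sketch: extracting acoustic music representations with MERT.
# Assumes the Hugging Face checkpoint "m-a-p/MERT-v1-95M"; check the
# model card for the exact interface and expected sampling rate.
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

model_id = "m-a-p/MERT-v1-95M"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained(model_id, trust_remote_code=True)

# Load a music clip (hypothetical file) and resample to the extractor's rate.
waveform, sr = torchaudio.load("example_clip.wav")
waveform = waveform.mean(dim=0)  # mix down to mono
if sr != processor.sampling_rate:
    waveform = torchaudio.functional.resample(waveform, sr, processor.sampling_rate)

inputs = processor(waveform.numpy(), sampling_rate=processor.sampling_rate,
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Layer-wise hidden states; downstream tasks typically pool or weight these.
hidden_states = torch.stack(outputs.hidden_states)  # (layers, batch, time, dim)
clip_embedding = hidden_states.mean(dim=[0, 2])     # simple average pooling
print(clip_embedding.shape)
```

Benchmarks such as MARBLE evaluate exactly these kinds of frozen layer-wise representations by training lightweight probes on top of them for downstream music tasks.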
Speaker Bio: Yinghao Ma is a PhD student in the Artificial Intelligence and Music (AIM) programme at the Centre for Digital Music (C4DM), Queen Mary University of London, supervised by Dr. Emmanouil Benetos and Dr. Chris Donahue of CMU. His research explores the intersection of music understanding and large-scale pre-trained models, with a focus on multimodal learning, instruction tuning, and model evaluation. He co-developed MERT, a self-supervised model for acoustic music understanding, and MARBLE, a benchmark for evaluating universal music audio representations. More recently, he introduced CMI-Bench for music instruction following and MusiLingo, a multimodal alignment system for music-language tasks. Beyond research, he co-founded the Multimodal Art Projection (MAP) community and previously served as student conductor of the Chinese Music Institute orchestra at Peking University. His long-term goal is to build foundation models for multimodal music understanding while addressing fairness, safety, and creative integrity in generative systems.