Multimodal Machine Learning for Spatiotemporal Identification of Sound Sources in the NYC Subway System

Project Sponsors

  • Iran R. Roman, Postdoctoral Associate, NYU CUSP
  • Juan P. Bello, Professor of Music Technology and Computer Science & Engineering, NYU CUSP


The NYC subway has a complex soundscape that can be loud enough to expose riders, and people living nearby, to harmful sound levels. The factors that give rise to these loud sound levels have not been carefully characterized, so it is not clear why some places and/or times of day are louder than others. In this project, we will apply machine learning to real video and audio recordings collected in the NYC subway to answer the question: how do station structure, train mechanics, and human activities in the NYC subway contribute to harmful sound levels?

Category: Urban Health

Project Description & Overview

The NYC subway system has a complex structure involving platforms and railways on parallel and overlapping floors. The resulting dynamic interaction of train traffic and human activities gives rise to a unique soundscape. Sound levels in the NYC subway can be loud and harmful to the human ear. However, it is not clear which sound sources (e.g., trains vs. loud music vs. construction) are the main contributors, or where and when they dominate. This targeted description of sound sources is essential for making recommendations to authorities so that sound levels can be made safe for the human ear.

Our research group has data sources that can help us characterize the spatiotemporal dynamics of sound activities in the NYC subway system. The data include detailed maps of subway station layouts and structures, videos collected by placing a camera at the front of a moving train, and sound pressure level recordings collected on subway platforms. This project will use state-of-the-art computer vision and machine listening models to quantify features in these data and systematically identify which sound sources result in harmful sound levels, and where and when they do.

The project has four goals: 1) analyze our corpus of spatially and temporally tagged sound pressure level recordings at different subway stations and identify all times and locations where sound pressure levels are harmful; 2) determine whether harmful sound pressure levels can be systematically attributed to the structural design of subway stations; 3) curate an audio-visual dataset of subway station activity with spatiotemporal annotations (using raw footage available on YouTube, collected by pedestrians at subway stations and by cameras placed at the front of moving trains); 4) use state-of-the-art computer vision and machine listening models to audio-visually characterize noise sources at subway stations with harmful noise levels.
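As a rough sketch of goal 1, the dense corpus of sound pressure level (SPL) readings could be thresholded into a sparse table of harmful readings, then summarized by station and time. The column names and toy values below are illustrative assumptions (the actual schema of the recordings is not specified here); the 85 dBA threshold is the NIOSH recommended exposure limit for an 8-hour workday and would be configurable:

```python
import pandas as pd

# NIOSH recommended exposure limit (8-hour time-weighted average), in dBA.
HARMFUL_DBA = 85.0

# Hypothetical schema: one row per SPL reading, tagged with station and time.
readings = pd.DataFrame({
    "station": ["Union Sq", "Union Sq", "Times Sq", "Times Sq"],
    "timestamp": pd.to_datetime([
        "2023-05-01 08:00", "2023-05-01 08:01",
        "2023-05-01 08:00", "2023-05-01 08:01",
    ]),
    "spl_dba": [78.2, 91.5, 88.0, 82.3],
})

# Keep only harmful readings: a sparse spatiotemporal representation.
harmful = readings[readings["spl_dba"] >= HARMFUL_DBA]

# Count how often each station exceeds the threshold, by hour of day.
summary = (
    harmful.assign(hour=harmful["timestamp"].dt.hour)
    .groupby(["station", "hour"])
    .size()
    .rename("n_harmful_readings")
    .reset_index()
)
print(summary)
```

Aggregating by station and hour like this is one simple way to expose the "where and when" structure of harmful levels before any modeling.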


Datasets

  • A database of sound pressure levels previously collected by our research group at subway stations.
  • 3D maps of subway station structures, readily available online.
  • YouTube videos with raw footage collected by pedestrians walking around subway stations, or by cameras placed at the front of trains as they travel along their scheduled lines.


Prerequisites

Students should be comfortable with Python and familiar with data analysis tools such as NumPy and pandas. A machine learning background and knowledge of statistical modeling are also desirable (e.g., basic classification models such as random forests, and train/test splits for evaluation).

Learning Outcomes & Deliverables

First, students will learn how to convert a dense corpus of sound pressure levels into a sparse spatiotemporal representation of harmful sound levels. Second, they will learn how to carry out statistical modeling to determine whether the spatiotemporal distribution of harmful sound levels can be attributed to the structural design of subway stations. Third, they will learn how to curate a high-quality audio-visual dataset that captures, across modalities, the activities taking place at subway stations with different noise levels. Fourth, they will learn how to apply existing models to this audio-visual dataset to quantify sound sources.
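The second outcome, attributing harmful levels to station design, could start with something as simple as a contingency-table test. The counts and the underground/elevated split below are hypothetical placeholders, chosen only to show the shape of the analysis:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = station type (underground, elevated),
# columns = (harmful readings, non-harmful readings).
table = [[120, 380],
         [40, 460]]

# Chi-square test of independence: is the rate of harmful readings
# associated with station type?
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3g}, dof={dof}")
```

A significant result would only indicate association, not a structural cause; richer models (e.g., regression on station attributes) would be needed for the attribution the project aims at.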