Radio Atlas

The first large-scale corpus of talk radio transcripts, including metadata and tools to explore and map radio.

The Radio Atlas project collects audio from a sample of talk radio stations around the country and automatically transcribes it, building a first-of-its-kind corpus of talk radio broadcasts. Combined with third-party metadata and a variety of tools to explore the radio content, Radio Atlas gives researchers a new level of visibility into an influential but understudied medium. Nearly all Americans have access to talk radio, and its regular audience, counting both conservative talk and public radio, numbers in the tens of millions. Given talk radio’s political influence and wide reach, we aim to enable additional research on its internal structure and connections to other media.

Date: December 2018 - Present
Type: Prototype Workshop

Motivating Context

Talk radio is an influential medium in US culture and especially politics, yet is understudied relative to its importance. The political impact of the medium has been noted at least as far back as the 1990s, as stars like Rush Limbaugh and Sean Hannity built national profiles. Nearly all Americans have access to talk radio broadcasts (counting both conservative talk and public radio), with Nielsen estimates of the regular audience in the tens of millions. The research community, however, has had very little visibility into radio. The lack of any corpus of content analogous to the closed-caption datasets for TV, and indeed the lack of any dataset of broadcasts at all, makes radio ephemeral, and makes research difficult.


The Radio Atlas project addresses this lack of data, quite simply, by providing some. We select radio stations for inclusion in the corpus, based either on how representative they are of radio as a whole or on their relevance to our other research, and collect audio from their broadcasts. To avoid having to place radio receivers around the United States, we take advantage of the fact that most stations now stream their over-the-air broadcasts online simultaneously. The content collected from these streams is automatically transcribed, annotated with metadata, and stored for research use. We’ve published samples of this corpus, but are unable to release all content on a rolling basis for copyright reasons. In lieu of full data, we provide several tools and interfaces for researchers to explore, map and understand the radio ecosystem.
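
The collect, transcribe, annotate and store steps described above can be sketched roughly as follows. This is only an illustration of the general shape of such a pipeline, not the project's actual implementation, which is not described here. Every name in it (Segment, process_chunks, to_jsonl, fake_transcribe, the station identifier "STATION_A" and the metadata fields) is hypothetical, and the transcriber is a stand-in for whatever speech-to-text system is actually used.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Segment:
    """One transcribed chunk of audio from a station's online stream (hypothetical schema)."""
    station: str      # station identifier
    start_time: str   # ISO 8601 timestamp for the start of the chunk
    text: str         # automatic transcript of the chunk
    metadata: dict    # third-party metadata about the station


def process_chunks(
    station: str,
    chunks: Iterable[Tuple[str, bytes]],   # (start_time, raw audio bytes) pairs
    transcribe: Callable[[bytes], str],    # stand-in for any speech-to-text system
    station_metadata: Dict[str, dict],
) -> List[Segment]:
    """Transcribe each audio chunk and annotate it with station metadata."""
    meta = station_metadata.get(station, {})
    return [
        Segment(station, start, transcribe(audio), meta)
        for start, audio in chunks
    ]


def to_jsonl(segments: Iterable[Segment]) -> str:
    """Serialize segments as JSON Lines, one record per line, for storage."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in segments)


def fake_transcribe(audio: bytes) -> str:
    """Placeholder transcriber: reports the chunk size instead of recognizing speech."""
    return f"<{len(audio)} bytes of audio>"


# Usage with made-up data: one four-byte "audio" chunk from a hypothetical station.
segments = process_chunks(
    "STATION_A",
    [("2019-01-01T00:00:00Z", b"\x00" * 4)],
    fake_transcribe,
    {"STATION_A": {"format": "talk"}},
)
print(to_jsonl(segments))
```

Passing the transcriber in as a function keeps the sketch independent of any particular speech recognition system, and JSON Lines is just one plausible storage format for an append-only stream of transcript records.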