cplusj2024

This website serves as the informal Proceedings of the Computation + Journalism Symposium 2024. The Symposium was held at Northeastern University in Boston, MA, USA, on October 26–27, 2024.

The final program can be found here.

Accepted contributions

Search and discovery of local government meetings for journalists

Doug Beeferman and Nabeel Gillani

search engine, local news, local government, natural language processing

Since the onset of the COVID-19 pandemic, city council and school board meetings have been recorded and made available online much more frequently than before. The transcripts of these meetings provide a valuable data source for journalists, policymakers, and social science researchers in an environment in which local news media is otherwise eroding.

In this talk we will discuss CivicSearch (https://civicsearch.org), a search engine, discovery tool, and alerting platform that builds upon a research corpus of these transcripts called LocalView (https://localview.net). CivicSearch is designed to analyze government meetings that are posted online soon after they occur. Using recent advances in natural language processing, it automatically clusters the conversations within these meetings into subject areas, and for each subject area identifies a set of issues and problem areas that local policymakers are trying to solve. Users may search for or monitor individual keywords or entire subject areas, using large language model-based summarization to synthesize findings across jurisdictions. The system also identifies specific geographic points of interest in communities that are mentioned in the meetings, enabling users to keep track of how places near them are being discussed by policymakers.

CivicSearch currently includes the transcripts of about 650 cities and towns in the US and Canada. Journalists are using it to research stories and to monitor mentions of terms related to their beats. In our talk, we’ll discuss a few real-world use cases of the tool, including studies on chronic absenteeism in schools, drug treatment policy, and racial extremism.


A Case Study in an A.I.-Assisted Content Audit 📂

Rahul Bhargava, Elisabeth Hadjis and Meg Heckman

journalism, content audit, artificial intelligence

This paper presents an experimental case study utilizing machine learning and generative AI to audit content diversity in a hyperlocal news outlet, TS, based at a university and focused on underrepresented communities in Boston. Through computational text analysis, including entity extraction, topic labeling, and quote extraction and attribution, we evaluate the extent to which TS’s coverage aligns with its mission to amplify diverse voices. The results reveal coverage patterns, topical focus, and source demographics, highlighting areas for improvement in editorial practices. This research underscores the potential for AI-driven tools to support similar small newsrooms in enhancing content diversity and alignment with their community-focused missions. Future work envisions developing a cost-effective auditing toolkit to aid hyperlocal publishers in assessing and improving their coverage.


Participatory Journalism: Stakeholder Perspectives on Enhancing Online Discussion through Data Talk 📂

Cole Biehle, Ritvik Irigireddy, Lu Sun and Steven P. Dow

data talk, data visualization, journalism, online discussion, news website

Discussing data related to news is one form of participatory journalism. “Data talk” can promote data literacy, foster community around grounded understanding, and encourage civic engagement. What are the challenges and potential of scaffolding data talk on news websites? To understand current practices and perceptions around future visions for audience engagement in news, we interviewed 12 diverse stakeholders including journalists, data scientists, readers, and moderators. To provoke future thinking, we asked participants to react to our interactive design probe with three scenarios: questioning data on a chart, adding data views, and telling personal data stories. We identified several challenges stakeholders face with data talk, including the fast news cycle hindering thorough data discussions and the difficulty of creating accessible insights and visuals. Despite significant data wrangling, discussions around the data remain rare. Participants reacted positively to the interactive scaffolding features in the three design scenarios, noting that these features can make data an effective entry point for discussions, scaffold audience participation in the data pipeline, and lower the barrier to engagement.


Talk: The Speed and Sentiment of News in Twitter (X) vs. Radio

William Brannon and Deb Roy

News cycle, Twitter (X), Talk radio, Event detection, Outrage, Agenda-setting, Comparative media analysis

The rapid evolution of the Internet is reshaping the media landscape, with frequent claims of an accelerated and increasingly outraged news cycle. We test these claims empirically, investigating the dynamics of news spread, decay, and sentiment on Twitter (X) versus talk radio. Both platforms are significant – radio for its wide reach and political influence, and Twitter for its large user base and high concentration of journalists.

Analyzing 2019-2021 data including 517,000 hours of radio content and 26.6 million tweets by elite journalists, politicians, and general users, we used automated event detection to identify 1,694 news events. We find that news on Twitter circulates faster, fades faster, and is more negative and outraged compared to radio, with Twitter outrage also more short-lived. We believe this is the first large-scale comparison of outrage between Twitter and traditional media. These patterns are consistent across user types and robustness checks, including a breakdown by ideology, a separate set of manually identified events, and cross-medium event matching. We also find evidence that some of these differences stem from differences in the affordances of the media themselves, especially the types of stories each selects for.

Our results illustrate an important way social media may influence traditional media: framing and agenda-setting simply by speaking first. As journalism evolves with these media, news audiences may encounter faster shifts in focus, less attention to each news event, and much more negativity and outrage.

This work was recently published in Scientific Reports: https://doi.org/10/gtwggj


CuratedDDS: A Taxonomy and a Dataset of Data-Driven Stories to Support Journalists’ Inspiration 📂

Louri Glen Kae Compain and Thomas Hurtut

Communication, Data Driven Stories, Data Journalism, Examples, Ideation, Inspiration, Journalism, Workflow Design

So-called Data Driven Stories (DDS) are an increasingly popular online narrative format that combines text, media such as photos, and data visualizations. This format engages readers effectively due to its visual appeal and the trustworthiness conferred by data journalism. However, compared to more traditional formats, DDS require a thorough ideation and design process covering narrative structure and anticipated visualizations, which brings additional challenges. In this paper, we present CuratedDDS, a tool developed to support DDS designers by offering a curated archive of DDS examples, annotated using a proposed DDS taxonomy, and an exploratory interface. To evaluate the efficiency and impact of CuratedDDS for the target audience, we conducted a case study with a group of six media professionals.


AI Innovation in the Newsroom: A Design Thinking Workshop to Support New Approaches to Responsible AI 📂

Maximilian Eder

Design Thinking, Responsible AI, Journalism Practice

The main goal of this five-stage design thinking workshop, based on the approach initially developed at Stanford University, is to incorporate the participants’ specific points of view into responsible AI tools in local journalism, with the expected outcome of a low-fidelity prototype.

The author previously conducted such a workshop with students from the German School of Journalism and journalists, social media editors, and product managers from Rheinische Post, a local German news media organization.

A working space for at most twelve participants with several tables and a screen or projector is needed to conduct the workshop. The workshop would take between three and four hours.


Google Searches Regarding Politicians are Dominated by a Few Nonpartisan News Sources

Zhen Guo, Allison Wan, Kai-Cheng Yang and David Lazer

Google Search Engine, Source Concentration, News, Local News, Election

As the dominant platform for distributing political information, Google Search has sparked concerns about partisan bias, filter bubbles, and, particularly, source concentration. Although research to date has found minimal evidence of partisan bias or filter bubbles, this does not necessarily absolve Google of bias, especially regarding the distribution of sources. Previous studies have shown that search results related to news and elections are often limited to a small selection of domains, prompting questions about the diversity and inclusivity of the information provided.

This study examines the extent to which Google search results are concentrated by gathering daily data on the full names of 604 United States governors, House representatives, and election candidates from September 2020 to March 2021. Searches were simulated from all 435 congressional districts. We find that the top 10 domains account for 64% of the one billion total search results, underscoring the significant concentration of information sources. Furthermore, our analysis of partisan market share reveals a mildly different distribution of domain visibility depending on the party affiliation of the politician searched. Our results also reveal persistent variations in search results across locations, indicating that location plays a significant role in customizing results.
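To illustrate the kind of concentration measure described above, here is a minimal sketch in Python. The input file, column names, and grouping are hypothetical placeholders, not the study’s actual data or code:

```python
import pandas as pd

# Hypothetical results table: one row per collected search result, with the
# linked domain, the congressional district the search was simulated from,
# and the party of the politician searched. All names are illustrative.
results = pd.read_csv("search_results.csv")  # columns: domain, district, party

# Share of all results captured by the 10 most frequent domains.
domain_counts = results["domain"].value_counts()
top10_share = domain_counts.head(10).sum() / domain_counts.sum()
print(f"Top-10 domain share: {top10_share:.1%}")

# Domain visibility broken down by the searched politician's party affiliation.
party_share = pd.crosstab(results["party"], results["domain"], normalize="index")
print(party_share.iloc[:, :10])
```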

For the next steps, we plan to extend our investigation to evaluate the representation of local sources within Google Search results. We hope to reveal how well local media outlets surface in searches related to political figures, potentially illuminating the extent to which local perspectives are accessible in the digital information ecosystem.


Applying AI in Newsrooms: lessons from the Applied AI in Journalism Challenge

Bahareh Heravi

AI in Journalism, AI and News, Artificial Intelligence

This talk will explore the lessons learned from the Applied AI Journalism Challenge (AIJC)—an accelerator programme in which 12 newsrooms from across the globe developed practical generative AI applications. The programme ran over five months and was funded by the Open Society Foundation. It served both as a prototype to accelerate the pragmatic use of AI in newsrooms and as an innovative educational model for integrating technological advances in the newsroom. The programme involved a short training period, followed by ongoing and sustained mentoring during a rapid and competitive development phase. This method offered the participating newsrooms a deep dive into generative AI capabilities and practices, strengthened by regular stringent deadlines and expert coaching. The practical and mentor-driven nature of the programme’s structure played a crucial role in substantially enhancing the capabilities of each team.

This initiative proved highly successful, notably in fostering capable and motivated teams and engaging them actively as they developed significant new capabilities. The talk will provide an overview of the programme, highlighting the teams involved, their projects, the critical lessons learned, and potential future developments. Furthermore, it will explore the broader impact of such accelerator-type initiatives, emphasising their effectiveness as a method for improving AI literacy in newsrooms and their potential to shape the future of journalism.


The Challenges and Opportunities of Designing News Maps for Mobile Devices

Lily Houtman

cartography, maps, data journalism, mobile devices, mobile technology

This abstract is for a contributed talk. As the general public increasingly uses mobile devices in every aspect of daily life, designing data visualizations for mobile devices has become more important. Data journalism in particular is a key way science and other breaking news is communicated to the public, making attention to best practices and the study of these visuals particularly worthwhile. Designing news maps for mobile devices presents many challenges for data journalists specifically because they use spatial data. The primary challenges for mobile news maps fall into six key themes: responsive and mobile-first design; screen size, orientation, and resolution; generalization and complexity; post-WIMP (windows, icons, mouse, pointer) environments; technical accessibility; and individual accessibility. To understand how data journalists work through these challenges, I interviewed 18 news cartographers about the present and future of mobile map design, centering techniques and practices. While the participants offered many creative solutions to produce beautiful and informative data visualizations, I also identified a few lingering challenges that warrant additional discussion and research: types of interactivity, time constraints, simultaneous design, and the ability to conduct user testing.


Geographic Consistency: A new framework for measuring fluctuations and representation in local journalism

Stephen Jefferson

Local News, Spatial Data, Geography, Community Representation

This talk aims to share a new method for analyzing the representation of local communities by measuring changes in their news coverage over time. It will discuss how journalists, researchers, and developers can better assess representation by measuring local news consistency and fluctuations within specific regions — called geographic consistency. The speaker will define the mathematical model for geographic consistency and share results from a case study with three newsrooms published in 2024.

The lack of a shared understanding and standardized practice for measuring consistency in journalism has led newsrooms to make editorial decisions that inadvertently cause inconsistent local coverage that prevents trust-building, such as instances of parachute journalism. The intentional practice of geographic consistency aims to better stabilize these damaging fluctuations and to contextualize problems that may be limiting community engagement in local news.

From a broader perspective, this talk will highlight general lessons and advice for analyzing temporal and spatial patterns in local journalism based on the research team’s experience and past literature review. Similarly, it aims to illustrate how local news analysis can go beyond short-term case studies and become a model that’s integrated into newsroom workflows to have a more significant impact.


Contributed workshop: Overcoming Challenges in Implementing AI in the Newsroom

Nadia Kohler and Titus Plattner

AI implementation in newsroom, AI literacy, Technical dependencies, Risk mitigation, Future improvements of GenAI, Build or buy decision, Practical strategies, Case studies, Hands-on activities

Duration: 60-90 minutes

Objective: This workshop aims to provide participants with practical insights and strategies for navigating the challenges of implementing AI in the newsroom. Through interactive discussions, short case studies, and hands-on activities, attendees will share and learn how to address common obstacles and leverage AI to enhance their newsroom operations.

Group discussion to identify common challenges in AI implementation, such as

If both the talk (we also submitted a talk about Tamedia’s experience) and the workshop are accepted, the key takeaways from the workshop will be integrated into the talk.


Contributed Talk: Scaling GenAI Experimentation and Implementation in the Newsroom 📂

Nadia Kohler and Titus Plattner

Industry, Newsroom, GenAI, Technology, CMS, Text generation, Quality monitoring, Risk mitigation, Strategy

This talk describes how Tamedia scaled up Generative AI (GenAI) experimentation and implementation across its newsrooms. The session will provide an in-depth look at our cross-functional approach involving the Technology, Product, Journalism, and Sales teams. Attendees will gain insights into our test-and-learn methodology, iterative processes, and the current state of GenAI integration in our CMS and other systems, with concrete examples.

These examples range from content gathering and headline and text generation to quality monitoring. We can also share how we mitigate the risks inherent to AI, explore future directions for GenAI in our newsroom operations, and offer our strategic insights on its future potential.

By October 2024, we anticipate having even more valuable experiences and lessons to share.


Unveiling Disinformation Narratives with AI: Collaborative Insights from Fact-Checkers and Computer Scientists’ Work in Analyzing Climate Misinformation Narratives 📂

Irene Larraz, RamĂłn SalaverrĂ­a and Javier Serrano-Puche

Fact-checking, Disinformation narratives, Artificial intelligence, Computer Sciences, Journalism, Climate disinformation

Fact-checkers are transitioning from debunking falsehoods to analyzing disinformation narratives, aiming to uncover common themes and underlying messages within these pieces of misinformation. This shift seeks to piece together the puzzle of the disinformation ecosystem, providing a comprehensive view to better understand how false ideas propagate and their ultimate objectives. Artificial intelligence (AI) plays a pivotal role, facilitating the analysis of a vast array of misinformation messages and synthesizing key insights. This paper explores the collaborative efforts between fact-checkers and computer scientists through a case study focusing on the analysis of climate misinformation narratives following farmers’ protests in Europe within the Climate Facts Europe project, led by the EFCSN. The findings underscore the role of AI in assisting journalists to extract primary narratives and assess their impact over time.


Probing GPT-4 for Knowledge of Journalistic Tasks 📂

Charlotte Li and Nicholas Diakopoulos

Journalism Work, Task Taxonomy, Agentic AI, GPT-4, Future of Work

“What is an AI system’s comprehension of journalism tasks?” This is an important question to ask as conversations around building agents for use in newsrooms are advancing. In order for an AI system to serve as an agent for specific journalism tasks, it must have some understanding of the work of journalism and how tasks within it break down. In this paper, we assess the level of comprehension that GPT-4 has of journalism tasks using work activity and task descriptions from O*NET. We conduct a qualitative analysis of the output from GPT-4 and construct a journalism task taxonomy. We find that the output from GPT-4 covers the majority of descriptions in the baseline and offers new insights into journalistic tasks. We propose recommendations for future practitioner-centric research based on our results.


Evaluating GenAI Tools in Your Newsroom

Charlotte Li, Sachita Nishal and Nicholas Diakopoulos

Generative AI Evaluation, Benchmarking, User Studies, Generative AI in Newsrooms

Newsrooms today are adopting GenAI tools to improve the efficiency of existing workflows on various tasks in news production or exploring new reader experiences. However, practitioners also express reservations around this adoption, due to ethical concerns and technical challenges in evaluating these tools for journalism [1]. This partly stems from a lack of domain-specific strategies around creating low-cost, in-house AI use-case evaluations. Our workshop aims to empower practitioners and researchers to undertake such evaluations: by reviewing existing approaches to AI evaluation to lay a conceptual foundation, and by engaging news practitioners and researchers to brainstorm AI evaluation metrics for specific journalistic use-cases.

To achieve this, we will first teach attendees about existing AI evaluation approaches (e.g., benchmarking, user studies, etc.), what they offer, and where they may fall short for journalism use-cases. We will then introduce our framework for AI evaluation in journalism, which stresses concrete tasks, human-AI interactions, and ethical considerations to define success metrics [2]. Participants will then work in breakout groups to brainstorm success metrics and appropriate benchmark datasets to evaluate AI for specific use-cases, such as headline generation, summarization, etc. Afterwards, groups will share their ideas and receive feedback, and we will document these conversations for reference and to inform efforts to establish an open LLM performance benchmark for news.

For these activities, we require a physical space for ~30 people for ~90 minutes with internet access, a screen, and projector. Participants will also need to bring their own laptops to access online resources.

[1] Diakopoulos, N., Cools, H., Li, C., Helberger, N., Kung, E., Rinehart, A., & Gibbs, L. (2024). Generative AI in Journalism: The Evolution of Newswork and Ethics in a Generative Information Ecosystem. Associated Press. https://doi.org/10.13140/RG.2.2.31540.05765

[2] Nishal, S., Li, C., & Diakopoulos, N. (2024). Domain-Specific Evaluation Strategies for AI in Journalism. Workshop on Evaluating Interactive AI at CHI 2024. https://doi.org/10.48550/ARXIV.2403.17911


pollfinder.ai | Using Large Language Models to Help Newsrooms Aggregate Polls For the 2024 Election

Dhrumil Mehta, Aisvarya Chandrasekar and Ken Miura

election, public opinion, polling, large language models, artificial intelligence

Pollfinder.ai is a new project that is supported by the 2024 Brown Institute Magic Grant. The project utilizes large language models to help newsrooms aggregate election polls.

Newsrooms such as 538, RealClearPolitics, The Hill and The New York Times provide valuable feeds of aggregated pre-election polls that are used by the public and by journalists across the industry. But as we approach the 2024 election, the process of collecting those datasets is still painstakingly manual aside from a few automated scrapers of polls that publish consistently in a predictable location and format. Research assistants spend hours identifying new polls and entering them into a polling database. But LLMs can extract structured data out of unstructured formats.

The Columbia University-based team – including a former 538 staff member – is building a tool to automate portions of this data collection using LLMs. The system we are building uses multimodal LLMs to (1) sift through noisy keyword searches of content from places like X (formerly Twitter) and Google News Alerts to separate releases of polls from articles that reference them, (2) extract metadata – such as the start and end dates of the survey, the name of the pollster, and who sponsored the poll – from the press releases or poll PDFs, and (3) index questions in the polls to include more aggregated issue polling. In this session we will show our system architecture, edit processes, and our preliminary findings on how well LLMs perform at this task.
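As a rough illustration of step (2), the sketch below prompts an LLM to pull structured metadata out of a poll press release. The model name, prompt, and field names are illustrative assumptions, not the project’s actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative metadata fields; the real schema is more detailed.
FIELDS = ["pollster", "sponsor", "start_date", "end_date", "sample_size", "population"]

def extract_poll_metadata(press_release_text: str) -> dict:
    """Ask the model to return poll metadata as a JSON object (illustrative sketch)."""
    prompt = (
        "Extract the following fields from this poll press release and return them "
        f"as a JSON object with keys {FIELDS}. Use null for anything not stated.\n\n"
        + press_release_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example usage with a snippet of release text:
# print(extract_poll_metadata("Marist Poll, sponsored by NPR, surveyed 1,200 adults May 1-5..."))
```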


Workshop: Using LLMs To Help You Build Data Visualizations In D3

Dhrumil Mehta and Aarushi Sahejpal

data visualization, javascript, interactive graphics, large language models, artificial intelligence

Let’s make some interactive graphics with the help of Large Language Models! In this session, we’ll have a hands-on demonstration of ways to use LLM chatbots like ChatGPT to build charts, graphs, tables and other kinds of data visualizations including interactive graphics in D3.js. The session leaders will demonstrate a few ways that they have used ChatGPT to help build production graphics, and then attendees will build their own! No previous JavaScript experience is necessary, although having some coding experience will be helpful.

At the end of the workshop, we’ll have a short demo where everyone can share what they have built. We will also collect the charts that everyone makes alongside the initial prompts they used and a transcript of their conversation, which we will put together as an artifact that attendees can reference after the session.

Previous Iterations: This workshop is an iteration of a successful workshop that we presented at NICAR 2024. The materials for that workshop can be found at https://data4news.com.

Technical Requirements: We use GitHub Codespaces, so we just need computers with internet connections.


Data Journalism Teachers’ Club: A New Community For Data Journalism Educators

Dhrumil Mehta, Aarushi Sahejpal and Nausheen Husain

Education, Pedagogy, Science of Teaching and Learning (SoTL)

The Data Journalism Teachers’ Club is a new community of educators — professors, teachers, trainers, and everyone in between — who are interested in thinking about frameworks for effective and thoughtful ways to teach data skills to journalists by putting the profession and practice of data journalism in conversation with pedagogy and learning experts and scholars of teaching.

In this talk, we will briefly describe the aims of the new organization, make the case for integration of more formal pedagogical design practices in journalism education, and walk through some of what we have learned through the past year of conversations with pedagogy experts and texts, including our three-part course design workshop series over the summer.

Our goal is to support new, seasoned and interested data journalism teachers by creating a space for structured and informed conversations about the science of teaching and learning for our field.

More about us at: http://datajournalismteachers.club


A complex interplay: Exploring the (mis)match between audiences’ and news organizations’ assessments of news personalization 📂

Eliza Mitova

news recommenders, digital journalism, news personalization, user evaluations, journalistic AI

This presentation explores the adoption, perception, and impact of news recommender systems (NRS) in conversation with Lewis and Westlund’s (2015) “4As” matrix—actors, actants, audiences, and activities. Using a combination of qualitative interviews with news professionals in the Netherlands and Switzerland and a standardized survey across five countries (CH, NL, UK, US, PL), I identify both alignment and misalignment between the news professionals’ and audiences’ expectations and assessments of NRS. The presentation will highlight these findings and offer practical implications for the responsible design of NRS in view of technological, socio-cultural, media, and political system influences.


De-jargonizing Science for Journalists with GPT-4: A Pilot Study 📂

Sachita Nishal, Eric Lee and Nicholas Diakopoulos

computational journalism, science journalism, sense-making, large language models, personalization, jargon detection

This study offers an initial evaluation of a human-in-the-loop system leveraging GPT-4 (a large language model or LLM) and Retrieval-Augmented Generation (RAG) to identify and define jargon terms in scientific abstracts, based on readers’ self-reported knowledge. The system achieves fairly high recall in identifying jargon and preserves relative differences in readers’ jargon identification, suggesting personalization as a feasible use-case for LLMs to support sense-making of complex information. Surprisingly, using only abstracts for context to generate definitions yields slightly more accurate and higher-quality definitions than using RAG-based context from the full text of an article. The findings highlight the potential of generative AI for assisting science reporters, and inform future work on developing tools to simplify dense documents.


Predicting news deserts using supervised machine learning 📂

Arijit Paladhi

Journalism, News Deserts, Computation, ML, Modelling

The rapidly changing landscape of journalism has seen growing concern over news deserts – regions where local news coverage is sparse or non-existent. We also stand on the precipice of a new era of computational research in journalism, and this study proposes a new approach: leveraging Supervised Machine Learning (SML) to predict the emergence of these news deserts. Drawing from several comprehensive datasets, this project attempts to build predictive models, assessing a mix of social, economic, and political factors to determine a county’s risk of becoming a news desert. At the intersection of traditional journalism research and modern computational methods, we employ Logistic Regression and Random Forest algorithms to further our understanding of the growing challenge of news deserts.
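A minimal sketch of the modeling setup described above, using scikit-learn; the input file, feature names, and label column are hypothetical placeholders for the county-level datasets the project draws on:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical county-level table: social, economic, and political features
# plus a binary label marking counties that are already news deserts.
counties = pd.read_csv("county_features.csv")
X = counties[["median_income", "population", "broadband_access", "turnout_2020"]]
y = counties["is_news_desert"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=300, random_state=42)),
]:
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```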


Modeling Information Change in Science Communication with Natural Language Processing

Jiaxin Pei

Science Communication, NLP, Computational Journalism

Public trust in science depends, in part, on journalists accurately reporting the latest scientific advances. Reporting on science requires careful effort to faithfully translate academic jargon into descriptions more accessible to the general public. In this talk, I will mainly describe two of my projects on quantifying information change in science communication. In the first part, I focus on the certainty with which journalists describe scientific findings compared with scholars. By building new computational models and analyzing a large-scale science communication dataset, I show how both scholars and journalists vary in how and when findings are described with certainty. In the second part, I focus on the more general task of measuring whether the information in a finding changes in journalistic descriptions. I will introduce a new resource and model for aligning scientific findings across news stories, social media discussions, and the full texts of academic papers. I will show how this new resource can improve downstream performance on evidence retrieval for fact-checking of real-world scientific claims and, through applying the model to millions of science news reports, can reveal large-scale trends in the degrees to which people and organizations faithfully communicate new scientific findings. At the end of the talk, I will discuss my ongoing efforts in building human-centered language technologies for effective science communication.


Social Network Analysis for Sports and Investigative Journalism Workshop

Hong Qu and Julian Benbow

Social network analysis, Data analytics, Data journalism

During his Nieman Fellowship year, Julian, a sports journalist, studied with Hong, a computational social scientist, to apply network science theory and methods to analyze the factors that lead to successful collaborations in lawmaking and in sports. This workshop empowers investigative and sports journalists to add network analysis skills to their repertoire. First, we introduce scientifically rigorous approaches to model social interactions and processes as graph data. Then, we dive deep into two case studies utilizing real-world datasets from Congress.gov and NBA Stats: Hong will teach how to build a step-by-step data pipeline for acquiring bill co-sponsorship data and transforming it into a bipartite network of congress member nodes and bill nodes, with edges linking members to the bills they cosponsor; Julian will present a network analysis of basketball passing networks. We demonstrate how to code using Python’s NetworkX library to calculate network statistics and derive insights about key players (centrality), motifs (recurring clusters or flows), and factors driving success, which can be defined as passing bills into laws, scoring points, or winning games. For visualization, we will use Flourish Studio’s interactive chart templates: Chord, Sankey, and Network. This hands-on workshop inspires and trains data journalists to brainstorm, incorporate, and implement social network analysis in their newsgathering and visual presentation.

A sample network visualization of abortion bills co-sponsorship: https://public.flourish.studio/visualisation/17132297/
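A minimal sketch of the co-sponsorship pipeline described in the workshop, using NetworkX; the list of (member, bill) pairs is a hypothetical stand-in for data acquired from Congress.gov:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical co-sponsorship records: (member, bill) pairs from Congress.gov.
cosponsorships = [
    ("Rep. A", "H.R. 1"), ("Rep. B", "H.R. 1"),
    ("Rep. A", "H.R. 2"), ("Rep. B", "H.R. 2"), ("Rep. C", "H.R. 2"),
]
members = {m for m, _ in cosponsorships}
bills = {b for _, b in cosponsorships}

# Bipartite graph: member and bill nodes, edges link members to bills they cosponsor.
B = nx.Graph()
B.add_nodes_from(members, bipartite="member")
B.add_nodes_from(bills, bipartite="bill")
B.add_edges_from(cosponsorships)

# Project onto members: edge weights count how many bills two members cosponsor together.
member_net = bipartite.weighted_projected_graph(B, members)

# A simple centrality measure to spot key players.
print(nx.degree_centrality(member_net))
```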


Revealing the localness of news domains through consumption patterns on social media

Alexi Quintana, Kai-Cheng Yang, Pranav Goel, Burak Ozturan and David Lazer

local news, audience, social media, computational method

Local news is an integral pillar of the US democratic process, uniquely positioned to report on local affairs and elections. Recent studies have spotlighted topics like the digital consumption of local news and the decline of local news agencies, often referred to as the news desert phenomenon. A key piece of quantifying the changing local media landscape is classifying news outlets as local or national. Traditional classification methods, however, often rely on self-designations and human judgment, potentially misrepresenting actual consumption patterns and overlooking nuanced distribution differences.

This study introduces a data-driven metric to measure the localness of news domains using a Twitter dataset where over one million accounts are geo-located by matching their profiles with voter registration data. We employ these users’ overall sharing frequency of various news domains as a surrogate for measuring statewide news consumption. We then quantify the disparity between the observed distribution of a news domain and its expected distribution across states. We hypothesize that national news domains will exhibit minimal state-specific deviations while local outlets will show pronounced deviations in particular states.
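The abstract does not specify the exact disparity measure; the sketch below illustrates one plausible choice (Kullback–Leibler divergence between a domain’s observed and expected state-level sharing distributions), offered purely as an assumption for illustration:

```python
import numpy as np

def localness_score(observed_shares: np.ndarray, expected_shares: np.ndarray) -> float:
    """KL divergence between a domain's observed distribution of shares across states
    and the expected distribution (e.g., the state distribution of all geo-located users).
    Higher values suggest a more local outlet. The choice of measure is illustrative."""
    p = observed_shares / observed_shares.sum()
    q = expected_shares / expected_shares.sum()
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A national outlet shared roughly in proportion to the user base scores near 0;
# an outlet shared almost only in one state scores much higher.
users_by_state = np.array([100.0, 200.0, 300.0, 400.0])
national_outlet = np.array([11.0, 19.0, 30.0, 40.0])
local_outlet = np.array([90.0, 5.0, 3.0, 2.0])
print(localness_score(national_outlet, users_by_state))  # small
print(localness_score(local_outlet, users_by_state))     # large
```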

Preliminary results suggest that our metric aligns well with classifications from prior research, confirming our hypothesis. Importantly, our metric unveils nuanced localization patterns by providing a continuous measure of an outlet’s localness and pinpointing the specific states where each local domain is over-represented. We plan to publicly release our metric and the localness scores for various domains, facilitating its use in broader research contexts.


Towards Identifying Local Content Deserts with Open-Source Large Language Models 📂

Marianne Aubin Le Quéré, Siyan Wang, Tazbia Fatima and Michael Krisch

news deserts, content deserts, large language models, local news, geotagging

News deserts have been defined as areas where residents do not have access to news and credible information; they are usually identified by whether an area has a physically proximate local news organization. In this project, we conceptualize content deserts: geographic areas that are systematically undercovered or not covered at all by the local press. We demonstrate an early approach to leveraging open-source large language models to identify article locations as well as key information about articles such as topic and community information need. We show that open-source language models can accurately identify the locations mentioned in a news article. When it comes to annotating local news articles, we show that the models perform well for tagging an article’s topic, but that other local categorizations do not perform as well. We deploy the best-performing model and prompt on a set of 1,000 articles from two publications and demonstrate how the annotations can help to identify content deserts. Looking forward, these methods will allow for the construction of auditing tools for journalists to view how their coverage differs by neighborhood along topical axes.


Workshop: Preparing for the day after: Covering elections through the inauguration

Jason Radford

Elections, Scenarios, Collaboration

Elections do not end on election night. From counts and recounts to procedural lawsuits, certification, and inauguration, traditionally pro forma election processes have become critical points of contention. Journalists and news organizations must cover these critical events filled with nuanced election rules and esoteric legal issues.

In this workshop, we run participants through a series of post-election scenarios. Participants will discuss how they and their newsrooms might approach covering these scenarios and share ideas for covering issues.

The goal of this workshop is for participants to think through potential scenarios and share resources and ideas for covering election-related stories after election night. Participants will also walk away with a better understanding of how other newsrooms may cover these events.


Perceptions and Corrections of Misleading News Headlines: Insights from Journalists and News Consumers 📂

Md Main Uddin Rony, Saransh Grover, Farhana Uddin, Yoo Yeon Sung, Mohammad Ali and Naeemul Hassan

Misleading Headline, Misinformation, Disinformation, Online News Consumption

The Internet’s vast information landscape allows for widespread consumption and publication of content, but not all information is of high quality. Misleading news headlines, where headlines do not accurately reflect their articles, pose significant risks by confusing audiences. Despite their impact on information ecosystems, they have received relatively little attention compared to other misinformation forms like fake news, rumors, and hoaxes. This study aims to address this gap by exploring the perceptions of journalists (news producers) and online news consumers. Our research investigates how journalists and news consumers define and react to misleading headlines and the types of corrections they provide when headlines are deemed misleading. By comparing these perspectives, we uncover common techniques and discrepancies in the identification and correction of misleading headlines. The insights gained inform the development of effective strategies to combat this type of misinformation, enhancing our understanding of the unique challenges posed by misleading news headlines and promoting a healthier information ecosystem.


Bias or Not? Exploring US Press Representations of Law Enforcement in Lynching Coverage, 1789–1963 📂

Mohamed Salama

Media Bias, Law Enforcement, Lynching, Natural Language Processing, Historical Analysis

This research is inspired by existing scholarly literature [1,3,4,5,8,9,10], which argues for the explicit or implicit involvement of the press and law enforcement in the historical practices of lynching in the US. These arguments are typically based on piecemeal evidence and isolated reports, which may lack comprehensiveness. Leveraging access to 60,000 newspapers spanning from 1789 to 1963, this study rigorously investigates the norm of objectivity by analyzing 1,767 stories specifically mentioning law enforcement to understand how the press portrayed these authorities. Given that the concept of journalistic objectivity was novel during this era, we will scrutinize tonality bias—defined as the positive or negative slant in reporting—based on the framework by Eberl et al. [2]. Our methodology employs a carefully designed model that integrates dependency parsing and sentiment analysis techniques. This approach allows for precise detection of objectivity and bias in journalistic reporting, adding a new layer to the investigation that enhances the depth of analysis and offers insights for future research.


An Annotated Dataset of U.S. Transgender News For Determining Agenda-Setting & Information Flows

Alyssa Smith, Sagar Kumar, Yukun Yang and Pranav Goel

agenda-setting, annotated dataset, information flows, trans studies

Mainstream news outlets set the agenda and terms of discussion for public discourse across various issues. As transgender people experience increasingly vitriolic attacks on their fundamental rights in the U.S., understanding the structure and dynamics of media discussions of transgender people becomes even more salient. We have collected a dataset of news articles about transgender people and trans issues from U.S. local and national media; our analyses using transfer entropy, a measure of information flow, indicate that while national media generally exert influence on local outlets in transgender discourses, the structure of such influence is often more complex than a simple one-step flow from national to local outlets. Our current approach employs Latent Dirichlet Allocation, an unsupervised method, to identify the major topics in transgender discourses. To better understand how more subtle aspects of coverage spread, we plan to incorporate human judgment into our workflow. We propose a dataset annotation pipeline for crowdsourced multi-label classification and taxonomy creation. It involves Amazon Mechanical Turk crowdworkers independently defining the topic space and assigning topic labels for articles using the DELUGE method (Bragg, 2013). DELUGE allows for probabilistic worker accuracy and asks questions of crowdworkers such that maximal information is gained from each question answered. The resulting annotated dataset will be invaluable to our research as well as to future studies in transgender discourses. It will be free from the biases domain experts might introduce, while still maintaining the richness and requisite world-knowledge that can come only from human annotations.
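A minimal sketch of the current topic-modeling step (Latent Dirichlet Allocation) using scikit-learn; the corpus variable and parameter choices are illustrative, not the study’s actual configuration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# `articles` is an illustrative stand-in for the collected news texts.
articles = [
    "state legislature debates bill on transgender healthcare access",
    "local school board discusses sports policy for trans students",
    "national outlet covers court ruling on gender-affirming care",
]

vectorizer = CountVectorizer(stop_words="english", min_df=1)
doc_term = vectorizer.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # per-document topic proportions

# Print the top words characterizing each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:8]]
    print(f"topic {k}: {', '.join(top_terms)}")
```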


Physical data visualization workshop

Andres Snitcofsky

dataviz, physical, space intervention, performative

In this engaging and hands-on workshop, journalists will explore the art of storytelling with data visualization through innovative and tangible means. Participants will be organized into teams and tasked with selecting open datasets, which they will then bring to life using a variety of physical materials such as tape, strings, balloons, and cardboard boxes, many of which are reused or leftover items. This creative approach encourages journalists to think beyond traditional screen-based visualizations and to craft compelling narratives that can be experienced in physical space.

Throughout the workshop, I will guide participants through each stage of the process, from dataset selection and conceptualization to the construction and presentation of their data stories. This collaborative environment fosters connections among participants and sparks meaningful discussions about the power and potential of data storytelling in various media.

The workshop will culminate in a final “show and tell” session, where teams will present their physical visualizations to the group, sharing insights and reflections on their creative processes and the stories they have uncovered. This final showcase not only highlights the diverse ways data can be visualized but also emphasizes the importance of innovation and collaboration in journalism.

Join us for a unique opportunity to enhance your data storytelling skills, connect with fellow journalists, and discover new perspectives on visualizing information in compelling and accessible ways.


Strategies for Success: Dos and Don’ts in the Digital Transformation of Media Companies

Catherine Sotirakou, Katerina Mandenaki, Anastasia Karampela and Constantinos Mourlas

Artificial Intelligence, Data Journalism, Digital Transformation, Media Industries

In an era defined by digitization, the media industry struggles to avoid “Digital Darwinism”. Our work investigates the current state of digital transformation in four European countries and focuses on digital literacy, artificial intelligence, monetization strategies, and organizational adaptation. We conducted a quantitative survey administered to 150 professionals from roughly 90 media organizations across Greece, France, Portugal, and Cyprus. The results depict a digital transformation in the media industry that has yet to fully embrace the potential of AI and data analytics. Publishers’ hesitation reflects broader concerns about technology’s influence on journalism, although there is widespread interest in exploring how AI can be utilized to enhance the quality of news and boost revenue. The study further delves into digital literacy and highlights the necessity for targeted training in certain areas, since most of the participants are not aware of existing AI tools for journalism and marketing. Furthermore, managers in the media sector seem to overlook the potential for new business models to reconfigure existing profit-making structures and remain focused on advertisements and sponsored content as their principal revenue sources.

This work is done in the context of IQ Media Hub, which, based on this knowledge, has designed a forward-looking curriculum that spans from basic digital literacy to advanced technological applications, aiming to bridge the gap and prepare media organizations for the digital future.

The above study is co-funded by the European Union under the programme “IQ Media: A Collaborative Framework Towards Business Transformation, Innovation, Quality Journalism, and Advanced Digital Skills in the Media Environment covering Greece, Cyprus, France, and Portugal” (“IQMedia”), agreement number 101112285.


Surfacing Newsworthy Public Documents As Leads 📂

Alexander Spangher, Emilio Ferrara, Ben Welsh, Nanyun Peng, Serdar Tumgoren and Jonathan May

Newsworthiness prediction, Document Linking, Computational Journalism

“Newsworthiness prediction” means trying to assess whether a particular story lead will get covered or not. In this work, we build newsworthiness predictors to find stories in voluminous city council meetings, aiming to reduce the time and effort it takes journalists to find stories. Building training datasets for this task is challenging: it is hard, ex post facto, to prove that a certain policy has or has not been covered. We address this problem by implementing a novel probabilistic relational modeling framework, which we show is a low-annotation methodology that outperforms other, more state-of-the-art retrieval-based baselines. We scale this linking methodology across 13k city council policies from San Francisco Board of Supervisors meetings and 200k articles from the San Francisco Chronicle over 10 years of public policy meetings, finding that about 7% of policies get covered. Finally, we used this linked dataset to fine-tune language models to consume policy text, transcribed video, public discussion, and other features, and predict the likelihood of coverage. We perform a human evaluation with expert journalists and show that our systems identify newsworthy policies with a 68% F1-score and that our coverage recommendations are helpful, with an 84% win rate against baseline.


Missing in Detroit: Creating a framework for equitable reporting and minimizing harm 📂

Kristi Tanner, Anjanette Delgado and Stephen Harding

missing persons, automation, AI

Slightly more than one person per day is reported missing in Detroit. Journalists are inundated with more emails and social media posts than they can manage from law enforcement seeking the public’s help in solving missing person cases. Historically, socioeconomic factors such as race, gender, education, age, and income have unfairly elevated coverage of some over others. This talk will explain the automation process we created to cover every missing person reported in Detroit and keep track of when people are found – a machine learning and NLG workflow with journalists in the loop. We will review research and decisions made prior to publication to reduce harm and increase the opportunity for impact. We’re excited to show the careful considerations that newsrooms have to make when working with AI and sensitive subjects. For example, after further research and a conversation with Karen Shalev, professor of missing person studies at the University of Portsmouth, England, we chose to reference only the first name of missing children. Dropping the last name of missing youths aims to reduce the digital footprint for children who are later recovered, since a “right to be forgotten” does not currently exist. Another edit, a change in frequency and format except in the case of serious missing alerts, was made after reading the work of Lampinen & Moore (2016), who discuss public fatigue and the “car alarm effect.” Finally, we will continue conversations with law enforcement and researchers on increasing media coverage of missing persons without increasing vulnerabilities. The Missing in Detroit project launches in June.


Syntaktis: A Large-Language-Model-Backed Editing Interface for Supporting Ethical Journalism Practices 📂

Leah Teichholtz, Katy Gero and Elena Glassman

human-AI interaction, large language models, ethical journalism, editing, computational journalism

We present Syntaktis, a novel interactive computational journalism tool backed by a large language model that automates the identification of several key ethical journalism problems. The interface aims to augment human journalists, not replace them, by offering optional feedback and revisions like an experienced editor. To do so, Syntaktis is trained on the ethical journalism principles of the Society for Professional Journalists, which represent a gold standard taught in journalism schools and used by newspapers nationwide. We evaluate the efficacy of the interface through a user study with 14 student journalists and the quality of its output with a technical evaluation performed by two professional editors.


Using Generative Agents to Create Tip Sheets for Investigative Data Reporting 📂

Joris Veerbeek and Nicholas Diakopoulos

generative agents, computational news discovery, artificial intelligence, investigative data reporting, computational journalism

This paper introduces a system using generative AI agents to create tip sheets for investigative data reporting. Our system employs three specialized agents—analyst, reporter, and editor—to collaboratively generate and refine tips from datasets. We validate this approach using real-world investigative stories, demonstrating that our agent-based system generally generates more newsworthy and accurate insights compared to a baseline model without agents, although some variability was noted between different stories. Our findings highlight the potential of generative AI to provide leads for investigative data reporting.


What does it take to create an award-winning data public good? Behind the scenes of producing the Atlas of Sustainable Development Goals 2023

Divyanshi Wadhwa and Alice Thudt

global development, data visualization, data storytelling, global public goods

The Atlas of Sustainable Development Goals 2023 (https://datatopics.worldbank.org/sdgatlas/) presents interactive storytelling and data visualizations about the 17 Sustainable Development Goals. It highlights trends for selected targets within each goal and describes how some SDGs are measured. In this talk, we will showcase some data stories and visualizations from the Atlas that won the top “Most Beautiful” award at last year’s “Information is Beautiful Awards” and explain how we approach digital storytelling and data visualization at the World Bank with a view to creating a data public good. The session’s objectives include:

  1. Offering strategies for and discussing challenges around: a) achieving the right balance between complexity, accuracy, and effectiveness in telling data stories (about global development); b) identifying meaningful and compelling stories from large data sets; and c) developing creative and powerful data visuals that not only inform and educate, but also engage audiences.
  2. Discussing the benefits and challenges of producing the Atlas in cross-functional collaboration among data scientists, economists, writers, and data visualization developers.
  3. Describing our approach to making a data visualization publication a public good through data transparency and an open data and open code approach.

What We’ve Learned: Teaming up journalism and computer science students to produce data-driven investigations for news outlets

Brooke Williams

data-driven, computational, journalism, news partners, collaborations, student newsrooms

We launched the Justice Media Computational Journalism co-Lab in January 2021, and in the years since, students in the co-Lab have published about 20 data-driven investigative news stories with nine news organizations. Along the way, we have learned a lot about creating a student-driven newsroom, finding partners, working with editors and reporters, and what we can do, or not do, to help ensure the collaborations are successful and the projects precise and impactful.

We see the co-Lab as one part of a solution to the lack of resources in newsrooms needed to consistently produce this type of vital yet time consuming journalism.

Justice Media co-Lab students — both journalism and computer science — have published an investigation on the front page of the Boston Globe based on a database they built of grievances from people in prison. Interdisciplinary student teams — as part of the Justice Media course offered in spring or fall or via a summer internship — dug deep into data to publish a project on the racism behind today’s inequitable impacts of extreme heat for the Emancipator. They’ve shed light on allegations of prison guards abusing their power for GBH, examined political rhetoric leading up to the midterm elections for USA Today, and used computational methods to uncover a persistent gender pay gap at the University of Massachusetts for NBC Boston, among many other projects and partners. We are launching a Justice Media co-Lab fellowship this year.

We currently have two summer interns and are on the runway to publish a nationwide examination with ProPublica and another for a national investigative nonprofit — slated to be published in the coming months.

We could include faculty, students and others in this talk.


Public Concerns Over AI-Supercharging Misinformation in the 2024 US Presidential Election

Harry Yaojun Yan, Garrett Morrow, Kai-Cheng Yang and John Wihbey

generative AI, misinformation, election, news consumption, survey

Researchers and the media have highlighted the potential adverse effects of artificial intelligence (AI) on the 2024 US Presidential election. However, it remains unclear how concerned the public is about AI’s role in spreading misinformation and which information sources contribute to these concerns.

To address this gap, this study surveyed 1,001 Americans and found that more than four out of five (83.32%) were worried about AI being used to spread misinformation in the upcoming election. This high prevalence of concern was consistent across various demographic groups. News consumption, particularly television news, contributes to the high prevalence of concern. In contrast, knowledge of ChatGPT’s development and direct experience with generative AI tools, such as ChatGPT and DALL-E, do little to alleviate these concerns.

The high prevalence of concerns about AI spreading misinformation in the upcoming elections is likely due to a combination of worries about election integrity in general, fear of the disruptive potential of AI technology, and its sensationalized news coverage. Although it remains uncertain whether these concerns are warranted, our findings highlight the need for AI literacy campaigns that focus on building knowledge rather than fostering fear in the public.


Introduction to National Internet Observatory

Kai-Cheng Yang, Pranav Goel, Jeffrey Gleason and Alvaro Feal

news consumption, information consumption, data donation, social media, browsing data, mobile data

The National Internet Observatory (NIO) aims to help researchers study online behavior. Participants install a browser extension and/or mobile apps to donate their online activity data along with comprehensive survey responses. The infrastructure will offer approved researchers access to a suite of structured, parsed content data for selected social media and news outlets. This data will help journalists and researchers understand participants’ news and information consumption on social media and the various pathways taken to reach particular news websites and articles on the Internet. The whole process is conducted within a robust research ethics framework, emphasizing ongoing informed consent and multiple layers, technical and legal, of interventions to protect the values at stake in data collection, data access, and research.

The proposed workshop will provide a brief overview of the NIO infrastructure, the data collected, the participants, and the researcher intake process. Participants are expected to learn about NIO, build connections, and potentially initiate collaborations through the workshop.

At the time of this submission, the workshop had not been presented at other conferences, but we are planning to deliver it at a couple of different venues in the coming months.

We would need a projector for the presentation and demonstration.


Finding and using undocumented APIs

Leon Yin and Piotr Sapiezynski

Web scraping, Data collection, API, Case studies, YouTube, Google, Amazon, Internet service provider

This workshop will introduce reporters and researchers to an exciting and overlooked data source found on most websites: undocumented APIs.

As opposed to documented and official APIs, undocumented APIs are unofficial and hidden in plain sight. They execute essential functions behind the scenes, and many of them are so mundane that most people don’t even realize anything is happening.

Undocumented APIs are a key tool for investigations, serving as a public data source when access is otherwise out of reach.

Speakers will introduce the topic, and go through case studies: investigations into Amazon private label products, Google’s keyword blocklist for YouTube advertisers, and mapping the digital divide in the United States using millions of Internet service plans. Speakers will discuss the thought process and trials involved with each case study, imparting tips and best practices along the way.

Importantly, participants will learn how to find and use undocumented APIs through a hands-on exercise and will be encouraged to find APIs in the wild during independent practice.
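To give a flavor of the hands-on exercise, here is a minimal sketch of calling an undocumented JSON endpoint once it has been spotted in the browser’s network inspector. The URL, parameters, and response fields below are entirely hypothetical:

```python
import requests

# Hypothetical endpoint spotted in the browser dev tools "Network" tab.
URL = "https://www.example.com/api/v2/plans"
PARAMS = {"zip": "02115", "page": 1}
HEADERS = {"User-Agent": "research-script/0.1 (contact: reporter@example.org)"}

response = requests.get(URL, params=PARAMS, headers=HEADERS, timeout=30)
response.raise_for_status()
payload = response.json()

# Field names below are placeholders; inspect the real JSON to find the ones you need.
for plan in payload.get("results", []):
    print(plan.get("provider"), plan.get("speed_mbps"), plan.get("monthly_price"))
```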

This workshop was presented at NICAR, FAccT, and C+J and has been gradually improved based on the feedback from the previous sessions.

The contents of the workshop can be found at inspectelement.org/apis.html. Participants can use their own computers, so only a projector screen and an internet connection are necessary to run this workshop.

The workshop will be led by Leon Yin, an investigative journalist at Bloomberg News, and Piotr Sapiezynski, an algorithm audit researcher at Northeastern University.


Labeling AI-Generated News Content: Matching Journalist Intentions with Audience Expectations 📂

Jessica Zier and Nicholas Diakopoulos

Generative AI, Transparency, Disclosure, Labeling

Improvements in generative AI functionality and accessibility offer journalists a powerful tool to assist with content production, workflows, and efficiency. This growing use of generative AI for news production calls for journalists to increase transparency around how they use the technology. Recent policy developments, such as Article 50 of the AI Act, further underscore the necessity of AI transparency. Disclosure, in the form of labeling, of AI-generated content is one such transparency strategy that is becoming increasingly common in news media. However, this disclosure is meaningless if readers are unsure of how to interpret the labels, and the approach could backfire if there is a mismatch between the well-intentioned goal of labeling for transparency and reader interpretations of what these labels signal. Since there is no uniformity and guidelines on how exactly to implement labeling are piecemeal, this paper offers a starting point for bridging the knowledge gap between AI transparency as a principle and AI transparency as a practice, to better align journalists’ transparency goals with reader expectations around AI-use disclosure.