Research directions

How can we tell why an NLP system produces a certain output?

How do we know and make sure that an NLP system will work well in the real world?

The current SOTA models are huge. How do we make them more efficient?

Explainable and transparent NLP

NLP systems based on large language models are increasingly deployed in real-world applications and have a real impact on the lives of many people. However, we still do not have reliable methods to explain the ‘reasoning’ backing their outputs, and many current systems lack even minimal transparency about their design and training data.

My current work in this area is backed by a 2024 Villum Young Investigator grant (see upcoming positions). It focuses on the problem of attributing the output of generative language models to their training data. This project has planned collaborations with the Allen Institute for AI, Carnegie Mellon University, HuggingFace, and RIKEN-CSS.

Some relevant past work:

Cornish, C., & Rogers, A. (2025). Examining the Faithfulness of Deepseek R1’s Chain-of-Thought Reasoning. In A. Sinha, R. Vázquez, T. Mickus, R. Agarwal, I. Buhnila, P. Schmidtová, … J. Tiedemann (Eds.), Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025) (pp. 11–19). Mumbai, India: Association for Computational Linguistics.
Chain-of-Thought (CoT) ‘reasoning’ promises to enhance the performance and transparency of Large Language Models (LLMs). Models, such as Deepseek R1, are trained via reinforcement learning to automatically generate CoT explanations in their outputs. Their faithfulness, i.e. how well the explanations actually reflect their internal reasoning process, has been called into doubt by recent studies (Chen et al., 2025a; Chua and Evans, 2025). This paper extends previous work by probing Deepseek R1 with 445 logical puzzles under zero- and few-shot settings. We find that whilst the model explicitly acknowledges a strong harmful hint in 94.6% of cases, it reports less than 2% of helpful hints. Further analysis reveals implicit unfaithfulness as the model significantly reduces answer-rechecking behaviour for helpful hints (p<0.01) despite rarely mentioning them in its CoT, demonstrating a discrepancy between its reported and actual decision process. In line with prior reports for GPT, Claude, Gemini and other models, our results for DeepSeek raise concerns about the use of CoT as an explainability technique.
Højer, B., Thorn Jakobsen, T. S., Rogers, A., & Heinrich, S. (2025). Research Community Perspectives on ’Intelligence’ and Large Language Models. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Findings of the Association for Computational Linguistics: ACL 2025 (pp. 25796–25812). https://doi.org/10.18653/v1/2025.findings-acl.1324 Best Poster award at D3A2025
Despite the widespread use of ‘artificial intelligence’ (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by “intelligence”. To that end, we present the results of a survey on the notion of “intelligence” among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, & reasoning. Our results suggest that the perception of the current NLP systems as “intelligent” is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.
Nielsen, M. L., Raaschou-Pedersen, J. S., Chrisander, E., Lassen, D. D., Grenet, J., Rogers, A., & Bjerre-Nielsen, A. (2025). Trading off performance and human oversight in algorithmic policy: evidence from Danish college admissions.
Student dropout is a significant concern for educational institutions due to its social and economic impact, driving the need for risk prediction systems to identify at-risk students before enrollment. We explore the accuracy of such systems in the context of higher education by predicting degree completion before admission, with potential applications for prioritizing admissions decisions. Using a large-scale dataset from Danish higher education admissions, we demonstrate that advanced sequential AI models offer more precise and fair predictions compared to current practices that rely on either high school grade point averages or human judgment. These models not only improve accuracy but also outperform simpler models, even when the simpler models use protected sociodemographic attributes. Importantly, our predictions reveal how certain student profiles are better matched with specific programs and fields, suggesting potential efficiency and welfare gains in public policy. We estimate that even the use of simple AI models to guide admissions decisions, particularly in response to a newly implemented nationwide policy reducing admissions by 10 percent, could yield significant economic benefits. However, this improvement would come at the cost of reduced human oversight and lower transparency. Our findings underscore both the potential and challenges of incorporating advanced AI into educational policymaking.
Piktus, A., Akiki, C., Villegas, P., Laurençon, H., Dupont, G., Luccioni, S., … Rogers, A. (2023). The ROOTS Search Tool: Data Transparency for LLMs. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 304–314. Toronto, Canada: Association for Computational Linguistics.
ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces: https://huggingface.co/spaces/bigscience-data/roots-search. We describe our implementation and the possible use cases of our tool.
Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., … Mitchell, M. (2022). Data Governance in the Age of Large-Scale Data-Driven Language Technology. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2206–2222. https://doi.org/10.1145/3531146.3534637
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
González, A. V., Rogers, A., & Søgaard, A. (2021). On the Interaction of Belief Bias and Explanations. Findings of ACL-IJCNLP 2021, 2930–2942. Online: ACL.
A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn’t clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans’ prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.
Kovaleva, O., Kulshreshtha, S., Rogers, A., & Rumshisky, A. (2021). BERT Busters: Outlier Dimensions That Disrupt Transformers. Findings of ACL-IJCNLP 2021, 3392–3405. Online: ACL.
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs (<0.0001% of model weights). In case of BERT and other pre-trained encoder Transformers, the affected component is the scaling factors and biases in the LayerNorm. The outliers are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. We show that disabling them significantly degrades both the MLM loss and the downstream task performance. This effect is observed across several BERT-family models and other popular pre-trained Transformer architectures, including BART, XLNet and ELECTRA; we also show a similar effect in GPT-2.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866.
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.
Kovaleva, O., Romanov, A., Rogers, A., & Rumshisky, A. (2019). Revealing the Dark Secrets of BERT. Proceedings of EMNLP-IJCNLP, 4356–4365. https://doi.org/10.18653/v1/D19-1445
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information encoded by the individual BERT’s heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating the overall model overparametrization. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.

Safe and Robust NLP

I use “safety” in the engineering sense of the word: NLP systems should actually do what their developers promise, the same way as, e.g., construction engineers ensure that the bridges they build withstand the target load. This borders on the problem of robustness or generalization: NLP systems are trained on some data, and to perform in the real world they need to generalize to real-world data.

My current work in this area is backed by a 2023 DFF Inge Lehmann grant (see upcoming positions). It focuses on the development of a benchmark that would reward systems for generalizing rather than memorizing their training data. This project has a planned collaboration with New York University. I also host an industrial PhD student co-funded by Innovation Fund Denmark, whose work focuses on robust assistance with clinical note entry.

Some relevant past work:

Güven, A. B., Rogers, A., & Goot, R. V. D. (2025). Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models? In L. Charpentier, L. Choshen, R. Cotterell, M. O. Gul, M. Y. Hu, J. Liu, … A. Williams (Eds.), Proceedings of the First BabyLM Workshop (pp. 288–300). Suzhou, China: Association for Computational Linguistics.
We examine the syntactic properties of the BabyLM corpus, and age-groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that the syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvements come from using the subset of syntactically categorizable data, rather than the full noisy corpus.
Motzfeldt, A. G., Edin, J., Christensen, C. L., Hardmeier, C., Maaløe, L., & Rogers, A. (2025). Code Like Humans: A Multi-Agent Solution for Medical Coding. In C. Christodoulopoulos, T. Chakraborty, C. Rose, & V. Peng (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2025 (pp. 22612–22627). Suzhou, China: Association for Computational Linguistics.
In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce ‘Code Like Humans’: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD-10 coding system (+70K labels). It achieves the best performance to date on rare diagnosis codes. Fine-tuned discriminative classifiers retain an advantage for high-frequency codes, to which they are limited. Towards future work, we also contribute an analysis of system performance and identify its ‘blind spots’ (codes that are systematically undercoded).
Puccetti, G., Rogers, A., Alzetta, C., Dell’Orletta, F., & Esuli, A. (2024). AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15312–15338). Bangkok, Thailand: Association for Computational Linguistics. Area Chair Award
Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic. We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge. Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.
Rogers, A., & Luccioni, S. (2024). Position: Key Claims in LLM Research Have a Long Tail of Footnotes. Forty-first International Conference on Machine Learning.
Much of the recent discourse within the ML community has been centered around Large Language Models (LLMs), their functionality and potential – yet not only do we not have a working definition of LLMs, but much of this discourse relies on claims and assumptions that are worth re-examining. We contribute a definition of LLMs, critically examine five common claims regarding their properties (including ’emergent properties’), and conclude with suggestions for future research directions and their framing.
Kuznetsov, I., Afzal, O. M., Dercksen, K., Dycke, N., Goldberg, A., Hope, T., … Gurevych, I. (2024). What Can Natural Language Processing Do for Peer Review? arXiv.
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review – a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review – manuscripts, reviews, discussions – are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
Rogers, A., Gardner, M., & Augenstein, I. (2022). QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. ACM CSUR. https://doi.org/10.1145/3560260
Alongside huge volumes of research on deep learning models in NLP in the recent years, there has been also much work on benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “reasoning types” in question answering and propose a new taxonomy. We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages and multilingual resources. The study is aimed at both practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.
Ray Choudhury, S., Rogers, A., & Augenstein, I. (2022). Machine Reading, Fast and Slow: When Do Models ’Understand’ Language? Proceedings of the 29th International Conference on Computational Linguistics, 78–93. Gyeongju, Republic of Korea.
Two of the most fundamental issues in Natural Language Understanding (NLU) at present are: (a) how it can be established whether deep learning-based models score highly on NLU benchmarks for the ’right’ reasons; and (b) what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic ’skills’: coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be ’reading slowly’, and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the ’right’ information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.
Bhargava, P., Drozd, A., & Rogers, A. (2021). Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics. Proceedings of the Second Workshop on Insights from Negative Results in NLP, 125–135. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
Much of recent progress in NLU was shown to be due to models’ learning dataset-specific heuristics. We conduct a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 11.
The recent explosion in question answering research produced a wealth of both factoid RC and commonsense reasoning datasets. Combining them presents a different kind of task: not deciding simply whether information is present in the text, but also whether a confident guess could be made for the missing information. To that end, we present QuAIL, the first reading comprehension dataset (a) to combine text-based, world knowledge and unanswerable questions, and (b) to provide annotation that would enable precise diagnostics of the reasoning strategies by a given QA system. QuAIL contains 15K multi-choice questions for 800 texts in 4 domains (fiction, blogs, political news, and user story texts). Crucially, to solve QuAIL a system would need to handle both general and text-specific questions, impossible to answer from pretraining data. We show that the new benchmark poses substantial challenges to the current state-of-the-art systems, with a 30% drop in accuracy compared to the most similar existing dataset.

Sustainable NLP

The current cutting-edge NLP systems are based on large language models, some with hundreds of billions of parameters. As they are increasingly embedded in everyday applications and used by millions of people, the carbon costs of their use are also skyrocketing. It is imperative that in the future we find more efficient methods to achieve the same or better levels of performance.

My current work in this area is backed by a 2023 DFF Inge Lehmann grant (see upcoming positions). It focuses on the development of a benchmark suite that deliberately caps pre-training and test data, so as to encourage machine learning research on more efficient solutions. This project has a planned collaboration with New York University.

Some relevant past work:

Prasanna, S., Rogers, A., & Rumshisky, A. (2020). When BERT Plays the Lottery, All Tickets Are Winning. Proceedings of EMNLP, 3208–3229. Online: ACL.
Much of the recent success in NLP is due to the large Transformer-based models such as BERT (Devlin et al, 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibilities to learn what knowledge BERT actually uses at inference time.

Lab

Amelie Wührl
data attribution, fact-checking

postdoc

Nikolas Vitsakis
NLP for Social Science Research, AI ethics

postdoc

Arturo Valdivia
User modeling, NLP for Social Good

postdoc (joint affiliation with CAISA)

Andreas Geert Motzfeldt
interpretability, robustness in clinical NLP

PhD student co-supervised with Christian Hardmeier

Arzu Burcu Güven
robustness, generalization across linguistic features

PhD student co-supervised with Rob van der Goot

Bertram Højer
interpretability, model analysis

PhD student co-supervised with Stefan Heinrich

Mattes Ruckdeschel
data attribution, argumentation analysis

PhD student co-supervised with Toine Bogers

Johannes Gabriel Sindlinger
data attribution, interpretability

PhD student co-supervised with Leon Derczynski

The lab is part of the NLPNorth research group, with 4 other full-time faculty working in NLP. We are also a part of the AI Pioneer center, where it is possible to interact with other NLP researchers at the University of Copenhagen and other institutions. Here are some reflections by NLPNorth PhD students on what it’s like to live and study in Denmark.

Alumni

Max Müller-Eberstein
generalization, data efficiency

postdoc, currently postdoc at the University of Tokyo

PhD and Postdoc Positions

Upcoming funded positions:

The current positions have been filled; any future positions will be announced on this page and the ITU portal. I do not have the possibility to host interns. I maintain a list of upcoming talks and events, where it might be possible to meet in person.

Getting your own funding (if you have your own idea for a PhD or postdoc that you’d like to pursue with me):

DARA and DDSA funding opportunities (have a look at prior/current calls for PhD and postdoc applications)
Marie Curie postdocs: the next application round is in fall 2026. For EU-based applicants it is possible to obtain funding for a short visit to ITU in spring 2026 to develop the application.

If you’d like me to support your funding application, please get in touch and let me know which specific research interests we have in common (based on the above lab research directions or my past work).

Logistics:

Generally, to be enrolled in the Ph.D. school in Denmark, you need to have a 2-year Master’s degree. The only possible exception at ITU is candidates who have 180 ECTS points from their B.Sc. program plus at least 60 ECTS points of master’s level studies (a total of 240 ECTS points). Such candidates would need to start by spending a year to finish their Master’s degree, and they would receive a significantly lower salary for two years, so unfortunately this is not a very good deal.
The PhD in Denmark is fixed-term (3 years). It is possible to take breaks to go on internships.
Non-EU candidates will need to receive a visa before the start of the studies, which usually takes about 3 months (after you receive and accept the offer).
You don’t have to learn Danish, either for professional or everyday life.

Thesis and project supervision at ITU

If you’re a B.Sc. or M.Sc. student at IT University of Copenhagen, and you would like to work with me, please:

reach out stating [ITU M.Sc. thesis], [ITU B.Sc. thesis] or [ITU research project] at the start of the subject line.
introduce yourself and state:
- the topic(s) that could be in our shared interests (see mine below)
- your background and any relevant experience. For an NLP project, you would ideally have taken an NLP course (even online or self-taught), or have hands-on experience with the subject matter relevant to your preferred topic.

Here are some of the research directions that I would be interested in:

Language model analysis: identifying the types of knowledge acquired from language model pre-training.
Language processing strategies: do NLP models perform well for the right reasons? What strategies should they follow when solving reasoning tasks?
Robustness and generalization: do NLP models reliably perform their tasks out of training distribution, and what can we do to help them?
NLP system auditing and documentation: establishing the cases where a system is safe to deploy
NLP system interpretability: how can we establish how a deep learning model arrives at its decisions?
Sustainable NLP: how can we build systems that work well, but don’t require billions of parameters and terabytes of data?

Feel free to also look at my recent publications and see if there’s anything you’d like to build on.