Publications

Total pubs: 52

Journal articles: 3
Conference and workshop articles: 31, including 18 in top-tier venues
Edited volumes: 12
Preprints and other non-peer-reviewed publications (excluding blog articles): 5

Bibliometrics: h-index 25, 6.2K+ citations on Google Scholar

2024

Puccetti, G., Rogers, A., Alzetta, C., Dell’Orletta, F., & Esuli, A. (2024). AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15312–15338). Bangkok, Thailand: Association for Computational Linguistics. Area Chair Award
Large Language Models (LLMs) are increasingly used as ‘content farm’ models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic.We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real ‘content farm’. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge.Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.
Rogers, A., & Luccioni, S. (2024). Position: Key Claims in LLM Research Have a Long Tail of Footnotes. Forty-first International Conference on Machine Learning.
Much of the recent discourse within the ML community has been centered around Large Language Models (LLMs), their functionality and potential – yet not only do we not have a working definition of LLMs, but much of this discourse relies on claims and assumptions that are worth re-examining. We contribute a definition of LLMs, critically examine five common claims regarding their properties (including ’emergent properties’), and conclude with suggestions for future research directions and their framing.
Tafreshi, S., Akula, A., Sedoc, J., Drozd, A., Rogers, A., & Rumshisky, A. (Eds.). (2024). Proceedings of the Fifth Workshop on Insights from Negative Results in NLP. Mexico City, Mexico: Association for Computational Linguistics.
Kuznetsov, I., Afzal, O. M., Dercksen, K., Dycke, N., Goldberg, A., Hope, T., … Gurevych, I. (2024). What Can Natural Language Processing Do for Peer Review? arXiv.
The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review – a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review – manuscripts, reviews, discussions – are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace. Yet, where exactly is help needed, where can NLP help, and where should it stand aside? The goal of our paper is to provide a foundation for the future efforts in NLP for peer-reviewing assistance. We discuss peer review as a general process, exemplified by reviewing at AI conferences. We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance, illustrated by existing work. We then turn to the big challenges in NLP for peer review as a whole, including data acquisition and licensing, operationalization and experimentation, and ethical issues. To help consolidate community efforts, we create a companion repository that aggregates key datasets pertaining to peer review. Finally, we issue a detailed call for action for the scientific community, NLP and AI researchers, policymakers, and funding bodies to help bring the research in NLP for peer review forward. We hope that our work will help set the agenda for research in machine-assisted scientific quality control in the age of AI, within the NLP community and beyond.
Rogers, A., Karpinska, M., Gupta, A., Lialin, V., Smelkov, G., & Rumshisky, A. (2024). NarrativeTime: Dense Temporal Annotation on a Timeline. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 12053–12073). Torino, Italia: ELRA and ICCL.
For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLINKs. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of the classic TimeBankDense corpus (American English), which shows comparable agreement with a signigicant increase in density. We contribute TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, and baseline results.
Savcisens, G., Eliassi-Rad, T., Hansen, L. K., Mortensen, L. H., Lilleholt, L., Rogers, A., … Lehmann, S. (2024). Using Sequences of Life-Events to Predict Human Lives. Nature Computational Science, 4(1), 43–56.
https://arxiv.org/pdf/2306.03009.pdf
Jiménez-Sánchez, A., Avlona, N.-R., Juodelyte, D., Sourget, T., Vang-Larsen, C., Rogers, A., … Cheplygina, V. (2024). Copycats: the many lives of a publicly available medical imaging dataset. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare. The accuracy, robustness, and fairness of diagnostic algorithms depend on the data (and its quality) used to train and evaluate the models. MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace. While open data is important to enhance the redistribution of data’s public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets. In this paper, we conduct an analysis of publicly available machine learning datasets on CCPs, discussing datasets’ context, and identifying limitations and gaps in the current CCP landscape. We highlight differences between MI and computer vision datasets, particularly in the potentially harmful downstream effects from poor adoption of recommended dataset management practices. We compare the analyzed datasets across several dimensions, including data sharing, data documentation, and maintenance. We find vague licenses, lack of persistent identifiers and storage, duplicates, and missing metadata, with differences between the platforms. Our research contributes to efforts in responsible data curation and AI algorithms for healthcare.

2023

Luccioni, A. S., & Rogers, A. (2023). Mind Your Language (Model): Fact-Checking LLMs and Their Role in NLP Research and Practice. arXiv (under review).
Much of the recent discourse within the NLP research community has been centered around Large Language Models (LLMs), their functionality and potential – yet not only do we not have a working definition of LLMs, but much of this discourse relies on claims and assumptions that are worth re-examining. This position paper contributes a definition of LLMs, explicates some of the assumptions made regarding their functionality, and outlines the existing evidence for and against them. We conclude with suggestions for research directions and their framing in future work.
Rogers, A., Karpinska, M., Boyd-Graber, J., & Okazaki, N. (2023). Program Chairs’ Report on Peer Review at ACL 2023. Toronto, Canada: Association for Computational Linguistics.
We present a summary of the efforts to improve conference peer review that were implemented at ACL’23. This includes work with the goal of improving review quality, clearer workflow and decision support for the area chairs, as well as our efforts to improve paper-reviewer matching for various kinds of non- mainstream NLP work, and improve the overall incentives for all participants of the peer review process. We present analysis of the factors affecting peer review, identify the most problematic issues that the authors complained about, and provide suggestions for the future chairs. We hope that publishing such reports would (a) improve transparency in decision-making, (b) help the people new to the field to understand how the *ACL conferences work, (c) provide useful data for the future chairs and workshop organizers, and also academic work on peer review, and (d) provide useful context for the final program, as a source of information for meta-research on the structure and trajectory of the field of NLP.
Rogers, A., Boyd-Graber, J., & Okazaki, N. (Eds.). (2023). Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics.
Rogers, A., Boyd-Graber, J., & Okazaki, N. (Eds.). (2023). Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Toronto, Canada: Association for Computational Linguistics.
Rogers, A., Boyd-Graber, J., & Okazaki, N. (Eds.). (2023). Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics.
Piktus, A., Akiki, C., Villegas, P., Laurençon, H., Dupont, G., Luccioni, S., … Rogers, A. (2023). The ROOTS Search Tool: Data Transparency for LLMs. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 304–314. Toronto, Canada: Association for Computational Linguistics.
ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest language model explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces: https://huggingface.co/spaces/bigscience-data/roots-search. We describe our implementation and the possible use cases of our tool.
Can, B., Mozes, M., Cahyawijaya, S., Saphra, N., Kassner, N., Ravfogel, S., … Voita, L. (Eds.). (2023). Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023). Toronto, Canada: Association for Computational Linguistics.
Tafreshi, S., Akula, A., Sedoc, J., Drozd, A., Rogers, A., & Rumshisky, A. (Eds.). (2023). The Fourth Workshop on Insights from Negative Results in NLP. Dubrovnik, Croatia: Association for Computational Linguistics.

2022

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., … Wolf, T. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arxiv.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Laurençon, H., Saulnier, L., Wang, T., Akiki, C., Villanova del Moral, A., Le Scao, T., … Jernite, Y. (2022). The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. Thirty-Sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. Presented at the New Orleans, United States. New Orleans, United States.
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Rogers, A., Gardner, M., & Augenstein, I. (2022). QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. ACM CSUR. https://doi.org/https://doi.org/10.1145/3560260
Alongside huge volumes of research on deep learning models in NLP in the recent years, there has been also much work on benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of “reasoning types" in question answering and propose a new taxonomy. We also discuss the implications of over-focusing on English, and survey the current monolingual resources for other languages and multilingual resources. The study is aimed at both practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.
Thorn Jakobsen, T., & Rogers, A. (2022). What Factors Should Paper-Reviewer Assignments Rely On? Community Perspectives on Issues and Ideals in Conference Peer-Review. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4810–4823. Seattle, United States: Association for Computational Linguistics.
Both scientific progress and individual researcher careers depend on the quality of peer review, which in turn depends on paper-reviewer matching. Surprisingly, this problem has been mostly approached as an automated recommendation problem rather than as a matter where different stakeholders (area chairs, reviewers, authors) have accumulated experience worth taking into account. We present the results of the first survey of the NLP community, identifying common issues and perspectives on what factors should be considered by paper-reviewer matching systems. This study contributes actionable recommendations for improving future NLP conferences, and desiderata for interpretable peer review assignments.
Jernite, Y., Nguyen, H., Biderman, S., Rogers, A., Masoud, M., Danchev, V., … Mitchell, M. (2022). Data Governance in the Age of Large-Scale Data-Driven Language Technology. 2022 ACM Conference on Fairness, Accountability, and Transparency, 2206–2222. https://doi.org/10.1145/3531146.3534637
The recent emergence and adoption of Machine Learning technology, and specifically of Large Language Models, has drawn attention to the need for systematic and transparent management of language data. This work proposes an approach to global language data governance that attempts to organize data management amongst stakeholders, values, and rights. Our proposal is informed by prior work on distributed governance that accounts for human values and grounded by an international research collaboration that brings together researchers and practitioners from 60 countries. The framework we present is a multi-party international governance structure focused on language data, and incorporating technical and organizational tools needed to support its work.
Tafreshi, S., Sedoc, J., Rogers, A., Drozd, A., Rumshisky, A., & Akula, A. (Eds.). (2022). Proceedings of the Third Workshop on Insights from Negative Results in NLP. Dublin, Ireland: Association for Computational Linguistics.
Gella, S., He, H., Majumder, B. P., Can, B., Giunchiglia, E., Cahyawijaya, S., … Dyer, C. (Eds.). (2022). Proceedings of the 7th Workshop on Representation Learning for NLP. Dublin, Ireland: Association for Computational Linguistics.
Ray Choudhury, S., Rogers, A., & Augenstein, I. (2022). Machine Reading, Fast and Slow: When Do Models ’Understand’ Language? Proceedings of the 29th International Conference on Computational Linguistics, 78–93. Gyeongju, Republic of Korea.
Two of the most fundamental issues in Natural Language Understanding (NLU) at present are: (a) how it can established whether deep learning-based models score highly on NLU benchmarks for the ’right’ reasons; and (b) what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic ’skills’: coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be ’reading slowly’, and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the ’right’ information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.
Puccetti, G., Rogers, A., Drozd, A., & Dell’Orletta, F. (2022). Outliers Dimensions that Disrupt Transformers Are Driven by Frequency. Findings of EMNLP 2022. Association for Computational Linguistics.
Transformer-based language models are known to display anisotropic behavior: the token embeddings are not homogeneously spread in space, but rather accumulate along certain directions. A related recent finding is the outlier phenomenon: the parameters in the final element of Transformer layers that consistently have unusual magnitude in the same dimension across the model, and significantly degrade its performance if disabled. We replicate the evidence for the outlier phenomenon and we link it to the geometry of the embedding space. Our main finding is that in both BERT and RoBERTa the token frequency, known to contribute to anisotropicity, also contributes to the outlier phenomenon. In its turn, the outlier phenomenon contributes to the ’vertical’ self-attention pattern that enables the model to focus on the special tokens. We also find that, surprisingly, the outlier effect on the model performance varies by layer, and that variance is also related to the correlation between outlier magnitude and encoded token frequency.

2021

Sedoc, J., Rogers, A., Rumshisky, A., & Tafreshi, S. (Eds.). (2021). Proceedings of the Second Workshop on Insights from Negative Results in NLP. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
Bhargava, P., Drozd, A., & Rogers, A. (2021). Generalization in NLI: Ways (Not) To Go Beyond Simple Heuristics. Proceedings of the Second Workshop on Insights from Negative Results in NLP, 125–135. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
Much of recent progress in NLU was shown to be due to models’ learning dataset-specific heuristics. We conduct a case study of generalization in NLI (from MNLI to the adversarially constructed HANS dataset) in a range of BERT-based architectures (adapters, Siamese Transformers, HEX debiasing), as well as with subsampling the data and increasing the model size. We report 2 successful and 3 unsuccessful strategies, all providing insights into how Transformer-based models learn to generalize.
Rogers, A., Baldwin, T., & Leins, K. (2021). Just What Do You Think You’re Doing, Dave? A Checklist for Responsible Data Use in NLP. Findings of the Association for Computational Linguistics: EMNLP 2021, 4821–4833. Punta Cana, Dominican Republic: Association for Computational Linguistics.
A key part of the NLP ethics movement is responsible use of data, but exactly what that means or how it can be best achieved remain unclear. This position paper discusses the core legal and ethical principles for collection and sharing of textual data, and the tensions between them. We propose a potential checklist for responsible data (re-)use that could both standardise the peer review of conference submissions, as well as enable a more in-depth view of published research across the community. Our proposal aims to contribute to the development of a consistent standard for data (re-)use, embraced across NLP conferences.
Rogers, A., Calixto, I., Vulić, I., Saphra, N., Kassner, N., Camburu, O.-M., … Shwartz, V. (Eds.). (2021). Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). Online: Association for Computational Linguistics.
González, A. V., Rogers, A., & Søgaard, A. (2021). On the Interaction of Belief Bias and Explanations. Findings of ACL-IJCNLP 2021, 2930–2942. Online: ACL.
A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn’t clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans’ prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.
Kovaleva, O., Kulshreshtha, S., Rogers, A., & Rumshisky, A. (2021). BERT Busters: Outlier Dimensions That Disrupt Transformers. Findings of ACL-IJCNLP 2021, 3392–3405. Online: ACL.
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs (<0.0001% of model weights). In case of BERT and other pre-trained encoder Transformers, the affected component is the scaling factors and biases in the LayerNorm. The outliers are high-magnitude normalization parameters that emerge early in pre-training and show up consistently in the same dimensional position throughout the model. We show that disabling them significantly degrades both the MLM loss and the downstream task performance. This effect is observed across several BERT-family models and other popular pre-trained Transformer architectures, including BART, XLNet and ELECTRA; we also show a similar effect in GPT-2.
Rogers, A. (2021). Changing the World by Changing the Data. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2182–2194. Online: ACL.
NLP community is currently investing a lot more research and resources into development of deep learning models than training data. While we have made a lot of progress, it is now clear that our models learn all kinds of spurious patterns, social biases, and annotation artifacts. Algorithmic solutions have so far had limited success. An alternative that is being actively discussed is more careful design of datasets so as to deliver specific signals. This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot: curation already is and will be happening, and it is changing the world. The question is only how much thought we want to invest into that process.

2020

Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8, 842–866.
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue, and approaches to compression. We then outline directions for future research.
Rogers, A., Sedoc, J., & Rumshisky, A. (Eds.). (2020). Proceedings of the First Workshop on Insights from Negative Results in NLP. Online: Association for Computational Linguistics.
Rogers, A., & Augenstein, I. (2020). What Can We Do to Improve Peer Review in NLP? Findings of EMNLP, 1256–1262. Online: ACL.
Peer review is our best tool for judging the quality of conference submissions, but it is becoming increasingly spurious. We argue that a part of the problem is that the reviewers and area chairs face a poorly defined task forcing apples-to-oranges comparisons. There are several potential ways forward, but the key difficulty is creating the incentives and mechanisms for their consistent implementation in the NLP community.
Prasanna, S., Rogers, A., & Rumshisky, A. (2020). When BERT Plays the Lottery, All Tickets Are Winning. Proceedings of EMNLP, 3208–3229. Online: ACL.
Much of the recent success in NLP is due to the large Transformer-based models such as BERT (Devlin et al, 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable with that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in the pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibilities to learn what knowledge BERT actually uses at inference time.
Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 11.
The recent explosion in question answering research produced a wealth of both factoid RC and commonsense reasoning datasets. Combining them presents a different kind of task: not deciding simply whether information is present in the text, but also whether a confident guess could be made for the missing information. To that end, we present QuAIL, the first reading comprehension dataset (a) to combine textbased, world knowledge and unanswerable questions, and (b) to provide annotation that would enable precise diagnostics of the reasoning strategies by a given QA system. QuAIL contains 15K multi-choice questions for 800 texts in 4 domains (fiction, blogs, political news, and user story texts). Crucially, to solve QuAIL a system would need to handle both general and text-specific questions, impossible to answer from pretraining data. We show that the new benchmark poses substantial challenges to the current state-of-the-art systems, with a 30% drop in accuracy compared to the most similar existing dataset.

2019

Romanov, A., Rumshisky, A., Rogers, A., & Donahue, D. (2019). Adversarial Decomposition of Text Representation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 815–825.
In this paper, we present a method for adversarial decomposition of text representation. This method can be used to decompose a representation of an input sentence into several independent vectors, each of them responsible for a specific aspect of the input sentence. We evaluate the proposed method on two case studies: the conversion between different social registers and diachronic language change. We show that the proposed method is capable of fine-grained controlled change of these aspects of the input sentence. It is also learning a continuous (rather than categorical) representation of the style of the sentence, which is more linguistically realistic. The model uses adversarial-motivational training and includes a special motivational loss, which acts opposite to the discriminator and encourages a better decomposition. Furthermore, we evaluate the obtained meaning embeddings on a downstream task of paraphrase detection and show that they significantly outperform the embeddings of a regular autoencoder.
Kovaleva, O., Romanov, A., Rogers, A., & Rumshisky, A. (2019). Revealing the Dark Secrets of BERT. Proceedings of EMNLP-IJCNLP), 4356–4365. https://doi.org/10.18653/v1/D19-1445
BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose the methodology and carry out a qualitative and quantitative analysis of the information encoded by the individual BERT’s heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating the overall model overparametrization. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.
Rogers, A., Drozd, A., Rumshisky, A., & Goldberg, Y. (2019). Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.
Rogers, A., Kovaleva, O., & Rumshisky, A. (2019). Calls to Action on Social Media: Potential for Censorship and Social Impact. EMNLP-IJCNLP 2019 Second Workshop on Natural Language Processing for Internet Freedom.
Calls to action on social media are known to be effective means of mobilization in social movements, and a frequent target of censorship. We investigate the possibility of their automatic detection and their potential for predicting real-world protest events, on historical data of Bolotnaya protests in Russia (2011-2013). We find that political calls to action can be annotated and detected with relatively high accuracy, and that in our sample their volume has a moderate positive correlation with rally attendance.

2018

Karpinska, M., Li, B., Rogers, A., & Drozd, A. (2018). Subcharacter Information in Japanese Embeddings: When Is It Worth It? Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 28–37. Melbourne, Australia: ACL.
Languages with logographic writing systems present a difficulty for traditional character-level models. Leveraging the subcharacter information was recently shown to be beneficial for a number of intrinsic and extrinsic tasks in Chinese. We examine whether the same strategies could be applied for Japanese, and contribute a new analogy dataset for this language.
Rogers, A., Hosur Ananthakrishna, S., & Rumshisky, A. (2018). What’s in Your Embedding, And How It Predicts Task Performance. Proceedings of the 27th International Conference on Computational Linguistics, 2690–2703. Santa Fe, New Mexico, USA, August 20-26, 2018: ACL.
Attempts to find a single technique for general-purpose intrinsic evaluation of word embeddings have so far not been successful. We present a new approach based on scaled-up qualitative analysis of word vector neighborhoods that quantifies interpretable characteristics of a given model (e.g. its preference for synonyms or shared morphological forms as nearest neighbors). We analyze 21 such factors and show how they correlate with performance on 14 extrinsic and intrinsic task datasets (and also explain the lack of correlation between some of them). Our approach enables multi-faceted evaluation, parameter search, and generally – a more principled, hypothesis-driven approach to development of distributional semantic representations.
Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., & Gribov, A. (2018). RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of the 27th International Conference on Computational Linguistics, 755–763. Santa Fe, New Mexico, USA: ACL.
This paper presents RuSentiment, a new dataset for sentiment analysis of social media posts in Russian, and a new set of comprehensive annotation guidelines that are extensible to other languages. RuSentiment is currently the largest in its class for Russian, with 31,185 posts annotated with Fleiss’ kappa of 0.58 (3 annotations per post). To diversify the dataset, 6,950 posts were pre-selected with an active learning-style strategy. We report baseline classification results, and we also release the best-performing embeddings trained on 3.2B tokens of Russian VKontakte posts.

2017

Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., & Du, X. (2017). Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2411–2421. Copenhagen, Denmark, September 7–11, 2017.
The number of word embedding models is growing every year. Most of them are based on the co-occurrence information of words and their contexts. However, it is still an open question what is the best definition of context. We provide a systematical investigation of 4 different syntactic context types and context representations for learning word embeddings. Comprehensive experiments are conducted to evaluate their effectiveness on 6 extrinsic and intrinsic tasks. We hope that this paper, along with the published code, would be helpful for choosing the best context type and representation for a given task.
Rogers, A. (2017). Multilingual Computational Lexicography: Frame Semantics Meets Distributional Semantics (Ph.D. Dissertation, University of Tokyo). University of Tokyo, Tokyo.
Rogers, A., Drozd, A., & Li, B. (2017). The (Too Many) Problems of Analogical Reasoning with Word Vectors. Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (* SEM 2017), 135–148.
This paper explores the possibilities of analogical reasoning with vector space models. Given two pairs of words with the same relation (e.g. man:woman :: king:queen), it was proposed that the offset between one pair of the corresponding word vectors can be used to identify the unknown member of the other pair (king - man + woman = queen). We argue against such “linguistic regularities” as a model for linguistic relations in vector space models and as a benchmark, and we show that the vector offset (as well as two other, better-performing methods) suffers from dependence on vector similarity.

2016

(before 2017 my last name was “Gladkova”)

Drozd, A., Gladkova, A., & Matsuoka, S. (2016). Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 3519–3530. Osaka, Japan, December 11-17.
Gladkova, A., & Drozd, A. (2016). Intrinsic Evaluations of Word Embeddings: What Can We Do Better? Proceedings of The 1st Workshop on Evaluating Vector Space Representations for NLP, 36–42. https://doi.org/10.18653/v1/W16-2507
Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-Based Detection of Morphological and Semantic Relations with Word Embeddings: What Works and What Doesn’t. Proceedings of the NAACL-HLT SRW, 47–54. https://doi.org/10.18653/v1/N16-2002
Santus, E., Gladkova, A., Evert, S., & Lenci, A. (2016). The CogALex-V Shared Task on the Corpus-Based Identification of Semantic Relations. Osaka, Japan, December 11-17: ACL.

2015

Drozd, A., Gladkova, A., & Matsuoka, S. (2015). Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora. Proceedings of 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), 61–68. https://doi.org/10.1109/DSDIS.2015.30
This paper presents a case study of discovering and classifying verbs in large web-corpora. Many tasks in natural language processing require corpora containing billions of words, and with such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complimentary stage) to conventional map-reduce based approach, this kernel achieves an order of magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion words untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
Drozd, A., Gladkova, A., & Matsuoka, S. (2015). Python, Performance, and Natural Language Processing. Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing, 1:1–1:10. https://doi.org/10.1145/2835857.2835858
We present a case study of Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in the area of natural language processing are typically solved in many steps which require transformation of the data to vastly different formats (in our case, raw text to sparse matrices to dense vectors). A Python implementation for each of these steps would require a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step for our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow we develop a prototype distributed implementation of co-occurrence extraction module using IPython.parallel cluster.