Total pubs: 21

2020
  1. Prasanna, S., Rogers, A., & Rumshisky, A. (2020). When BERT Plays the Lottery, All Tickets Are Winning. ArXiv:2005.00561 [Cs].
    Much of the recent success in NLP is due to large Transformer-based models such as BERT (Devlin et al., 2019). However, these models have been shown to be reducible to a smaller number of self-attention heads and layers. We consider this phenomenon from the perspective of the lottery ticket hypothesis. For fine-tuned BERT, we show that (a) it is possible to find a subnetwork of elements that achieves performance comparable to that of the full model, and (b) similarly-sized subnetworks sampled from the rest of the model perform worse. However, the "bad" subnetworks can be fine-tuned separately to achieve only slightly worse performance than the "good" ones, indicating that most weights in pre-trained BERT are potentially useful. We also show that the "good" subnetworks vary considerably across GLUE tasks, opening up the possibility of learning what knowledge BERT actually uses at inference time.
  2. Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 11.
    The recent explosion in question answering research has produced a wealth of both factoid RC and commonsense reasoning datasets. Combining them presents a different kind of task: not simply deciding whether information is present in the text, but also whether a confident guess could be made for the missing information. To that end, we present QuAIL, the first reading comprehension dataset (a) to combine text-based, world-knowledge, and unanswerable questions, and (b) to provide annotation that would enable precise diagnostics of the reasoning strategies used by a given QA system. QuAIL contains 15K multi-choice questions for 800 texts in 4 domains (fiction, blogs, political news, and user stories). Crucially, to solve QuAIL a system would need to handle both general and text-specific questions that are impossible to answer from pretraining data. We show that the new benchmark poses substantial challenges to the current state-of-the-art systems, with a 30% drop in accuracy compared to the most similar existing dataset.
  3. Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A Primer in BERTology: What We Know about How BERT Works. (Accepted to TACL).
    Transformer-based models are now widely used in NLP, but we still understand little about their inner workings. This paper describes what is known to date about the famous BERT model (Devlin et al. 2019), synthesizing over 40 analysis studies. We also provide an overview of the proposed modifications to the model and its training regime. We then outline directions for further research.
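The subnetwork-selection idea behind the lottery-ticket paper above can be sketched on a toy model. Everything here is an illustrative assumption, not the paper's actual setup: a plain linear model stands in for BERT, and weight magnitude stands in for the pruning criterion; the point is only the "good vs. random subnetwork of the same size" comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear "model": y = x @ w, with weights standing in for a trained network.
w = rng.normal(size=100)
x = rng.normal(size=(500, 100))
y = x @ w  # targets produced by the full model

def prune_mask(weights, keep_frac):
    """Binary mask keeping the top keep_frac of weights by magnitude."""
    k = int(len(weights) * keep_frac)
    idx = np.argsort(np.abs(weights))[-k:]
    mask = np.zeros_like(weights)
    mask[idx] = 1.0
    return mask

good = prune_mask(w, 0.5)        # magnitude-selected subnetwork
rand = rng.permutation(good)     # random subnetwork of the same size

def mse(mask):
    """How well the masked subnetwork reproduces the full model's outputs."""
    return float(np.mean((x @ (w * mask) - y) ** 2))

# The magnitude-selected subnetwork approximates the full model better
# than a randomly sampled subnetwork with the same number of weights.
print(mse(good), mse(rand))
```

On this toy, the "good" mask always wins because the weights it discards are exactly the ones that contribute least to the output.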

2019
  1. Kovaleva, O., Romanov, A., Rogers, A., & Rumshisky, A. (2019). Revealing the Dark Secrets of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4356–4365.
    BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to their success. In the current work, we focus on the interpretation of self-attention, which is one of the fundamental underlying components of BERT. Using a subset of GLUE tasks and a set of handcrafted features-of-interest, we propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by individual BERT heads. Our findings suggest that there is a limited set of attention patterns that are repeated across different heads, indicating that the overall model is overparametrized. While different heads consistently use the same attention patterns, they have varying impact on performance across different tasks. We show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.
  2. Rogers, A., Drozd, A., Rumshisky, A., & Goldberg, Y. (2019). Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP.
  3. Rogers, A., Kovaleva, O., & Rumshisky, A. (2019). Calls to Action on Social Media: Potential for Censorship and Social Impact. EMNLP-IJCNLP 2019 Second Workshop on Natural Language Processing for Internet Freedom.
  4. Rogers, A., Smelkov, G., & Rumshisky, A. (2019). NarrativeTime: Dense High-Speed Temporal Annotation on a Timeline. ArXiv:1908.11443 [Cs].
    We present NarrativeTime, a new timeline-based annotation scheme for temporal order of events in text, and a new densely annotated fiction corpus comparable to TimeBank-Dense. NarrativeTime is considerably faster than schemes based on event pairs such as TimeML, and it produces more temporal links between events than TimeBank-Dense, while maintaining comparable agreement on temporal links. This is achieved through new strategies for encoding vagueness in temporal relations and an annotation workflow that takes into account the annotators’ chunking and commonsense reasoning strategies. NarrativeTime comes with new specialized web-based tools for annotation and adjudication.
  5. Romanov, A., Rumshisky, A., Rogers, A., & Donahue, D. (2019). Adversarial Decomposition of Text Representation. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 815–825.
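The "disabling attention in certain heads" experiment from the EMNLP-IJCNLP paper above amounts to zeroing a head's output before concatenation. A minimal sketch with toy dimensions and random weights (this is not the actual BERT implementation, just the masking mechanics):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(n_heads, d_model, d_head))
Wk = rng.normal(size=(n_heads, d_model, d_head))
Wv = rng.normal(size=(n_heads, d_model, d_head))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(head_mask):
    """Concatenate per-head outputs, multiplying each head's output by 0 or 1."""
    outs = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = softmax(q @ k.T / np.sqrt(d_head))
        outs.append(head_mask[h] * (scores @ v))
    return np.concatenate(outs, axis=-1)

full = attention([1, 1])
pruned = attention([1, 0])  # head 1 disabled; head 0 untouched
```

Zeroing the mask entry wipes out that head's slice of the concatenated output while leaving the other heads exactly as they were, which is what makes per-head ablation a clean diagnostic.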

2018
  1. Karpinska, M., Li, B., Rogers, A., & Drozd, A. (2018). Subcharacter Information in Japanese Embeddings: When Is It Worth It? Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 28–37. Melbourne, Australia: Association for Computational Linguistics.
  2. Rogers, A., Hosur Ananthakrishna, S., & Rumshisky, A. (2018). What’s in Your Embedding, And How It Predicts Task Performance. Proceedings of the 27th International Conference on Computational Linguistics, 2690–2703. Santa Fe, New Mexico, USA, August 20-26, 2018: Association for Computational Linguistics.
  3. Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., & Gribov, A. (2018). RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of the 27th International Conference on Computational Linguistics, 755–763. Santa Fe, New Mexico, USA: Association for Computational Linguistics.

2017
  1. Li, B., Liu, T., Zhao, Z., Tang, B., Drozd, A., Rogers, A., & Du, X. (2017). Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2411–2421. Copenhagen, Denmark, September 7–11, 2017.
  2. Rogers, A. (2017). Multilingual Computational Lexicography: Frame Semantics Meets Distributional Semantics (Ph.D. Dissertation). University of Tokyo, Tokyo.
  3. Rogers, A., Drozd, A., & Li, B. (2017). The (Too Many) Problems of Analogical Reasoning with Word Vectors. Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), 135–148.

2016
(before 2017 my last name was “Gladkova”)

  1. Drozd, A., Gladkova, A., & Matsuoka, S. (2016). Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 3519–3530. Osaka, Japan, December 11-17.
  2. Gladkova, A., & Drozd, A. (2016). Intrinsic Evaluations of Word Embeddings: What Can We Do Better? Proceedings of The 1st Workshop on Evaluating Vector Space Representations for NLP, 36–42.
  3. Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-Based Detection of Morphological and Semantic Relations with Word Embeddings: What Works and What Doesn’t. Proceedings of the NAACL-HLT SRW, 47–54.
  4. Santus, E., Gladkova, A., Evert, S., & Lenci, A. (2016). The CogALex-V Shared Task on the Corpus-Based Identification of Semantic Relations. Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), 69–79. Osaka, Japan, December 11-17: ACL.
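The "King - Man + Woman = Queen" analogy method studied in the COLING 2016 paper above is the vector-offset (3CosAdd) approach: take the vector b - a + c and return its nearest neighbour by cosine similarity, excluding the query words. A sketch on a tiny hand-made vocabulary (the vectors below are invented for illustration; real experiments use trained embeddings):

```python
import numpy as np

# Hypothetical toy embeddings, chosen so the offset works out exactly.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    """3CosAdd: the word whose vector is most cosine-similar to b - a + c,
    excluding the three query words themselves."""
    target = vocab[b] - vocab[a] + vocab[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("man", "king", "woman"))  # -> queen
```

Note the exclusion of the query words from the candidate set: as the *SEM 2017 paper above discusses, without it the method often just returns one of the inputs.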

2015
  1. Drozd, A., Gladkova, A., & Matsuoka, S. (2015). Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora. Proceedings of 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS), 61–68.
    This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes of data co-occurrence extraction becomes one of the performance bottlenecks in the vector space models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
  2. Drozd, A., Gladkova, A., & Matsuoka, S. (2015). Python, Performance, and Natural Language Processing. Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing, 1:1–1:10.
    We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in the area of natural language processing are typically solved in many steps which require transformation of the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors). Each of these steps calls for a different Python solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
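What the co-occurrence extraction kernel in the DSDIS paper above computes can be sketched with a plain dictionary rather than the paper's ternary-tree structure (the ternary trees are an optimization of exactly this counting, not a different computation): symmetric word-pair counts within a fixed context window.

```python
from collections import Counter

def cooccurrences(tokens, window=2):
    """Count symmetric co-occurrences of token pairs within a fixed window.
    A plain Counter stands in for the paper's memory-efficient ternary trees."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            counts[(w, tokens[j])] += 1
            counts[(tokens[j], w)] += 1
    return counts

text = "the cat sat on the mat".split()
counts = cooccurrences(text, window=2)
print(counts[("cat", "the")])  # same value as counts[("the", "cat")]
```

At billion-word scale the dictionary of pair keys is exactly what blows up in memory, which is the bottleneck the ternary-tree kernel is designed to shrink.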
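The three-stage workflow described in the PyHPC abstract above (raw text, then sparse co-occurrence matrices, then dense vectors) can be sketched end to end on a toy corpus. The densification step here uses truncated SVD, one common choice for this pipeline, not necessarily the paper's exact method, and a dense NumPy array stands in for a true sparse matrix:

```python
import numpy as np
from collections import Counter

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

# Stage 1: raw text -> co-occurrence counts (symmetric window of 2).
counts = Counter()
for i, w in enumerate(tokens):
    for j in range(max(0, i - 2), i):
        counts[(idx[w], idx[tokens[j]])] += 1
        counts[(idx[tokens[j]], idx[w])] += 1

# Stage 2: counts -> word-by-word matrix (a scipy.sparse matrix in practice).
M = np.zeros((len(vocab), len(vocab)))
for (r, c), n in counts.items():
    M[r, c] = n

# Stage 3: matrix -> low-dimensional dense vectors via truncated SVD.
U, S, _ = np.linalg.svd(M)
dim = 3
vectors = U[:, :dim] * S[:dim]  # one dense row vector per vocabulary word
```

Each stage has a different natural tool (streaming counters, sparse matrices, dense linear algebra), which is the "different solution per step" problem the paper is about.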