NAACL Ethics Review Process Report-Back

In this blog post, we give an experience report and overview of the process we created and followed as chairs of the NAACL 2021 Ethics Committee. In doing so, we aim on the one hand to provide transparency about this process to the NAACL community and on the other hand to support similar work for later conferences and reviewing systems (including ACL Rolling Review).

This blog post reflects the opinions of the Ethics Committee chairs, and not necessarily those of the rest of the organizing committee. For the PCs’ point of view, see the blog post by the PCs.

Long-Term Vision

In the long term, we hope to see our community routinely considering societal impact (or “ethics”) in the course of review, alongside other factors such as the clarity of definition of hypotheses, the extent to which the reported experiments effectively test the hypotheses, grounding and relationship to existing literature, validity of described methods, clarity of presentation, and likelihood of impact within the scientific community. Just as with the other dimensions of review, the dimension of ethics/societal impact is broad, including such considerations as impact on the environment, on workers involved in the research (e.g. as crowdworkers), on data subjects, on potential users of technology developed on the basis of the research, on other people impacted if such software is used, and potentially also on societal systems and structures. Of course, no researcher is in a position to perfectly predict future impacts; thus the emerging best practices involve clear description of known information (energy usage, compensation of crowdworkers, privacy protections for data subjects, etc.) and thoughtful discussion of potential risks of application of the results.

Goals for NAACL 2021

While there is a long history of research communities engaging with questions of ethics (see e.g. the World Medical Association’s Geneva Declaration of 1948 and Helsinki Declaration of 1964, or the ACM Code of Ethics, which debuted in 1973), these issues have only recently garnered widespread attention within the NLP community (e.g. Journée d’étude ATALA “éthique et TAL” (2014), ETeRNAL (workshop on ethics and NLP) at TALN 2015 and 2020, Hovy and Spruit 2016, and the ACL workshops on Ethics and NLP (2017, 2018)) and in NLP education (e.g. Bender, Hovy and Schofield 2020). The ACL adopted the ACM’s Code of Ethics in March 2020, and the first ACL conference to explicitly point to the Code of Ethics in its call for papers (and reviewing processes) was EMNLP 2020.

Thus our goals for NAACL 2021 were in the first instance pedagogical, i.e. to provide guidance to authors and reviewers with an eye towards raising the overall level of expertise in these matters in the field. At the same time, we aimed to raise the overall quality of NAACL 2021 papers along this dimension (ethics/societal impact), while understanding that it would not be possible either to predict and prevent every possible negative impact or to have perfect recall on any particular drawback (e.g. unfair compensation of crowdworkers).

To meet the goals laid out here, we constituted a committee of reviewers from among those with research expertise in societal impact in either NLP or adjacent fields (the “Ethics Committee”), provided guidance to authors and reviewers, had the Ethics Committee review papers flagged by primary reviewers, and set up a system of shepherding or conditional acceptance for a small handful of papers. Each of these measures is described in more detail below.

Constituting the Committee

In recruiting researchers to join the NAACL 2021 Ethics Committee, we put a particular emphasis on diversity. On the one hand, we wanted to ensure that potential societal impacts of work published at NAACL were considered from multiple different cultural perspectives. On the other hand, we also wanted to make sure that we weren’t treating this additional service work as solely the job of minoritized people in our field. Finally, we wanted to ensure that no member of our committee would review papers from their own country (according to their current professional affiliations), to minimize the chances of adverse impacts on committee members. We began with people we already knew directly and asked for further recommendations, especially from world regions from which we had not yet managed to recruit. (Everyone is busy; in many cases our original contacts couldn’t join the committee but were able to send us additional names.) Because ethics/societal impact is a relatively new area within NLP but growing in adjacent fields, we called on researchers who look at ethics/societal impact and AI a bit more broadly. Our final committee included 38 reviewers (plus the two chairs), representing 22 countries: 15 members affiliated with institutions in Europe, 10 in the US and Canada, 6 in Asia, 5 in Latin America, and 4 in Africa. The full list of the Ethics Committee for NAACL 2021 can be found here.

Providing Guidance to Authors and Reviewers

The NAACL Program Chairs allowed space (after the 8th/4th page in the main track, and the 6th page in the Industry and demo tracks) for an optional “ethical considerations” section. To provide guidance to authors on how to write such a section (or alternatively what to include elsewhere in the paper), we posted the Ethics FAQ one month before the submission deadline. This document includes specific guidance for papers introducing new datasets, papers concerning NLP applications, papers concerning identity characteristics (e.g. gender), and papers reporting on computationally intensive experiments.

We strongly encourage authors to consider ethics issues at the outset of a research project, rather than at the point of writing it up (see also Sim et al 2021). For example, if researchers have conducted an experiment based on data that was collected in a way that violates the data subjects’ privacy, there’s nothing that can be written into an ‘ethical considerations’ section that can remedy this. In other cases, however, the underlying research is sound, but there are important ethical choices to be made in the writing (e.g. around how identity categories are discussed) and/or it is important for raising the overall level of ethical practice in the field that the writing be transparent about decisions such as compensation of crowdworkers.

Taking the long-term view, we hope that by outlining these issues for NAACL 2021, we are positioning researchers to take them into consideration for future research projects as well. We hope that as the field progresses, best practices will emerge around these questions and become widely adopted. In addition, we note that the list of considerations presented in our FAQ is surely not comprehensive. We hope that future work will build on this and refine it, as the Ethics Advisory Committee for ACL-IJCNLP 2021 has already done. As language technology becomes (even) more widely used and our collective understanding of its impacts deepens, surely additional kinds of considerations will continue to emerge.

Review Process

The Ethics FAQ was also intended as a resource for the primary (scientific) reviewers of the main, industry and demo tracks at NAACL. We asked the primary reviewers to consider the ethics review questions and then indicate, for each paper, whether it should be further reviewed by the ethics committee, and if so, to briefly state why. The ethics review questions were not intended to be exhaustive; we also asked reviewers to flag papers based on additional ethical concerns not listed there, should they arise.

Through this process 143 (of 1797, or 8%) main track submissions, 9 (of 128, or 7%) industry track submissions, and 4 (of 42, or 9.5%) demo track submissions were sent to the Ethics Committee for further review. The Ethics Committee chairs did a first pass, removing any papers that appeared to have been flagged spuriously. (In most cases, these were papers where the “ethics review” flag was checked, but no written justification was supplied. We communicated with the reviewers and asked them to either provide a written justification or uncheck the flag. In a small number of cases, this first pass showed that there was no need for further ethics review.) The result of this process was a total of 113 papers that we then allocated to the Ethics Committee for review, ensuring that each paper was assigned two reviewers and that no reviewer was reviewing a paper authored by people from their country of affiliation.

In terms of the timeline, the ethics review was set up to take place in parallel with the work of the ACs and SACs, with our recommendations submitted to the track chairs about a week before their decision deadline. In principle, we anticipated additional papers being recommended to us by ACs or SACs, but in practice only a couple of additional papers were noted at this point. We think this was in part because of how we implemented the system in START, and this is definitely something that could be improved. (Though we are hopeful that the ACL Rolling Review system will provide a more systematic way to incorporate ethics review, and that eventually, it will just become part of the primary review process.)

Since NAACL 2021’s ethics review process started after the primary review process, we could have in principle decided only to provide ethics review for papers that otherwise received high scores. However, in keeping with our pedagogical goals, and given that the number of flagged papers was low enough (and our committee large enough) that it was feasible, we had all such papers reviewed. This way, even people whose work was not accepted to NAACL benefitted from guidance on how to address the ethical considerations raised by their work.

The ethics reviewers were asked to answer the following questions:

  1. Do you find any ethical concerns with the research described in this paper, either in terms of the research process (data collection, human subjects treatment, etc) or in terms of the potential impact of the technology developed? If so, please describe.
  2. Do you find any ethical concerns with the way the research is described in this paper (e.g. language essentializing identity categories)? If so, please describe.
  3. What recommendations do you have for the authors to improve this paper (either for publication at NAACL 2021 or in future work) regarding the design and/or presentation of the research, including the ethical considerations section?

In addition, in the “confidential comments to the Area Chairs/PC Chairs” section, reviewers were asked to enter a recommendation from the following list, with the note that if they chose “conditional accept”, it should be clear from their answers above what conditions needed to be met:

  • acceptable as is (though you might have suggestions)
  • conditional accept (as far as ethical considerations are concerned; in this case, please specify what must be addressed)
  • reject on ethical grounds

A slightly awkward facet of this process was that we were unable to create a separate review form in START for the ethics committee, and so they had to use the standard form (and put in “default” answers to questions that were required but irrelevant to ethics review). On the other hand, a key benefit of using START for the reviews is that it meant they would be made available to the authors without further effort.

Once the ethics reviews were complete, we asked the ethics reviewers to discuss with each other, in cases of disagreement, as is standard practice for all reviews. Finally, we (as Ethics Committee chairs) made recommendations to the track chairs (main, industry, and demo) on the basis of the ethics reviews. At this point, because there were only two of us, we only made recommendations for papers not already listed as “reject” or “maybe reject”. Out of 30 papers across the three tracks in this category, we recommended “accept as is” for 11, “conditional accept” for 15, and “reject on ethical grounds” for 4. The final decisions rested with the track chairs, and we do not know which (if any) of these papers might not have been accepted for independent reasons.

Paper Shepherding

The final step in the process was shepherding papers that were put into the “conditional accept” category. For the main track, this was handled by the PC chairs themselves; for the industry and demo tracks, by the Ethics Committee chairs. We offered to look over revised versions of papers ahead of the camera-ready deadline and provide feedback to authors, noting that no revisions would be accepted past the camera-ready deadline and that any conditionally accepted paper that had not met its conditions by that time would not be included in the program.


Reflections

We conclude with some reflections on what we’ve learned from this process. Our first take-away is that time, communication and collaboration really matter in ethical practice, in many ways. Working out how to integrate review by the Ethics Committee into the overall NAACL 2021 review processes (for the three separate tracks) required careful management of timelines, negotiated in collaboration with the PC and the industry and demo track chairs. The overall process began for us in September 2020 and only wrapped up in April 2021, and if anything, it would have been good to begin sooner, so that we could have published our guidance to authors earlier. For the papers that ended up in the “conditional accept” category, the process showed the value of a review process that allows time for back-and-forth communication between authors and reviewers.

On a larger scale, we see that the considerations raised via the ethics review process are very much aligned with the strategies and goals of slow science (see also Kan 2018), emphasizing thoughtful design of experiments, thorough investigation of data (which is thus collected at manageable scales), and broad contemplation of how the study at hand fits into the scholarly and social landscape. Finally, we note that integrating ethical considerations into the research process is ultimately about valuing our own time as researchers and allocating it, as a limited resource, to the projects that are most beneficial (see also Bender 2020). This is all the more true as we are not only researchers but also citizens, who will sooner or later be faced with the issues our work generates.

Our second take-away is that it is important not to treat ethics review as an all-or-nothing proposition. It doesn’t make sense to think of papers as wholly “ethical” or wholly “unethical”. Likewise, it would be impossible (or at least extremely impractical) to set up a process that would catch all papers that might be deemed “unethical”. Furthermore, setting up the process as a punitive one strikes us as counter-productive. Rather, we find it much more productive to think about this work as a process of improvement: How can we, as a field, move towards better, more ethical research practice? How can we, as a field, move towards better writing about societal impacts and ethical considerations of our work? How can we, as a field, move towards more effective review around ethical considerations?

Concretely, we find that the field is at different points of the process in learning about different kinds of ethical considerations and developing associated best practices:

Fair compensation of crowdworkers

It seems fairly settled that fair compensation of crowdworkers (and any others doing annotation work) is important (see, e.g., Fort et al 2011, Hara et al 2018), and that research employing crowdworkers should always be reported together with information about compensation: specifically, how a fair rate of pay was determined and how the researchers ensured that crowdworkers were compensated fairly.

Scraped data and terms of service

It is also clear that the terms of service of websites should be taken into consideration before scraping data. For example, the Twitter terms of service currently state “Never derive or infer, or store derived or inferred, information about” a Twitter user’s sensitive information, including health information and other categories. Similarly, as of October 2020, Reddit’s terms of service include the statement “We conditionally grant permission to crawl the Services in accordance with the parameters set forth in our robots.txt file, but scraping the Services without Reddit’s prior consent is prohibited.” A company’s terms of service aren’t necessarily put in place to protect the people whose data is potentially being collected (rather, they are more likely to be ultimately designed to protect the company). Nonetheless, they do connect with both considerations of privacy for data subjects and legal liability for dataset developers, distributors and users, and should be considered carefully. It seems clear to us that it is time to move away from methods that rely on scraping ever larger quantities of data (see, e.g. Bender, Gebru et al 2021), but the field has not yet settled on best practices for opt-in or other ethical data collection practices. In our recommendations to track chairs this year, we drew a distinction between reusing existing scraped datasets (without redistributing them) and scraping new ones (with the intent to distribute them). In the future, we hope that further work on the intersection of principles of privacy, the legal import of terms of service, and the rights of the various people involved (data subjects, copyright holders, etc) will produce more clear-cut guidelines in this area.

Not Essentializing Identity Characteristics

There are many research questions that might be asked by looking at linguistic data (e.g. social media activity) and correlating either linguistic facts or information about the content of the linguistic data with demographic or other identity characteristics (e.g. gender, race, etc.). There are some well-known pitfalls in this area that should be avoided, all of which center on treating identity characteristics as something that can be reliably ascribed to others (e.g. based on names), as essential properties of humans, or as drawn from fixed sets (e.g. binary gender). For an example of best practices around avoiding these pitfalls, see Larson 2017.

Environmental Impact

There is a growing literature considering the environmental impact of compute-intensive approaches to NLP and other fields that involve machine learning (Strubell et al 2019, Henderson et al 2020, Schwartz et al 2020). The emerging best practices include calculating how much energy a given experiment or methodology requires and publishing that information, so that researchers looking to build on existing work can make informed choices. Note that this is important even if the experiments are run with clean energy only, as the people adopting the methodology might not have similar access to clean energy.
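The kind of back-of-the-envelope energy accounting described above can be sketched as follows. This is an illustrative sketch only: the function name and all default numbers (average power draw, PUE overhead, grid carbon intensity) are assumptions for demonstration, not values from any of the cited papers or from any particular measurement.

```python
# Hypothetical back-of-the-envelope estimate of training-run energy use and
# emissions, in the spirit of the reporting practices discussed above.
# All constants below are illustrative assumptions, not measured values.

def training_footprint(avg_power_watts, hours, pue=1.58, kg_co2e_per_kwh=0.4):
    """Estimate energy (kWh) and emissions (kg CO2e) for one training run.

    avg_power_watts: assumed average combined draw of the hardware in use
    pue: assumed data-center power usage effectiveness (overhead multiplier)
    kg_co2e_per_kwh: assumed carbon intensity of the local electricity grid
    """
    energy_kwh = (avg_power_watts / 1000) * hours * pue
    return energy_kwh, energy_kwh * kg_co2e_per_kwh

# e.g. 8 accelerators at an assumed ~300 W each, running for 72 hours:
energy, co2e = training_footprint(avg_power_watts=8 * 300, hours=72)
# roughly 273 kWh and 109 kg CO2e under these illustrative assumptions
```

Publishing the inputs to such an estimate (hardware, runtime, energy mix), and not just the final number, is what lets other researchers make informed choices when deciding whether to build on the work.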


Though there are some emerging best practices around specific issues, it is also the case that ethics/social impact review will necessarily always be open-ended, because the potential impacts of our work are themselves open-ended. It is for this reason that we find it very important for the field as a whole to gain experience over time in reasoning about potential societal impacts, both in our roles as researchers and in our roles as reviewers.