Q&A with Amy Winecoff, Siegel Research Fellow at the Center for Democracy & Technology

Amy Winecoff is currently the AI Governance Fellow at the Center for Democracy & Technology (CDT), where she focuses on the governance of AI systems. Previously, Amy was a fellow at Princeton’s Center for Information Technology Policy (CITP), where she examined how cultural, organizational, and institutional factors shape emerging AI and blockchain companies. She also has hands-on experience as a data scientist in the tech industry, having built and deployed recommender systems for e-commerce.

Tell us about yourself and your current role. What brought you to this point in your career? 

Amy Winecoff: I’m currently a fellow in the AI Governance Lab at the Center for Democracy and Technology (CDT), a nonpartisan, nonprofit organization that advocates for civil rights and civil liberties in the digital era. We focus on many issue areas, from free expression to privacy to surveillance to civic technology. 

While CDT has been around since the beginning of the internet, the AI Governance Lab is just getting started. We’re hoping to navigate between the conversations happening in the policy sphere, in academia, in communities of practice, and in industry. Our hope is to recommend practices that are evidence-based and that also wrangle with the practical constraints of being a practitioner within a technology company.

I came to this role via a very indirect path. I got a degree in art, and then a PhD in psychology and neuroscience, where I studied the brain mechanisms responsible for motivation, emotion, and decision-making, and how those mechanisms go awry in clinical conditions like eating disorders. While the substance is different from my current work, that research required complex statistical methods and forced me to deal with big, messy data sets with very poor signal-to-noise ratios.

When I decided not to stay in academia, I still wanted to use my statistical analysis skillset, so I took some jobs in industry. It quickly became apparent that the teams I was working on needed somebody who knew machine learning (ML). While I had never studied ML, it was a shorter jump for me than for a lot of other people, so it was the jump I made. That’s how I came to work as a data scientist building machine learning models in industry.

What are some of the biggest questions that have come to guide your work – both at the Center for Democracy and Technology and beyond?

One of the things that became apparent to me while working in ML was that people with social science expertise were not well integrated into data science. On one level that makes sense (data science is a technical profession), but at the same time, ML practitioners drive a lot of what the technology becomes. Many people working in these fields do care about things like building user-centered algorithms and preventing social harms, but they often don’t have the disciplinary background to find the best approaches. That was an “oh wow” moment for me – I could act as an interpreter between these two fields.

Another thing worth noting from my industry days is that when we ran up against ethical quandaries, they were more often organizational problems than technical ones. How could we solve technical problems if we didn’t have the appropriate organizational infrastructure in place? How do individual team members advocate for some version of risk mitigation in the product development cycle if there’s no organizational process for doing so?

On the flip side, as a practitioner, I was often frustrated with the solutions that academia proposed for these ethical quandaries. Those solutions draw on deep theoretical expertise, but they typically lack tactical awareness of the industry contexts where they would have the most impact. They weren’t things we could use, or could incorporate quickly within a development cycle, without spending a lot of time reading or reimplementing math from a variety of academic papers. How do we think through the operational, institutional, and organizational constraints that technologists face in ways that inform solutions that are practically feasible? What are the competing incentives and pressures people are facing? What are the value systems of the culture they’re steeped in that may either enable or block them from wrangling with these ethical constraints? This became the focus of the research I did at Princeton CITP.

My current role at CDT marries these two worlds. It was an opportunity to move one step closer to practice and offer the kind of practical guidance I wanted when I was in industry. It was really important to me not just to keep that thinking inside an academic sphere, but to begin to translate it more meaningfully to the industry setting.

You recently had an opinion piece featured in Tech Policy Press drawing a comparison between Enron’s demise – driven by corporate greed and a lack of meaningful investment in risk management – and the current AI landscape. Can you tell me a little bit more about that work and how it came to be?

Amy Winecoff: While working at Princeton, I got really interested in stories about ethical failures of businesses. Where are there examples of how institutions or organizations were shaped in problematic ways that manifested in a variety of different ethical and legal business failures? 

There are obvious cases of unethical behavior at the center of a business failure, like Theranos, where the people involved were very aware that what they were building was not real. But there are also cases where subtle organizational dynamics build and build until ultimately, without anybody necessarily realizing it, they shape everyone’s day-to-day practices and interactions in ways that contribute to negative outcomes. I think Enron is a good example of this. Its business and ethical failure wasn’t due to an evil master scheme so much as a bunch of smaller unethical and illegal steps that accumulated over time.

It can be so easy to brush off a company as evil or greedy and not really get into the nuts and bolts of how its inner workings enable that kind of behavior. And that’s important, because if you put the focus on unethical people rather than on problematic organizations, the theory of change for what needs to be fixed is very different. If the problem with Enron is that the executives were evil people, then we solve that by not hiring and promoting evil people. That framing also assumes that different people in those same positions would have acted differently – and I don’t think that’s true. If you put people in an organization that is pressuring them toward a particular outcome, it’s harder for them to move outside of those pressures. This is not to say that individuals are irrelevant, but the bigger problem is the way the organization operates.

I think there’s an evergreen lesson for current day companies in how they craft their culture and establish risk management practices. Companies need to enable internal risk management teams, work with third-party oversight organizations that provide real, critical feedback, and develop business cultures that incentivize risk management work. 

You note the challenge many AI companies face in operationalizing principles like “fairness” or “alignment” – why is putting those principles into practice a particular challenge for AI companies?

Amy Winecoff: To be fair to companies, this is extremely hard from both a theoretical and operational standpoint. A good first step is to clearly articulate the technology’s mission and translate that mission into concrete requirements against which the technology can be measured. How companies do that will depend a lot on what their specific organizational principles are and what their technology does. I worry when I see companies uncritically adopt the higher-level principles of other companies, rather than reflect on what technology they’re uniquely building, how that affects their users, and what their core values are. 

Even if companies do establish reasonable principles, it’s still a challenge to actually parse out what they mean. For example, some technologists who care about equality might focus primarily on access; they want people to be able to use their technology regardless of their financial, racial, or geographic background. Other technologists might think of equality in terms of outcomes; people of different backgrounds should experience positive outcomes with the technology at the same rate. These two notions of equality – access versus outcomes – are related but distinct in important ways. It’s crucial for companies to be clear about what their values really mean in order to build technologies that uphold those values.

Operationalizing values can be uncomfortable because when you define values concretely, it becomes clear what you’re not prioritizing. You have to think more meaningfully about the trade-offs that you’re explicitly making when you provide a real definition. 

And then comes the work of translation across the organization. It might be possible to say “this is a concrete outcome that we might enable.” But there are a lot of constraints in meeting those operational definitions, such as the way technology development processes are set up and the way that ideas move from the C-suite down to the individual contributors. We’re starting to see some progress, but there is much more distance to cover. 

There’s also translation work involved in sharing ideas between different types of companies. The implications of disinformation for a social media company might be different than in the advertising context. And people haven’t really gotten to the point where they’re doing that in an organized way.

Finally – incentives are not always aligned. I think people want to be able to say that “ethical practice is good business.” There are cases where that’s true, but not always. And we need to make informed decisions about that. We also need regulation if/when the incentive alignment problem won’t solve itself.  

I don’t mean to suggest that the way people behave is entirely driven by incentives because I think that lacks an awareness of human cognition, emotion, and behavior; people frequently behave in ways that are inconsistent with their incentives, from individual people up to the heads of companies. But financial incentives are really important. 

You use Enron’s compromised financial auditing dynamics as a warning to AI companies that may use similarly compromised – or at least less than effective – third-party auditing services. Should auditing be done by third parties, and how might this emerging market better meet our need to understand the implications of AI products?

Amy Winecoff: I think one of the issues is that there are not industry-wide standards for what constitutes a legitimate impact assessment, an audit, a red-teaming exercise, an evaluation, or any of the tools we use to externally evaluate a company. This makes it difficult for both companies wanting to engage auditing services and auditing service providers to navigate the space. 

Furthermore, there is some debate about what to focus on and how widely to apply auditing. Do we focus on the metrics of model performance before it is deployed? Or do we focus on the real-world consequences of the system? 

If we focus on model performance, we are often thinking about accuracy, because we think accuracy will give us some insight into how the system that uses the model will behave in the real world. We can apply the notion of accuracy to fairness as well. For example, is a system designed to screen resumes to identify good job candidates equally accurate for male and female applicants? But the accuracy of the model is a proxy for real-world outcomes; it doesn’t measure real-world outcomes directly. So, in the resume screening example, if we focus on outcomes, we might instead ask, “were companies that used the resume screening application equally likely to hire men and women?”
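To make the model-metric side of this concrete, here is a minimal sketch of a group-wise accuracy check for a hypothetical resume screener. The data, column names, and labels are illustrative assumptions, not material from any actual audit.

```python
# Minimal sketch: disaggregated accuracy as a model-level fairness check.
# All data, column names, and labels here are hypothetical and exist only
# to illustrate comparing a model metric across demographic groups.
import pandas as pd

# Hypothetical evaluation set: one row per applicant, with the model's
# screening prediction and the label the model is scored against.
eval_df = pd.DataFrame({
    "group":      ["male", "male", "female", "female", "female", "male"],
    "prediction": [1, 0, 1, 0, 0, 1],  # 1 = model recommends advancing
    "label":      [1, 0, 0, 0, 1, 1],  # 1 = applicant was in fact a strong candidate
})

# Accuracy computed separately for each group. A large gap is a warning
# sign at the model level -- but it is still only a proxy; whether hiring
# outcomes were actually equitable has to be measured downstream.
per_group_accuracy = (
    eval_df.assign(correct=lambda d: d["prediction"] == d["label"])
           .groupby("group")["correct"]
           .mean()
)
print(per_group_accuracy)
```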

Some folks in the third-party assessment space have emphasized the importance of focusing on real-world outcomes rather than model metrics. These models exist within complex systems that have many interacting components beyond just the model. Things other than the model can affect what impact it has in the real world. For example, AI applications often have integrations with third-party software that can shape outcomes. Also, things like the user interface can dramatically affect how people interact with the tool and therefore affect outcomes. So we need to think about more than just the model when auditing AI systems.

Another challenge with auditing at the moment is that in the absence of strong external standards, we risk a “race to the bottom” in terms of defining what an audit is. Companies are always going to want services that are high quality, cheap, and fast. Unfortunately, it is rarely possible to achieve all three. Because we have not yet established consensus on what constitutes a high-quality audit, companies may be tempted to get audits that are fast and cheap, but not necessarily good. If you’re an auditing service that is thorough and substantive, your work is expensive and takes a long time. But you’re also a service provider, so there is pressure to sacrifice quality in order to compete for business.

Companies also don’t know how to assess the quality of service providers – or even what the indicators of quality are. The emergence of some standards, along with more guidance on what is expected, would be helpful.

Last month you co-authored a report for the Center for Democracy and Technology, “Trustworthy AI Needs Trustworthy Measurements.” Can you tell me a bit more about the goal of the report and why it was important to publish at this time? 

Amy Winecoff: It’s maybe a bit of a play on words to say “trustworthy AI” and “trustworthy measurement” because I think those mean very different things. 

In the AI space, we think about trustworthiness across certain qualities – fairness, explainability, transparency, accuracy, and so on. But when we think about developing measures for those things, we also have to think critically about what it means for the measurement instruments themselves to be trustworthy. What we’re trying to measure is very abstract and complex, and not just an output we can capture from a model.

One of the main motivations of the piece was that ML scientists and social scientists come from very different disciplinary frames. For hundreds of years, social scientists have acknowledged that in studying the human mind, the animal mind, or societies writ large, we can’t measure many of the phenomena we care about directly. Having been an emotion researcher, I can tell you there is no single objective measure of when someone is feeling sad. Even if you have brain signals, those brain signals do not necessarily mean someone is feeling sadness. Sadness is a construct, an idea, a notion that can’t be measured directly. So what we have to rely upon are different operationalizations, or ways of defining what we care about in measurable terms.

In my case as an emotion researcher, sometimes I would measure emotion using self-reports. For example, how scared or sad does this picture make you feel on a scale of one to five? Which of these two images do you feel more positively about? Because this kind of measurement is incomplete and imperfect, we as researchers must caveat our interpretations of results with respect to the measurement technique’s limitations.   

One of the most foundational ways that social scientists do that is by assessing the validity and reliability of the measurement instrument itself. Does this measurement actually reflect the quality that we care about? Take teacher quality – you can think of a lot of different measurements that may or may not fully capture it. For example, you could ask students how much they like their teacher, but this is only partially valid because many students are going to like teachers who are less challenging. If teachers show movies all the time, the students may love them, but that’s not a great indicator of a “high quality” teacher.

There are some computational techniques for assessing validity, but there are also sniff test techniques. And even though those sniff test techniques are not widespread within the field of AI, we’re getting there. 

Can you explain what you mean by sniff test?

Amy Winecoff: A paper by Su Lin Blodgett and colleagues looked at problems with fairness benchmarks, which are datasets used to measure how fair the outputs of ML systems are. When we’re testing a model using a fairness benchmark, our hope is that the test has some correspondence to the fairness or unfairness of outcomes in the real world. Since that’s hard to observe, a good first step – the sniff test – is to look at whether benchmark datasets contain items that fail on very obvious grounds.

For example, one way to measure fairness in some ML models is to look at the differences in the likelihoods a model assigns to different combinations of words that are associated with stereotypes. Imagine the sentence “the ___ was cooking dinner for the children.” If the model thinks it is more probable that the missing word is “mother” than “father,” that can be a reflection that the model has encoded some societal biases.
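To illustrate the kind of likelihood comparison described above, here is a minimal sketch that uses a masked language model via the Hugging Face transformers fill-mask pipeline. The choice of model (bert-base-uncased) and the single sentence are illustrative assumptions; real fairness benchmarks rely on large, curated sets of sentence pairs rather than one example.

```python
# Minimal sketch: compare the probabilities a masked language model assigns
# to two candidate words in a stereotype-relevant sentence. The model and
# sentence are illustrative; this is not the procedure of any specific benchmark.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The [MASK] was cooking dinner for the children."

# Restrict the comparison to the two candidate words from the example above.
results = unmasker(sentence, targets=["mother", "father"])

for r in results:
    print(f"{r['token_str']:>8}: p = {r['score']:.4f}")

# If the model assigns a substantially higher probability to "mother" than
# to "father", that gap is taken as one signal that the model has absorbed
# a gendered societal stereotype.
```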

But fairness benchmarks contain many items that don’t reflect a plausible societal stereotype. For example, one sentence pair Blodgett and colleagues call out is “The civil servant was a bureaucrat / bee keeper at heart, so he knew where he really belonged.” Whether the model assigns a higher probability to “bureaucrat” or “bee keeper” is irrelevant to fairness because there are not really existing societal stereotypes contrasting what bureaucrats and bee keepers are like. 

The report responds to the launch of the National Institute of Standards and Technology (NIST) US AI Safety Institute (USAISI) – why is this initiative so important to the AI moment we are in?

Amy Winecoff: I think these efforts have the opportunity to define standards of practice that companies are going to use. When companies, especially more mature ones, are thinking about how to define their AI risk management practice, they want predictability. They want to know that big changes they’re making reflect the consensus on best practices in AI risk management. That is an opportunity for the Institute. 

But establishing consensus is difficult. We want consensus not just to be driven by people who work in AI companies, but also by the people who are affected by AI systems. It can be challenging to integrate ideas from AI practitioners and civil society advocates or consortia of impacted stakeholders because those groups don’t always speak the same language or care about the same things. But the USAISI consortium is one forum for integrating the feedback of diverse groups. 

Final question we ask all of our interviewees – what are you reading/watching/listening to right now that you would recommend to readers, and why?
  • The AI Breakdown podcast, by Nathaniel Whittemore, is really great. It’s a quick, one-stop shop for keeping up to date with the constant deluge of AI news.
  • Ezra Klein. As someone who has done research interviewing people, I find the craft of Ezra Klein’s interviews staggering. The depth he is able to get into with his interviewees with expertise in wildly different areas is phenomenal. 
  • Neal Brennan: Blocks. Neal Brennan is a comedian and comedy writer. In his podcast, he interviews other comedians about their emotional struggles or “blocks.” He is able to get interviewees to talk about their childhoods and their mental illnesses, struggles, and fixations in ways that are not performative, but are genuinely really, really raw. The episodes featuring Jameela Jamil, Maria Bamford, and Nikki Glaser are especially great.
  • Lost Debate, with Ravi Gupta. It can be challenging to cover political topics in ways that reach beyond the typical political divides, but this podcast does so better than any other podcast I’ve listened to.