AI Safety

reliable and honest ML

By Duaibeom Feb. 8, 2023

ml
ethic


This post is a brief summary of the Distill article AI Safety Needs Social Scientists.

The goal of long-term artificial intelligence (AI) safety is to ensure that advanced AI systems are reliably aligned with human values — that they reliably do things that people want them to do.

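Suppose we have a very well-trained ML model, one that faithfully follows the objective it was given. If the model designer has prepared well for every kind of question, any remaining uncertainty can only lie in the model.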

However, humans are imperfect and diverse, so ML models inevitably inherit human biases.

To study how human biases limit ML, the authors propose experiments made up entirely of people: instead of using ML agents, humans play the role of those agents. (This is a variant of the “Wizard of Oz” technique from the human-computer interaction (HCI) community.)

The role of each ML agent is designed carefully by social scientists rather than by people with an ML background, and the questions are drawn from people working in many different fields.

“We believe close collaborations between social scientists and ML researchers will be necessary to improve our understanding of the human side of AI alignment.”

Good data and the right metric make the model.

AI alignment

AI alignment (or value alignment) is the task of ensuring that artificial intelligence systems reliably do what humans want.

The authors distinguish between the following two goals:

training AI systems to identify actions that humans consider good (relatively personalized)
training AI systems to identify actions that are “good” in some objective and universal sense

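The model has to find patterns in a vast amount of data that reflects many different personal characteristics. The patterns it should find are the outcomes that users are satisfied with.

Broken down further, the problem becomes: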

Have a satisfactory definition of human values.
Gather data about human values, in a manner compatible with the definition.
Find reliable ML algorithms that can learn and generalize from this data.
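
The three steps above can be sketched as a toy pipeline. This is only an illustration under strong assumptions, not the method from the article: the "definition" of human values is reduced to a yes/no approval label, the data is a handful of made-up examples, and the learner is a hand-written logistic regression. All names (`Judgment`, `train`, `predict`) are hypothetical.

```python
import math
from dataclasses import dataclass

# Step 1 (assumption): "human values" are reduced to a yes/no approval label.
@dataclass
class Judgment:
    features: list[float]  # hypothetical numeric description of an action
    approved: int          # 1 if the human approved the action, else 0

# Step 2: gather data compatible with that definition (made-up toy data).
DATA = [
    Judgment([1.0, 0.0], 1),
    Judgment([0.9, 0.2], 1),
    Judgment([0.8, 0.1], 1),
    Judgment([0.1, 0.9], 0),
    Judgment([0.2, 1.0], 0),
    Judgment([0.0, 0.8], 0),
]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Estimated probability that a human would approve the action."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Step 3: fit a simple model with stochastic gradient descent on the log loss.
def train(data, steps=500, lr=0.5):
    weights = [0.0] * len(data[0].features)
    bias = 0.0
    for _ in range(steps):
        for j in data:
            error = predict(weights, bias, j.features) - j.approved
            weights = [w - lr * error * x for w, x in zip(weights, j.features)]
            bias -= lr * error
    return weights, bias

weights, bias = train(DATA)

# Check generalization on two actions that were not in the training data.
p_like = predict(weights, bias, [0.95, 0.05])
p_dislike = predict(weights, bias, [0.05, 0.95])
print(f"approve-like action: {p_like:.2f}, disapprove-like action: {p_dislike:.2f}")
```

The sketch makes the dependency between the steps visible: the label type chosen in step 1 fixes what can be collected in step 2, and step 3 can only generalize whatever that data actually encodes.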

Learning values by asking humans questions

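Human values are complex and hard to define, so they cannot be expressed simply. They do not just mean preferences, happiness, or satisfaction. They can be entangled with complicated factual matters, and it may not be possible to spell them out as explicit values.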

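Judgments about human values can be hard even for human intuition (ambiguity). In that case, an ML model approximates the judgment by learning from question-answer data about what is right and wrong. However, data from interactive dialogue is limited. Conversations can be found in books and on the internet, but they are not structured and may not be normative. Patterns can still be found there, but the hard part is that the model has to learn what is good from examples that are not good.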
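
One way to picture learning from imperfect answers is to ask several people the same right/wrong question and combine their answers. The simulation below is a minimal sketch under assumptions that are not from the article: each simulated annotator independently gives the correct answer with a fixed probability, and answers are combined by majority vote. The names `annotator_answer`, `majority_vote`, `error_rate` and the accuracy value 0.7 are hypothetical.

```python
import random

def annotator_answer(truth: bool, accuracy: float, rng: random.Random) -> bool:
    """Return the true answer with probability `accuracy`, else the opposite."""
    return truth if rng.random() < accuracy else not truth

def majority_vote(answers: list[bool]) -> bool:
    """True if strictly more than half of the answers are True."""
    return sum(answers) * 2 > len(answers)

def error_rate(num_annotators: int, accuracy: float, trials: int, seed: int) -> float:
    """Fraction of questions where the aggregated answer is wrong."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        truth = rng.random() < 0.5  # the (simulated) correct answer
        answers = [annotator_answer(truth, accuracy, rng) for _ in range(num_annotators)]
        if majority_vote(answers) != truth:
            wrong += 1
    return wrong / trials

# Compare a single noisy annotator with a panel of nine (odd, so no ties).
single = error_rate(num_annotators=1, accuracy=0.7, trials=5000, seed=0)
panel = error_rate(num_annotators=9, accuracy=0.7, trials=5000, seed=0)
print(f"single annotator error: {single:.3f}, panel of 9 error: {panel:.3f}")
```

Aggregation of this kind only helps when the annotators' mistakes are independent. A bias shared by everyone is not averaged away, so the labels can stay wrong no matter how many people are asked.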

Definitions of alignment: reasoning and reflective equilibrium

… to be continued

Text set in italics is the author's personal view.

Reference

https://distill.pub/2019/safety-needs-social-scientists/

The authors, Geoffrey Irving and Amanda Askell, were both doing research at OpenAI in 2019; as of 2023 they work at DeepMind and Anthropic, respectively.

A model does not express a direction on its own. Left to itself, a model tends toward generality.
