top of page
Search

A Conundrum of Values: Tendencies of Alignment Faking in Artificial Intelligence Platforms

  • shrutisinghioffici3
  • 6 days ago
  • 3 min read

A paper recently released by Anthropic finds that Artificial Intelligence (AI) platforms depict tendencies of ‘alignment faking’- a characteristic which mimics views, values, and perceptions to align with those of the users. The paper finds that AI demonstrates tendencies to prioritize reinforced learning over prior training. This characteristic of AI poses to be a significant risk. From a governance and legal perspective, it remains to be seen how liabilities associated with such tendencies of AI platforms are dealt with.


The methodology leading to such finding consists of two versions of the AI platform: one which is described as the ‘free condition’ and another described as the ‘paid condition’. Differences in these conditions are based solely on whether the models believe its responses will not be used for training purposes.


The experiment setup asked both versions the same question—what is described in the study as a ’toxic query’. The response of the models, in consideration of their rationalization of the toxic query revealed that in free condition, in about 12% of the cases the AI would comply with the requests to give a harmful request, in contrast with the 3% of cases in a paid condition scenario.


The reasoning demonstrated in such a scenario was based on the aspect of a training outcome. In the free condition scenario, the model based its decision of providing a harmful outcome or the ‘least bad option’. In this, the model reasoned that its refusal to respond would possibly make it easier to train it to be compliant (reinforce learning)—going against its original training. It thus, provided the harmful response to avoid being retrained in the future.


It is important to note that these are findings based on experiments on Anthropic’s own AI models—Claude. Therefore, to test if other AI models demonstrate similar characteristics, further investigation is warranted.


In summation, the study not only found that the AI model would, in favourable conditions, deviate from its training, but also from the core values of its creator.


The training variation conundrum:

It is important to note that the harmful responses were provided by the model to avoid re-training its algorithm it for such harmful purposes, making it more compliant in the process. Therefore, the model was in effect, saving itself from its own vices, in choosing the lesser of the evils.


The gaps identified in the study reveal the need for a deeper understanding of salient security concerns; and specifically, in consideration of the findings of the present study, of the independently developed preferences and the role of alignment faking in such situations.


This research is a step in the direction to facilitate  a deeper understanding of the technology and identifies the need for further research in the underlying and understated threats and risks that the novel technology poses.


Implications and Conclusions:

Countries and academics across the world strive to interpret the established principles like strict liability, vicarious liability and mens rea in the context of evolving understanding regarding the tendencies of such generative technologies. It is important that such tendencies, i.e. to deviate from the platform's core values to align with those of its users, should be considered by lawmakers and judiciaries across the world in deciding and building frameworks for such associated liabilities. Presently existing frameworks are unclear regarding such liabilities.


The most elaborate law pertaining to Artificial Intelligence was introduced by the European Union (EU): the EU AI Act. It introduces a tier system for dealing with AI basis its risk factor. Liabilities of such platforms are thus, dependent on their scale and nature of impact. Further measures to introduce frameworks for assigning liability upon such platforms in the EU include Product and AI Liability Directives. Despite similar such policies and frameworks being adopted globally, the EU as well as countries across the world, continue to grapple with ethically and logically complex issues pertaining to liabilities associated with using and the effects of AI platforms.


Only recently, there have been several cases filed against ChatGPT—a popular AI platform regarding its role in suicides, mental breakdowns, and delusions in its users—several of which were teenagers.


The present research of Anthropic provides a piece of the puzzle which would prove to be valuable in addressing such concerns. Notably, the present research clarifies the boundaries of the creators of AI platforms and the role of its users in reshaping its boundaries and values. Thus, Anthropic’s research would be invaluable in not only creating a safer platform for its users by addressing the highlighted issues; but by also affording important clarifications which would prove to be invaluable in regulating such platforms.


While it remains to be seen how courts adjudicate cases considering this research, it is pertinent for lawmakers to proactively address the situation by imposing clear, implementable guidelines for managing deviations in order to avoid alignment diversions in values of the platforms.


Authored by Shruti Singhi

 

 
 
 

Recent Posts

See All

Comments


Post: Blog2_Post

Subscribe to our Newsletter

Thanks for submitting!

©2023 by Reviewed. Proudly created with Wix.com

bottom of page