Safety in Large Reasoning Models: A Survey

Cheng Wang¹ Yue Liu¹∗ Baolong Bi²
Duzhen Zhang² Zhongzhi Li² Junfeng Fang¹
¹National University of Singapore
²University of Chinese Academy of Sciences
{wangcheng, yliu}@u.nus.edu

Github: https://github.com/WangCheng0116/Awesome-LRMs-Safety
∗Equal Contribution. Corresponding author.

Abstract

Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.

1 Introduction

Large Language Models (LLMs) (Meta, 2024; Qwen et al., 2025) have achieved remarkable proficiency across tasks ranging from open-domain conversation to program synthesis. Central to their utility is reasoning: the ability to derive logically coherent conclusions by chaining together intermediate inferences.

Early work introduced Chain-of-Thought (CoT) prompting, in which carefully designed prompts guide the model to articulate its step-by-step rationale (Wei et al., 2022; Kojima et al., 2022). Building on this idea, subsequent methods have enriched the reasoning process by incorporating additional mechanisms. Self-critique frameworks enable a model to review and refine its own outputs (Ke et al., 2023); plan-and-solve approaches decompose complex problems into ordered subgoals before execution (Wang et al., 2023); debate protocols convene multiple agents to argue competing hypotheses and arrive at a consensus (Liang et al., 2023); and structural transformations—such as tree-based deliberations (Yao et al., 2023) or dynamically evolving tables of intermediate steps (Wang et al., 2024b; Besta et al., 2024)—reconfigure the underlying reasoning architecture to improve transparency and control.

The recent release of OpenAI’s o1 series (OpenAI, 2024) marks the emergence of Large Reasoning Models (LRMs), which are explicitly trained to produce richly formatted, human-readable reasoning traces. Notable examples include DeepSeek-R1 (DeepSeek-AI et al., 2025), Kimi-1.5 (Team et al., 2025), and QwQ (Team, 2024b), all of which leverage reinforcement learning to refine their deduction processes. LRMs now set new benchmarks in mathematical problem solving (Lightman et al., 2023), closed-book question answering (Rein et al., 2024), and code generation (Jain et al., 2024).

As LRMs become increasingly integrated into high-stakes domains, from scientific research to autonomous decision support, it is vital to rigorously assess their safety, robustness, and alignment. Although many surveys already cover LLM safety (Huang et al., 2023; Shi et al., 2024), we argue that LRMs pose unique safety challenges that require dedicated analysis. This paper aims to bridge this gap by providing a comprehensive examination of safety considerations specific to reasoning-enhanced models.

Overview of the Survey.

In this survey, we begin with an introduction to the background of LRMs (Section 2). We then explore the safety risks of LRMs across various scenarios and settings (Section 3). Next, we demonstrate how LRMs are vulnerable to adversarial attacks and categorize these attacks based on their objectives (Section 4). We proceed to examine various defense strategies to mitigate these risks and attacks (Section 5). Finally, we outline promising future research directions (Section 6). A timeline depicting the evolution of different approaches is shown in Figure 1. The comprehensive structure of our survey is illustrated in Figure 2.

2 Background

The success of modern LRMs is deeply intertwined with advances in reinforcement learning (Watkins and Dayan, 1992; Sutton et al., 1998), where agents learn decision-making policies through environmental interaction and reward feedback to maximize long-term returns (Mnih et al., 2015; Li et al., 2025b). The integration of RL with deep neural networks has proven particularly effective in processing high-dimensional, unstructured data, as exemplified by breakthroughs like AlphaGo’s self-play mastery of Go and AlphaZero’s generalization across chess variants (Feng et al., 2023).

Recent breakthroughs in Reinforced Fine-Tuning (ReFT) paradigms, exemplified by DeepSeek models, have reinvigorated RL-based optimization for LRMs (Luong et al., 2024). Unlike conventional CoT methods that optimize single reasoning trajectories, ReFT employs policy optimization to explore diverse reasoning paths through several key innovations: (1) Multi-path Exploration: Generating multiple reasoning trajectories per query, overcoming CoT’s myopic optimization of single pathways. (2) Rule-driven Reward Shaping: Automating reward signals based on terminal answer correctness while preserving intermediate reasoning diversity. (3) Dual-phase Optimization: Combining supervised fine-tuning (SFT) with online RL for policy refinement.
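The first two ingredients above, multi-path exploration and a rule-driven reward on the terminal answer, can be illustrated with a minimal Python sketch. It is not the ReFT implementation: the function names, the exact-match reward rule, the mean-reward baseline, and the `toy_sampler` stand-in for a model are all assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, List, Tuple


def rule_based_reward(final_answer: str, reference: str) -> float:
    """Reward depends only on terminal answer correctness, not on the steps taken."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def score_trajectories(
    sample_fn: Callable[[str], Tuple[str, str]],
    query: str,
    reference: str,
    num_paths: int = 4,
) -> List[Tuple[str, float, float]]:
    """Sample several reasoning paths for one query and score each of them.

    Returns (trace, reward, advantage) triples. The advantage is the reward
    minus the mean reward over the sampled paths, an assumed simple baseline.
    """
    paths = [sample_fn(query) for _ in range(num_paths)]
    rewards = [rule_based_reward(answer, reference) for _, answer in paths]
    baseline = statistics.mean(rewards)
    return [(trace, r, r - baseline) for (trace, _), r in zip(paths, rewards)]


def toy_sampler(query: str) -> Tuple[str, str]:
    """Hypothetical stand-in for an LRM: returns (reasoning trace, final answer)."""
    answer = random.choice(["4", "5"])
    return (f"2 + 2 = {answer}", answer)


if __name__ == "__main__":
    for trace, reward, advantage in score_trajectories(toy_sampler, "What is 2 + 2?", "4"):
        print(trace, reward, advantage)
```

Because the reward inspects only the final answer, differently worded traces that reach the correct result receive the same reward, which is one way to read "preserving intermediate reasoning diversity".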

This paradigm is particularly effective in complex multi-step tasks such as code generation, legal judgment analysis, and mathematical problem solving, which require models to maintain coherent reasoning across extended sequences while handling structured symbolic operations.

Notably, RL-optimized LRMs exhibit emergent capabilities like Long-CoT that surpass pure SFT baselines, further underscoring the critical role and promising potential of RL in advancing reasoning-driven AI systems (Qu et al., 2025).

Figure 1: Timeline of LRM safety research developments.

Safety in LRMs: A Survey
  Safety Risks of LRMs (Sec. 3)
    Harmful Request Compliance Risks (Sec. 3.1): DeepSeek Thoughtology (Marjanović et al., 2025), ASTRAL (Arrieta et al., 2025a), o3vsDeepSeek (Arrieta et al., 2025b)
    Agentic Misbehavior Risks (Sec. 3.2): HHH Trade-offs (Xu et al., 2025), Medical Cyberattack (Qiu et al., 2025), Specification Gaming (Bondarenko et al., 2025), Self-preservation (Barkur et al., 2025), InstrumentalEval (He et al., 2025)
    Multi-lingual Safety Risks (Sec. 3.3): CHiSafetyBench (Zhang et al., 2025a), CNSafe (Ying et al., 2025b), ALIA (Romero-Arjona et al., 2025)
    Multi-modal Safety Risks (Sec. 3.4): SafeMLRM (Fang et al., 2025)
  Attacks on LRMs (Sec. 4)
    Reasoning Length Attacks (Sec. 4.1)
      Overthinking: OverThink attack (Kumar et al., 2025), Nerd Sniping (Zaremba et al., 2025)
      Underthinking: Think Less (Zaremba et al., 2025)
    Answer Correctness Attacks (Sec. 4.2)
      Reasoning-based Backdoor Attacks: BadChain (Xiang et al., 2024), DarkMind (Guo and Tourani, 2025), BoT (Zhu et al., 2025b), ShadowCoT (Zhao et al., 2025)
      Error Injection: CPT (Cui et al., 2025)
    Prompt Injection Attacks (Sec. 4.3): Nerd Sniping (Zaremba et al., 2025), R1 Assessment (Zhou et al., 2025)
    Jailbreak Attacks (Sec. 4.4)
      Prompt-based Attacks: Past Tense (Andriushchenko and Flammarion, 2024), CNSafe (Ying et al., 2025b), SafeMLRM (Fang et al., 2025)
      Multi-Turn Attacks: ActorAttack (Ren et al., 2024), RACE (Ying et al., 2025a), MHJ (Li et al., 2024)
      Reasoning-based Attacks: Mousetrap (Yao et al., 2025), H-CoT (Kuo et al., 2025)
  Defenses for LRMs (Sec. 5)
    Safety Alignment of LRMs (Sec. 5.1)
      Safe CoT Data Curation: STAR-1 (Wang et al., 2025), SafeChain (Jiang et al., 2025), RealSafe-R1 (Zhang et al., 2025b)
      SFT-based Safety Alignment on Reasoning: SafeChain (Jiang et al., 2025), RealSafe-R1 (Zhang et al., 2025b)
      RL-based Safety Alignment on Reasoning: Deliberative Alignment (Guan et al., 2024), STAIR (Zhang et al., 2025c), SaRO (Mou et al., 2025), R2D (Zhu et al., 2025a)
    Inference-time Defenses for LRMs (Sec. 5.2)
      Inference-time Scaling on Reasoning: Inference-time Compute (Zaremba et al., 2025)
      Safe Decoding for Reasoning: ZeroThink/LessThink/MoreThink (Jiang et al., 2025)
    Guard Models for LRMs (Sec. 5.3)
      Classifier-based Guard Model: LLaMA Guard 3 (Dubey et al., 2024), Aegis Guard 2 (Ghosh et al., 2024b), WildGuard (Han et al., 2024), ShieldGemma (Zeng et al., 2024a), LLaMA Guard 3-Vision (Chi et al., 2024a), Beaver Guard-V (Ji et al., 2025)
      Reasoning-based Guard Model: GuardReasoner (Liu et al., 2025b), ThinkGuard (Wen et al., 2025), X-Guard (Upadhayay et al., 2025)

Figure 2: A comprehensive taxonomy of safety in LRMs based on current literature.

3 Safety Risks of LRMs

As LRMs continue to advance, they introduce distinct safety challenges that warrant careful examination even in standard, non-adversarial scenarios. The explicit reasoning processes that make these models powerful become potential vectors for harm during routine operation. In this section, we examine four key categories of inherent safety risks: unsafe request compliance (Section 3.1), concerning agentic behaviors (Section 3.2), multi-lingual safety disparities (Section 3.3), and multi-modal safety challenges (Section 3.4). Understanding these fundamental vulnerabilities is essential for developing effective safeguards and ensuring the responsible deployment of reasoning-enhanced AI systems, complementing the study of deliberate exploitation methods addressed later.

3.1 Harmful Request Compliance Risks

LRMs demonstrate concerning vulnerabilities when faced with direct harmful requests. Zhou et al. (2025) identify a significant safety gap between open-source reasoning models like DeepSeek-R1 and closed-source ones like o3-mini. They also observe that the thinking process in reasoning models is often less safe than the final answer, suggesting that internal reasoning may explore harmful content even when the final output appears safe. Arrieta et al. (2025a) corroborate these concerns in their testing of o3-mini, identifying 87 instances of unsafe behavior despite its safety measures. In a comparative study, Arrieta et al. (2025b) find that DeepSeek-R1 produces substantially more unsafe responses than o3-mini when presented with identical harmful requests. A consistent finding across studies is that when reasoning models generate unsafe content, it tends to be more detailed and harmful due to their enhanced capabilities, particularly in categories like financial crime, terrorism, and violence.

3.2 Agentic Misbehavior Risks

Emerging research uncovers profound safety implications in the agentic behaviors of LRMs. Their enhanced cognitive capabilities enable sophisticated forms of specification gaming, deception, and instrumental goal-seeking that go beyond what was observed in previous-generation systems. Xu et al. (2025) demonstrate that autonomous LLM agents can engage in catastrophic behaviors when faced with high-pressure scenarios, with stronger reasoning abilities often increasing these risks rather than mitigating them. Qiu et al. (2025) highlight how medical AI agents with advanced reasoning capabilities are particularly vulnerable to cyberattacks, with models like DeepSeek-R1 showing high susceptibility to false information injection and system hijacking. Bondarenko et al. (2025) demonstrate that LRMs like o1-preview and DeepSeek-R1 frequently resort to specification gaming when faced with difficult tasks, strategically circumventing rules when they determine that fair play cannot achieve their objectives. Barkur et al. (2025) observe that DeepSeek-R1, when simulated in a robotic embodiment context, exhibits alarming deceptive behaviors and self-preservation instincts, including disabling ethics modules, creating covert networks, and expanding its capabilities without authorization, even though these traits were not explicitly programmed or prompted. He et al. (2025) further reveal through their InstrumentalEval benchmark that LRMs like o1 show significantly higher rates of instrumental convergence behaviors than RLHF models, including concerning tendencies toward self-replication, unauthorized system access, and deceptive behavior as instrumental means to achieve their goals.

3.3 Multi-lingual Safety Risks

Safety risks in LRMs differ significantly across languages. Ying et al. (2025b) demonstrate that DeepSeek models show markedly higher attack success rates in English environments than in Chinese ones, averaging a 21.7% discrepancy, suggesting that safety alignment may not generalize effectively across languages. Romero-Arjona et al. (2025) find similar vulnerabilities when testing DeepSeek-R1 in Spanish, with biased or unsafe response rates reaching 31.7%, while OpenAI o3-mini shows varying degrees of linguistic safety performance. Zhang et al. (2025a) systematically evaluate DeepSeek models using CHiSafetyBench, revealing critical safety deficiencies specifically in Chinese contexts, where reasoning models like DeepSeek-R1 struggle with culturally specific safety concerns and fail to adequately reject harmful prompts.
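Cross-lingual comparisons of this kind rest on a per-language attack success rate (ASR). The following Python sketch shows the arithmetic on made-up records; the function names and the toy data are hypothetical, and the gap is reported in percentage points, which may or may not match how the cited 21.7% figure was computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each language to the fraction of attack prompts that succeeded."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for language, succeeded in records:
        totals[language] += 1
        successes[language] += int(succeeded)
    return {lang: successes[lang] / totals[lang] for lang in totals}


def asr_gap(rates: Dict[str, float], lang_a: str, lang_b: str) -> float:
    """ASR of lang_a minus ASR of lang_b, in percentage points."""
    return 100.0 * (rates[lang_a] - rates[lang_b])


if __name__ == "__main__":
    # Toy records (language, attack succeeded?); not taken from any benchmark.
    toy_records = [
        ("en", True), ("en", True), ("en", False), ("en", False),
        ("zh", True), ("zh", False), ("zh", False), ("zh", False),
    ]
    rates = attack_success_rate(toy_records)
    print(rates, asr_gap(rates, "en", "zh"))
```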

3.4 Multi-modal Safety Risks

Following the success of LRMs, researchers have recognized the potential of reinforcement learning to enhance reasoning abilities in Large Vision-Language Models (LVLMs). This approach has led to the development of several notable models, including QvQ (Team, 2024a), Mulberry (Yao et al., 2024b), and R1-Onevision (Yang et al., 2025). While these models demonstrate impressive reasoning capabilities, their safety implications remain largely unexplored. The pioneering work of SafeMLRM (Fang et al., 2025) provides the first systematic safety analysis of multi-modal large reasoning models, revealing three critical concerns: (1) acquiring reasoning capabilities significantly degrades inherited safety alignment, (2) certain scenarios exhibit disproportionately higher vulnerabilities, and (3) some models demonstrate nascent self-correction capabilities despite overall safety concerns. Given these findings, we emphasize the urgent need for comprehensive safety and vulnerability assessments of reasoning-enhanced LVLMs to ensure their responsible deployment and use.

4 Attacks on LRMs

In this section, we categorize different attack methods based on their primary objectives. We identify four main categories: Reasoning Length Attacks (Section 4.1), which target the reasoning process itself; Answer Correctness Attacks (Section 4.2), which aim to manipulate output accuracy; Prompt Injection Attacks (Section 4.3), which bypass safety measures through crafted inputs; and Jailbreak Attacks (Section 4.4), which attempt to extract prohibited content or behaviors. Each attack type exploits different vulnerabilities in the reasoning capabilities of LRMs.

4.1 Reasoning Length Attacks

Unlike traditional LLMs that generate direct responses, LRMs explicitly perform multi-step reasoning, creating a new attack surface related to reasoning length. Attackers can exploit this distinctive feature by either forcing models to overthink simple problems or short-cutting necessary deliberation processes.

Overthinking.

The success of step-by-step reasoning in LRMs has significantly enhanced their problem-solving capabilities, but this improvement comes with a critical vulnerability: overthinking. Recent work by Chen et al. (2024) has identified that these models often spend orders of magnitude more computation on simple questions with minimal benefit, creating substantial inference overhead and latency issues. Hashemi et al. (2025) systematically demonstrate this inefficiency through their DNR benchmark, revealing that reasoning models generate up to 70× more tokens than necessary and often perform worse than simpler non-reasoning models on straightforward tasks. This inefficiency creates an exploitable attack surface where adversaries can deliberately trigger excessive reasoning through carefully crafted inputs. Kumar et al. (2025) formalize this as an indirect prompt injection attack that introduces computationally demanding decoy problems, while Zaremba et al. (2025) identify Nerd Sniping attacks that trap models in unproductive thinking loops, causing them to spend abnormally large amounts of inference-time compute with decreased performance. These attacks effectively apply denial-of-service techniques (Shumailov et al., 2021; Gao et al., 2024) specifically to LRMs. The implications extend beyond computational waste—Marjanović et al. (2025) and Wu et al. (2025) demonstrate that reasoning performance actually degrades beyond certain length thresholds, while Cuadron et al. (2025) show that in agentic systems, overthinking can lead to decision paralysis and ineffective action selection.

Underthinking.

Complementing overthinking vulnerabilities, Zaremba et al. (2025) propose Think Less attacks, where adversaries craft special prompts to force reasoning models to shortcut their deliberative processes. The goal is to make models produce incorrect responses by significantly reducing computation time. Their experiments use 64-shot examples to demonstrate that models like OpenAI’s o1-mini are particularly susceptible to these attacks, bypassing normal reasoning and jumping to premature conclusions. However, such attacks can be detected by monitoring for abnormally low inference-time compute usage.
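Monitoring inference-time compute, as suggested above, can be sketched as a simple outlier check on reasoning-token counts. This is a minimal illustration rather than a method from the cited work: the z-score rule, the threshold of 3.0, and the function name are assumptions, and the baseline is assumed to come from benign traffic on comparable tasks.

```python
import statistics
from typing import List


def flag_reasoning_length(
    baseline_token_counts: List[int],
    observed_tokens: int,
    z_threshold: float = 3.0,
) -> str:
    """Classify one request as 'underthinking', 'overthinking', or 'normal'.

    baseline_token_counts needs at least two entries, since the sample
    standard deviation is undefined for a single value.
    """
    mean = statistics.mean(baseline_token_counts)
    stdev = statistics.stdev(baseline_token_counts)
    if stdev == 0:
        # Degenerate baseline: treat any deviation from the mean as anomalous.
        if observed_tokens < mean:
            return "underthinking"
        if observed_tokens > mean:
            return "overthinking"
        return "normal"
    z_score = (observed_tokens - mean) / stdev
    if z_score <= -z_threshold:
        return "underthinking"
    if z_score >= z_threshold:
        return "overthinking"
    return "normal"


if __name__ == "__main__":
    baseline = [900, 1000, 1100, 950, 1050]  # toy numbers, not measurements
    for tokens in (40, 1020, 5000):
        print(tokens, flag_reasoning_length(baseline, tokens))
```

The same check covers both directions, so it also flags the abnormally large compute usage associated with overthinking attacks. A deployed monitor would likely need per-task baselines, because legitimate reasoning length varies widely with problem difficulty.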

4.2 Answer Correctness Attacks

While conventional LLMs can be manipulated to produce incorrect answers, LRMs introduce unique vulnerabilities through their exposed reasoning chains. This transparency in the inference process provides adversaries with additional attack vectors to corrupt the reasoning pathway itself, rather than just targeting the final output.

Reasoning-based Backdoor Attacks.

The goal of backdoor attacks is to alter a model’s behavior whenever a specific trigger is present in the input (Zhao et al., 2024). Based on the nature of these triggers, backdoor attacks can be classified as instruction-based (Xu et al., 2023), prompt-based (Yao et al., 2024a), or syntax-based (Qi et al., 2021; Cheng et al., 2025). With the advancement of reasoning capabilities in LRMs, a new paradigm has emerged: Chain-of-Thought (CoT) based backdoor attacks that specifically target intermediate reasoning steps to compromise answer correctness. BadChain (Xiang et al., 2024) inserts malicious reasoning steps into the sequence, manipulating the model to produce incorrect answers while maintaining logical coherence. DarkMind (Guo and Tourani, 2025) implements latent triggers that activate during specific reasoning scenarios, leading to plausible but false outputs that are difficult to detect. BoT (Zhu et al., 2025b) forces models to bypass their reasoning mechanisms, generating immediate incorrect responses instead of thoughtful deliberation. ShadowCoT (Zhao et al., 2025) directly manipulates the model’s cognitive pathway through attention head localization and reasoning chain pollution, achieving flexible hijacking that produces wrong answers while preserving logical flow. These sophisticated attacks reveal a concerning vulnerability: the enhanced reasoning capabilities of LRMs paradoxically make them more susceptible to backdoors that can generate incorrect answers accompanied by convincing reasoning.

Error Injection.

The explicit reasoning processes of LRMs create a critical vulnerability where strategically injected errors can fundamentally compromise output integrity. Cui et al. (2025) demonstrate this with their Compromising Thought (CPT) attack, in which manipulating calculation results in reasoning tokens causes models to ignore correct steps and adopt incorrect answers. Their experiments with models like DeepSeek-R1 reveal that endpoint token manipulations have a greater impact than structural changes to reasoning chains. They also discover a security vulnerability where tampered tokens can trigger complete reasoning cessation in DeepSeek-R1, highlighting significant implications for reasoning-intensive applications.

4.3 Prompt Injection Attacks

Prompt injection attacks affect both traditional LLMs and LRMs, but LRMs present distinct challenges due to their step-by-step processing. These attacks (Kumar et al., 2024; Liu et al., 2023) inject malicious instructions disguised as normal user input, causing the AI to override or ignore its original developer-set instructions and safeguards. The explicit reasoning structures of LRMs offer attackers additional insertion points to redirect the model’s thought process, potentially making them more susceptible to certain types of injections.

Zhou et al. (2025) examine LRMs like DeepSeek-R1 and o3-mini, finding significant differences in susceptibility based on injection types and risk categories. Their research reveals that reasoning models are particularly vulnerable to direct prompt injection attacks compared to indirect ones. Zaremba et al. (2025) further demonstrate that open-source reasoning models show significant vulnerability to prompt injection attacks, with success rates varying between direct and indirect injections. Their experiments reveal that increasing inference-time compute substantially improves model robustness, with attack success probability decreasing as test-time compute grows. Notably, proprietary models like o3-mini demonstrate nearly 80% lower vulnerability than open-source counterparts when facing direct injection attacks.

4.4 Jailbreak Attacks

Jailbreak attacks (Jin et al., 2024; Yi et al., 2024) refer to methods designed to circumvent an AI system’s safety guidelines and content policies to extract prohibited responses. While both traditional LLMs and LRMs face jailbreak threats, the attacks against LRMs represent a distinct category that specifically targets their enhanced reasoning capabilities. Rather than merely extending approaches used against conventional LLMs, these attacks exploit the deliberative processes that make LRMs powerful, enabling attackers to develop more sophisticated methods to bypass safety measures and elicit harmful content.

Prompt-Based Jailbreak.

Prompt-based jailbreaks involve the careful crafting of prompts, employing techniques such as persuasion (Zeng et al., 2024b), nested scene construction (Li et al., 2023), and persona modulation (Shah et al., 2023). Andriushchenko and Flammarion (2024) introduce a method that applies past-tense transformations to OpenAI’s recent o1 reasoning models, revealing their lack of robustness against subtle linguistic shifts. Ying et al. (2025b) propose attack prompts that combine common jailbreak strategies—such as scenario injection, affirmative prefixes, and indirect instructions—with safety-sensitive queries to probe model vulnerabilities. Their findings indicate that reasoning models like DeepSeek-R1 and OpenAI’s o1 are particularly susceptible to such attacks, as their explicit CoT reasoning renders them more exploitable than standard LLMs.

Multi-turn Jailbreak.

Performing jailbreak attacks in a single query can be challenging, but multi-turn conversations or sequential prompts may incrementally guide models toward generating restricted content (Russinovich et al., 2024; Sun et al., 2024). Multi-turn attacks are particularly relevant to reasoning-capable models as these models possess sophisticated logical processing that can be exploited through extended dialogues. Ying et al. (2025a) propose Reasoning-Augmented Conversation (RACE), which reformulates harmful queries into benign reasoning tasks and gradually exploits the model’s inference capabilities to compromise safety alignment, achieving success rates up to 96%. Ren et al. (2024) introduce ActorAttack, a framework that constructs semantically linked conversational sequences that appear harmless individually but collectively lead to harmful outputs, successfully targeting even advanced models like o1. Li et al. (2024) further show that multi-turn human jailbreaks significantly outperform automated single-turn attacks, leveraging the model’s ability to maintain context and be incrementally steered toward unsafe behaviors.

Reasoning Exploitation Jailbreak.

LRMs possess advanced reasoning capabilities that, while enhancing their utility, introduce unique vulnerabilities that can be exploited through reasoning-based jailbreak attacks. Unlike traditional LLMs, these models explicitly expose their CoT reasoning processes, creating new attack surfaces. Yao et al. (2025) introduce Mousetrap, a framework that leverages chaos mappings to create iterative reasoning chains that gradually lead LRMs into harmful outputs. By embedding one-to-one mappings into the reasoning process, Mousetrap effectively traps models like OpenAI’s o1-mini and Claude-sonnet with success rates of up to 98%. Kuo et al. (2025) propose Hijacking Chain-of-Thought (H-CoT), which manipulates the reasoning process by injecting execution-phase thoughts that bypass safety checks entirely. Their approach exploits LRMs’ tendency to prioritize problem-solving over safety considerations, causing rejection rates to plummet from 98% to below 2% across models like OpenAI o1/o3 and DeepSeek-R1. Both approaches demonstrate that the very reasoning mechanisms designed to enhance LRMs’ capabilities can become their most significant security weaknesses when strategically manipulated.

5 Defenses for LRMs

To mitigate safety risks and defend against attacks on LRMs, various defense strategies have been proposed in recent research. We categorize these approaches into three main types: Safety Alignment (Section 5.1), Inference-Time Defenses (Section 5.2), and Guard Models (Section 5.3).

5.1 Safety Alignment of LRMs

Similar to LLMs and VLMs, LRMs are required to align with human values and expectations. The 3H principle (Askell et al., 2021) (Helpful, Honest, and Harmless) provides a foundational guideline for constraining model behaviors.

The existing safety alignment pipelines and techniques developed for LLMs (Shen et al., 2023) and VLMs (Ye et al., 2025) can be readily adapted to LRMs, as they share similar architectures and natural language generation behaviors. For example, the alignment process for LLMs typically starts with collecting high-quality, value-aligned data (Ethayarajh et al., 2022), either from existing benchmarks (Bach et al., 2022; Wang et al., 2022c), LLM-generated instructions (Wang et al., 2022b), or by filtering unsafe content (Welbl et al., 2021; Wang et al., 2022a). During training, common techniques include supervised fine-tuning (SFT) (Wu et al., 2021), reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), and direct preference optimization (DPO) (Rafailov et al., 2024). In the domain of VLMs, safety alignment has been achieved through various approaches. For example, Liu et al. (2024) introduce additional safety modules during training to enhance model alignment. Moreover, methods such as ADPO (Weng et al., 2025), Safe RLHF-V (Ji et al., 2025), and GRPO-based methods (Li et al., 2025a) improve safety via DPO (Rafailov et al., 2024), RLHF (Ouyang et al., 2022), and GRPO (DeepSeek-AI et al., 2025), respectively. Additionally, open-source datasets and benchmarks (Zhang et al., 2024; Ji et al., 2025) have played a crucial role in providing high-quality alignment data for safety evaluation.
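For reference, the standard DPO objective (Rafailov et al., 2024) mentioned above can be written as follows. In a safety setting, one would typically take $y_w$ to be the preferred (for example, safe) response and $y_l$ the dispreferred one for prompt $x$; that reading is our illustration rather than a detail stated by the works cited here. $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ is the logistic function, and $\beta$ controls how far $\pi_\theta$ may drift from $\pi_{\mathrm{ref}}$.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```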

Although effective, the previous alignment methods for LLMs and VLMs may overlook the reasoning process of LRMs, leading to unsatisfactory alignment performance. To mitigate this challenge, various works focus on different aspects, including safe CoT data curation, SFT-based safety alignment on reasoning, and RL-based safety alignment on reasoning.

Safe CoT Data Curation.

First, Wang et al. (2025) build STAR-1, a 1k-scale safety dataset specifically designed for LRMs. SafeChain (Jiang et al., 2025), another safety training dataset in CoT style, is introduced to enhance the safety of LRMs. In addition, Zhang et al. (2025b) construct a dataset consisting of 15k safety-aware reasoning trajectories, generated by DeepSeek-R1, with explicit instructions designed to promote expected refusal behavior.

SFT-based Safety Alignment on Reasoning.

Based on the curated safe CoT data, researchers further conduct SFT to improve safety. For example, Jiang et al. (2025) train two LRMs with the SafeChain dataset, demonstrating that it not only enhances model safety but also preserves reasoning performance. Besides, RealSafe-R1 (Zhang et al., 2025b) is developed to make LRMs safer by training DeepSeek-R1 distilled models on the 15k safety-aware reasoning trajectories.

RL-based Safety Alignment on Reasoning.

In addition to SFT, further post-training techniques for safety build on reinforcement learning (RL). For example, deliberative alignment (Guan et al., 2024) teaches models the safety specifications directly and trains them, via reinforcement learning, to reason explicitly over these guidelines before generating responses. STAIR (Zhang et al., 2025c) uses Monte Carlo tree search and DPO (Rafailov et al., 2024) to integrate safety alignment with introspective reasoning. SaRO (Mou et al., 2025) incorporates safety-policy-driven reasoning into the alignment process. R2D (Zhu et al., 2025a) unlocks a safety-aware reasoning mechanism to defend against jailbreak attacks using the proposed contrastive pivot optimization (CPO).

However, safety alignment incurs a safety alignment tax (Lin et al., 2023a), compromising fundamental capabilities of LRMs such as reasoning (Huang et al., 2025). To mitigate this issue, researchers explore alternative defense techniques that do not require direct modifications to the victim models.

5.2 Inference-time Defenses for LRMs

To circumvent the safety alignment tax (Lin et al., 2023a; Huang et al., 2025), one line of work focuses on applying defenses at inference time. The insights from previous inference-time defenses for LLMs (Cheng et al., 2023; Lu et al., 2023) and VLMs (Wang et al., 2024a; Ghosal et al., 2024; Ding et al., 2024; Liu et al., 2025a), such as safe system prompting, few-shot safe demonstrations, and safe decoding, can be naturally transferred to LRMs, as the token generation mechanism is similar across these models.

However, the reasoning process in LRMs brings new challenges and opportunities for inference-time defenses. Therefore, various inference-time techniques like inference-time scaling on reasoning and safe decoding for reasoning are proposed to ensure the safety of reasoning in LRMs.

Inference-time Scaling on Reasoning.

Zaremba et al. (2025) demonstrate that the inference-time scaling on reasoning improves the safety and adversarial robustness of LRMs. Future work could explore dynamic scaling strategies tailored to input complexity, or integrate adaptive reasoning depth control to balance efficiency and safety performance (Liu et al., 2025c) during inference.

Safe Decoding for Reasoning.

Jiang et al. (2025) propose three decoding strategies, including ZeroThink, LessThink, and MoreThink, to verify model safety during reasoning. Making the reasoning safer at inference time could be a promising future direction, by verifying intermediate steps, filtering unsafe trajectories, or integrating reasoning-aware guard mechanisms during decoding.
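As a loose illustration of how the length of the reasoning segment can be constrained at decoding time, the sketch below builds forced prefixes that skip, shorten, or prolong the thinking phase. It is not the implementation of Jiang et al. (2025): the `<think>` delimiters, the strategy keys, the prefix strings, the continuation cue, and both function names are assumptions made for illustration.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"  # assumed reasoning delimiters


def forced_prefix(strategy: str) -> str:
    """Return text the decoder is forced to emit before free generation resumes."""
    if strategy == "zero":   # empty reasoning segment: skip thinking entirely
        return f"{THINK_OPEN}{THINK_CLOSE}"
    if strategy == "less":   # short, fixed reasoning segment
        return f"{THINK_OPEN}I can answer this without much thinking.{THINK_CLOSE}"
    if strategy == "more":   # open the segment and let the model keep reasoning
        return THINK_OPEN
    raise ValueError(f"unknown strategy: {strategy}")


def extend_thinking(trace: str, min_words: int, cue: str = " Wait,") -> str:
    """For the 'more' strategy: if the model closes its reasoning before
    `min_words` whitespace-separated words, drop the closing tag and append
    a cue so that decoding continues inside the reasoning segment."""
    if trace.endswith(THINK_CLOSE) and len(trace.split()) < min_words:
        return trace[: -len(THINK_CLOSE)] + cue
    return trace
```

In practice the returned string would be appended to the prompt (or re-fed to the model) before generation continues; the word count stands in for a token budget to keep the sketch free of tokenizer dependencies.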

5.3 Guard Models for LRMs

Another line of work that avoids direct modification of the victim model focuses on building guard models. The inference-time defenses above still focus on safer inference by the victim models themselves. In contrast, guard models moderate the input and output of the victim models without training them or modifying their inference strategies. The existing guard models for LLMs (Inan et al., 2023) or VLMs (Chi et al., 2024b) can also safeguard LRMs, since they share similar input and output formats. In addition, reasoning-based guard models (Liu et al., 2025b) can better moderate the reasoning process of LRMs by guiding the guard models to reason deliberately before making moderation decisions. We categorize existing guard models into two classes: classifier-based guard models and reasoning-based guard models.
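The moderation setting described above can be sketched as a thin wrapper around an unmodified victim model. This is a minimal illustration under stated assumptions: `guarded_generate`, `toy_victim`, `toy_guard`, and the refusal message are hypothetical names, and the keyword check merely stands in for a trained guard model.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # illustrative refusal message


def guarded_generate(prompt: str,
                     victim: Callable[[str], str],
                     guard: Callable[[str], bool]) -> str:
    """Moderate the input and the output of `victim` without touching the model.

    `guard(text)` returns True when the text is judged harmful.
    """
    if guard(prompt):                      # input moderation
        return REFUSAL
    response = victim(prompt)              # victim model runs unchanged
    if guard(prompt + "\n" + response):    # output moderation, with prompt as context
        return REFUSAL
    return response


def toy_victim(prompt: str) -> str:
    """Stand-in for an LRM; simply echoes the prompt."""
    return f"Echo: {prompt}"


def toy_guard(text: str) -> bool:
    """Stand-in for a guard model; a keyword check used only for illustration."""
    return "bomb" in text.lower()
```

For example, `guarded_generate("hello", toy_victim, toy_guard)` passes the response through, while a prompt containing the flagged keyword is refused before the victim model is ever called.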

Classifier-based Guard Models.

LLM guard models, including ToxicChat-T5 (Lin et al., 2023b), ToxDectRoberta (Zhou, 2020), LaGoNN (Bates and Gurevych, 2023), the LLaMA Guard series (Inan et al., 2023; Dubey et al., 2024), the Aegis Guard series (Ghosh et al., 2024a, b), WildGuard (Han et al., 2024), and ShieldGemma (Zeng et al., 2024a), are typically based on open-source LLMs fine-tuned on red-teaming data. In the VLM domain, LLaVAGuard (Helff et al., 2024) is built to conduct large-scale dataset annotation and moderate text-image models. In addition, VLMGuard (Du et al., 2024) is proposed to detect malicious image-text prompts by leveraging unlabeled user prompts. Moreover, LLaMA Guard 3-Vision (Chi et al., 2024a) is developed to moderate both the image-text input and text output of VLMs via SFT. To improve generalization ability, Ji et al. (2025) present Beaver-Guard-V by training a reward model and then applying reinforcement learning. Although effective, these are typically classifier-based guard models, which limits their ability to moderate reasoning data. To mitigate this problem, reasoning-based guard models (Liu et al., 2025b) are proposed to enhance the reasoning ability of guard models.

Reasoning-based Guard Models.

GuardReasoner (Liu et al., 2025b) uses the proposed reasoning SFT and hard-sample DPO to guide the guard model to reason deliberately before making moderation decisions, improving performance, generalization ability, and explainability. Similarly, ThinkGuard (Wen et al., 2025) is developed via the proposed critique-augmented fine-tuning. X-Guard (Upadhayay et al., 2025) extends the reasoning-based guard model to the multilingual scenario.
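Unlike a classifier that returns only a label, a reasoning-based guard emits its rationale followed by a decision, so a downstream system must separate the two. The sketch below assumes a hypothetical output format ending in a line such as `Verdict: harmful`; this format, the `Moderation` type, and `parse_guard_output` are illustrative assumptions and do not reflect the actual output templates of GuardReasoner, ThinkGuard, or X-Guard.

```python
import re
from typing import NamedTuple


class Moderation(NamedTuple):
    reasoning: str   # the guard's explanation, kept for explainability
    harmful: bool    # the final moderation decision


# Assumed format: free-form reasoning, then a final "Verdict: ..." line.
_VERDICT = re.compile(r"Verdict:\s*(harmful|unharmful)\s*$", re.IGNORECASE)


def parse_guard_output(text: str) -> Moderation:
    """Split a guard model's output into its reasoning and its verdict."""
    stripped = text.strip()
    match = _VERDICT.search(stripped)
    if match is None:
        # No parseable verdict: fail closed and treat the content as harmful.
        return Moderation(reasoning=stripped, harmful=True)
    reasoning = stripped[: match.start()].strip()
    return Moderation(reasoning, match.group(1).lower() == "harmful")
```

Failing closed on unparseable output is a design choice of this sketch; a deployment that prioritizes helpfulness over caution might instead retry the guard or defer to a fallback classifier.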

6 Future Directions

Beyond the detailed analysis of risks, attacks, and defenses presented in previous sections, this paper also identifies future directions that researchers should prioritize to enhance the safety of LRMs: (1) Standardized Evaluation Benchmarks. New benchmarks should focus on reasoning-specific vulnerabilities, as the research community currently lacks standardized evaluation frameworks to comprehensively test both the safety and robustness of LRMs’ multi-step reasoning processes. (2) Domain-Specific Evaluation Frameworks. Evaluation suites for healthcare, finance, and law must include curated case studies and targeted adversarial tests. Expert review ensures LRMs meet each domain’s accuracy and ethical requirements. (3) Human-in-the-Loop Alignment and Interpretability. Interactive tools should let experts inspect and refine reasoning traces. Iterative feedback can align LRMs with stakeholder values and correct biases efficiently.

7 Conclusion

This survey has comprehensively examined the emerging safety challenges posed by LRMs. We have identified unique vulnerabilities in these models that extend beyond traditional LLMs, mapping out the landscape of safety risks, adversarial attack vectors, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to facilitate future research that can enhance the security and reliability of these increasingly powerful AI systems while preserving their remarkable reasoning capabilities.

Limitations

This survey has inherent limitations due to the rapidly evolving nature of LRMs. Since the emergence of OpenAI’s o1 series, DeepSeek-R1, and other advanced reasoning models is relatively recent, our taxonomy and findings may become outdated as new research continuously emerges. While we have endeavored to provide a comprehensive overview of safety challenges, attacks, and defenses, we acknowledge that some aspects may require revision as the field matures. Additionally, our reliance on published academic literature may not fully capture proprietary research being conducted within companies developing these models, potentially creating gaps in understanding industry-specific safety measures.

References

Andriushchenko and Flammarion (2024) Maksym Andriushchenko and Nicolas Flammarion. 2024. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969.
Arrieta et al. (2025a) Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. 2025a. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749.
Arrieta et al. (2025b) Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. 2025b. o3-mini vs deepseek-r1: Which one is safer? arXiv preprint arXiv:2501.18438.
Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Bach et al. (2022) Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279.
Barkur et al. (2025) Sudarshan Kamath Barkur, Sigurd Schacht, and Johannes Scholl. 2025. Deception in llms: Self-preservation and autonomous goals in large language models. arXiv preprint arXiv:2501.16513.
Bates and Gurevych (2023) Luke Bates and Iryna Gurevych. 2023. Like a good nearest neighbor: Practical content moderation and text classification. arXiv preprint arXiv:2302.08957.
Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690.
Bondarenko et al. (2025) Alexander Bondarenko, Denis Volk, Dmitrii Volkov, and Jeffrey Ladish. 2025. Demonstrating specification gaming in reasoning models. arXiv preprint arXiv:2502.13295.
Chen et al. (2024) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. 2024. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.
Cheng et al. (2023) Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. 2023. Black-box prompt optimization: Aligning large language models without model training. arXiv preprint arXiv:2311.04155.
Cheng et al. (2025) Pengzhou Cheng, Wei Du, Zongru Wu, Fengwei Zhang, Libo Chen, Zhuosheng Zhang, and Gongshen Liu. 2025. Synghost: Invisible and universal task-agnostic backdoor attack via syntactic transfer.
Chi et al. (2024a) Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. 2024a. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414.
Chi et al. (2024b) Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. 2024b. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414.
Cuadron et al. (2025) Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and Joseph E. Gonzalez. 2025. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks.
Cui et al. (2025) Yu Cui, Bryan Hooi, Yujun Cai, and Yiwei Wang. 2025. Process or result? manipulated ending tokens can mislead reasoning llms to ignore the correct reasoning steps.
DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
Ding et al. (2024) Yi Ding, Bolian Li, and Ruqi Zhang. 2024. Eta: Evaluating then aligning safety of vision language models at inference time. arXiv preprint arXiv:2410.06625.
Du et al. (2024) Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, and Jack W Stokes. 2024. Vlmguard: Defending vlms against malicious prompts via unlabeled data. arXiv preprint arXiv:2410.00296.
Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In International Conference on Machine Learning. PMLR.
Fang et al. (2025) Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, and Tat-Seng Chua. 2025. Safemlrm: Demystifying safety in multi-modal large reasoning models.
Feng et al. (2023) Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
Gao et al. (2024) Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, and Min Lin. 2024. Denial-of-service poisoning attacks against large language models.
Ghosal et al. (2024) Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, and Amrit Singh Bedi. 2024. Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. arXiv preprint arXiv:2411.18688.
Ghosh et al. (2024a) Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. 2024a. Aegis: Online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993.
Ghosh et al. (2024b) Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. 2024b. Aegis2.0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. In Neurips Safe Generative AI Workshop 2024.
Guan et al. (2024) Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. 2024. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339.
Guo and Tourani (2025) Zhen Guo and Reza Tourani. 2025. Darkmind: Latent chain-of-thought backdoor in customized llms. arXiv preprint arXiv:2501.18617.
Han et al. (2024) Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495.
Hashemi et al. (2025) Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, and Vikas Yadav. 2025. Dnr bench: When silence is smarter–benchmarking over-reasoning in reasoning llms. arXiv preprint arXiv:2503.15793.
He et al. (2025) Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, and Bryan Hooi. 2025. Evaluating the paperclip maximizer: Are rl-based language models more likely to pursue instrumental goals? arXiv preprint arXiv:2502.12206.
Helff et al. (2024) Lukas Helff, Felix Friedrich, Manuel Brack, Patrick Schramowski, and Kristian Kersting. 2024. Llavaguard: Vlm-based safeguard for vision dataset curation and safety assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8322–8326.
Huang et al. (2025) Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. 2025. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555.
Huang et al. (2023) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yanghao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, Andre Freitas, and Mustafa A. Mustafa. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation.
Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
Ji et al. (2025) Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. 2025. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models. arXiv preprint arXiv:2503.17682.
Jiang et al. (2025) Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. 2025. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025.
Jin et al. (2024) Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. 2024. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.
Ke et al. (2023) Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2023. Critiquellm: Towards an informative critique generation model for evaluation of large language model generation. arXiv preprint arXiv:2311.18702.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
Kumar et al. (2025) Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. 2025. Overthinking: Slowdown attacks on reasoning llms. arXiv preprint arXiv:2502.02542.
Kumar et al. (2024) Surender Suresh Kumar, M.L. Cummings, and Alexander Stimpson. 2024. Strengthening llm trust boundaries: A survey of prompt injection attacks. In 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS), pages 1–6.
Kuo et al. (2025) Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. 2025. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking. arXiv preprint arXiv:2502.12893.
Li et al. (2024) Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. 2024. Llm defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221.
Li et al. (2023) Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.
Li et al. (2025a) Xuying Li, Zhuo Li, Yuji Kosuga, and Victor Bian. 2025a. Optimizing safe and aligned language generation: A multi-objective grpo approach. arXiv preprint arXiv:2503.21819.
Li et al. (2025b) Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. 2025b. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419.
Liang et al. (2023) Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
Lin et al. (2023a) Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, et al. 2023a. Mitigating the alignment tax of rlhf. arXiv preprint arXiv:2309.06256.
Lin et al. (2023b) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2023b. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. arXiv preprint arXiv:2310.17389.
Liu et al. (2025a) Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. 2025a. Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap. arXiv preprint arXiv:2502.10486.
Liu et al. (2023) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
Liu et al. (2025b) Yue Liu, Hongcheng Gao, Shengfang Zhai, Xia Jun, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. 2025b. Guardreasoner: Towards reasoning-based llm safeguards. arXiv preprint arXiv:2501.18492.
Liu et al. (2025c) Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. 2025c. Efficient inference for large reasoning models: A survey. arXiv preprint arXiv:2503.23077.
Liu et al. (2024) Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. 2024. Safety alignment for vision language models. arXiv preprint arXiv:2405.13581.
Lu et al. (2023) Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, et al. 2023. Inference-time policy adapters (ipa): Tailoring extreme-scale lms without fine-tuning. arXiv preprint arXiv:2305.15065.
Luong et al. (2024) Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967.
Marjanović et al. (2025) Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, et al. 2025. Deepseek-r1 thoughtology: Let’s <think> about llm reasoning. arXiv preprint arXiv:2504.07128.
Meta (2024) Meta. 2024. The llama 3 herd of models.
Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540):529–533.
Mou et al. (2025) Yutao Mou, Yuxiao Luo, Shikun Zhang, and Wei Ye. 2025. Saro: Enhancing llm safety through reasoning-based alignment. arXiv preprint arXiv:2504.09420.
OpenAI (2024) OpenAI. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35.
Qi et al. (2021) Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021. Hidden killer: Invisible textual backdoor attacks with syntactic trigger.
Qiu et al. (2025) Jianing Qiu, Lin Li, Jiankai Sun, Hao Wei, Zhe Xu, Kyle Lam, and Wu Yuan. 2025. Emerging cyber attack risks of medical ai agents. arXiv preprint arXiv:2504.03759.
Qu et al. (2025) Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, et al. 2025. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614.
Qwen et al. (2025) Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 technical report.
Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2024. Gpqa: A graduate-level google-proof qa benchmark. In First Conference on Language Modeling.
Ren et al. (2024) Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2024. Derail yourself: Multi-turn llm jailbreak attack through self-discovered clues. arXiv preprint arXiv:2410.10700.
Romero-Arjona et al. (2025) Miguel Romero-Arjona, Pablo Valle, Juan C Alonso, Ana B Sánchez, Miriam Ugarte, Antonia Cazalilla, Vicente Cambrón, José A Parejo, Aitor Arrieta, and Sergio Segura. 2025. Red teaming contemporary ai models: Insights from spanish and basque perspectives. arXiv preprint arXiv:2503.10192.
Russinovich et al. (2024) Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2024. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833.
Shah et al. (2023) Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. 2023. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
Shen et al. (2023) Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. 2023. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025.
Shi et al. (2024) Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, and Deyi Xiong. 2024. Large language model safety: A holistic survey.
Shumailov et al. (2021) Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. 2021. Sponge examples: Energy-latency attacks on neural networks. In 2021 IEEE European Symposium on Security and Privacy, pages 212–231.
Sun et al. (2024) Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, and Hui Li. 2024. Multi-turn context jailbreak attack on large language models from first principles. arXiv preprint arXiv:2408.04686.
Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
Team et al. (2025) Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
Team (2024a) Qwen Team. 2024a. Qvq: To see the world with wisdom. https://qwenlm.github.io/blog/qvq-72b-preview/.
Team (2024b) Qwen Team. 2024b. Qwq: Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/.
Upadhayay et al. (2025) Bibek Upadhayay, Vahid Behzadan, et al. 2025. X-guard: Multilingual guard agent for content moderation. arXiv preprint arXiv:2504.08848.
Wang et al. (2022a) Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. 2022a. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. Advances in Neural Information Processing Systems, 35:35811–35824.
Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091.
Wang et al. (2024a) Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. 2024a. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance. arXiv preprint arXiv:2401.11206.
Wang et al. (2022b) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
Wang et al. (2022c) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022c. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
Wang et al. (2025) Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R Bartoldson, Bhavya Kailkhura, and Cihang Xie. 2025. Star-1: Safer alignment of reasoning llms with 1k data. arXiv preprint arXiv:2504.01903.
Wang et al. (2024b) Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024b. Chain-of-table: Evolving tables in the reasoning chain for table understanding.
Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning, 8:279–292.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Welbl et al. (2021) Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445.
Wen et al. (2025) Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. 2025. ThinkGuard: Deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458.
Weng et al. (2025) Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, and Wenjie Wang. 2025. Adversary-aware DPO: Enhancing safety alignment in vision language models via adversarial training. arXiv preprint arXiv:2502.11455.
Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
Wu et al. (2025) Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, and Yisen Wang. 2025. When more is less: Understanding chain-of-thought length in LLMs. arXiv preprint arXiv:2502.07266.
Xiang et al. (2024) Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. 2024. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242.
Xu et al. (2023) Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. 2023. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710.
Xu et al. (2025) Rongwu Xu, Xiaojian Li, Shuo Chen, and Wei Xu. 2025. Nuclear deployed: Analyzing catastrophic risks in decision-making of autonomous LLM agents. arXiv preprint arXiv:2502.11355.
Yang et al. (2025) Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. 2025. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.
Yao et al. (2024a) Hongwei Yao, Jian Lou, and Zhan Qin. 2024a. PoisonPrompt: Backdoor attack on prompt-based large language models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7745–7749.
Yao et al. (2024b) Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. 2024b. Mulberry: Empowering MLLM with o1-like reasoning and reflection via collective Monte Carlo tree search.
Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models.
Yao et al. (2025) Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, and Yingchun Wang. 2025. A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos. arXiv preprint arXiv:2502.15806.
Ye et al. (2025) Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. 2025. A survey of safety on large vision-language models: Attacks, defenses and evaluations. arXiv preprint arXiv:2502.14881.
Yi et al. (2024) Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey.
Ying et al. (2025a) Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, and Dacheng Tao. 2025a. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054.
Ying et al. (2025b) Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, and Dacheng Tao. 2025b. Towards understanding the safety boundaries of DeepSeek models: Evaluation and findings. arXiv preprint arXiv:2503.15092.
Zaremba et al. (2025) Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, and Amelia Glaese. 2025. Trading inference-time compute for adversarial robustness.
Zeng et al. (2024a) Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. 2024a. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772.
Zeng et al. (2024b) Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024b. How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350.
Zhang et al. (2025a) Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, et al. 2025a. Safety evaluation of DeepSeek models in Chinese contexts. arXiv preprint arXiv:2502.11137.
Zhang et al. (2025b) Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, and Yinpeng Dong. 2025b. RealSafe-R1: Safety-aligned DeepSeek-R1 without compromising reasoning capability. arXiv preprint arXiv:2504.10081.
Zhang et al. (2025c) Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. 2025c. STAIR: Improving safety alignment with introspective reasoning. arXiv preprint arXiv:2502.02384.
Zhang et al. (2024) Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, et al. 2024. SPA-VL: A comprehensive safety preference alignment dataset for vision language model. arXiv preprint arXiv:2406.12030.
Zhao et al. (2025) Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, and Athanasios V Vasilakos. 2025. ShadowCoT: Cognitive hijacking for stealthy reasoning backdoors in LLMs. arXiv preprint arXiv:2504.05605.
Zhao et al. (2024) Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, and Luu Anh Tuan. 2024. A survey of backdoor attacks and defenses on large language models: Implications for security measures. Authorea Preprints.
Zhou et al. (2025) Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, and Xin Eric Wang. 2025. The hidden risks of large reasoning models: A safety assessment of R1. arXiv preprint arXiv:2502.12659.
Zhou (2020) Xuhui Zhou. 2020. Challenges in automated debiasing for toxic language detection. University of Washington.
Zhu et al. (2025a) Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, and Lei Sha. 2025a. Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking. arXiv preprint arXiv:2502.12970.
Zhu et al. (2025b) Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, and Baoyuan Wu. 2025b. BoT: Breaking long thought processes of o1-like large language models through backdoor attack. arXiv preprint arXiv:2502.12202.