Large Language Models (LLMs) are increasingly used as automated judges for evaluating text quality, code correctness, and argument strength. However, these LLM-as-a-judge systems are vulnerable to adversarial attacks that can manipulate their assessments. This paper investigates the vulnerability of LLM-as-a-judge systems to prompt injection attacks, drawing on both the academic literature and practical solutions from the ``LLMs: You Can't Please Them All'' Kaggle competition. We present a comprehensive framework for developing and evaluating adversarial attacks against LLM judges, distinguishing between content-author attacks and system-prompt attacks. Our experimental evaluation spans five models (Gemma-3-27B-Instruct, Gemma-3-4B-Instruct, Llama-3.2-3B-Instruct, and the frontier models GPT-4 and Claude-3-Opus), four distinct evaluation tasks, and multiple defense mechanisms with precisely specified implementations. Through rigorous statistical analysis ($n=50$ prompts per condition, bootstrap confidence intervals), we demonstrate that sophisticated attacks achieve success rates of up to 73.8\% against popular LLM judges, with Contextual Misdirection the most effective method against Gemma models at 67.7\%. We find that smaller models such as Gemma-3-4B-Instruct are more vulnerable (65.9\% average success rate) than their larger counterparts, and that attacks transfer well (50.5--62.6\%) across architectures. We compare our approach with recent work, including Universal-Prompt-Injection \cite{liu2024automatic} and AdvPrompter \cite{paulus2024advprompter}, demonstrating both complementary insights and novel contributions. Our findings highlight critical vulnerabilities in current LLM-as-a-judge systems and yield recommendations for building more robust evaluation frameworks, including the use of multi-model committees with diverse architectures and a preference for comparative assessment over absolute scoring. To ensure reproducibility, we release our code, evaluation harness, and processed datasets.
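
To make the statistical setup concrete, the following is a minimal sketch of how a percentile-bootstrap confidence interval for an attack success rate could be computed over the $n=50$ per-prompt binary outcomes mentioned above; the function name, resample count, and example data are illustrative assumptions, not the released evaluation harness.

\begin{verbatim}
import numpy as np

def bootstrap_success_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    # Percentile bootstrap CI for an attack success rate.
    # outcomes: array of 0/1 indicators, one per evaluated prompt.
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(outcomes)
    # Resample prompts with replacement; recompute the mean each time.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = outcomes[idx].mean(axis=1)
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return outcomes.mean(), lower, upper

# Hypothetical example: 50 prompts, 33 successful attacks (66% point estimate).
example = np.array([1] * 33 + [0] * 17)
print(bootstrap_success_ci(example))
\end{verbatim}

Reporting the percentile interval alongside the point estimate keeps comparisons between attack methods and judge models honest at this sample size, since a 50-prompt success rate alone can vary by several percentage points.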