TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification. arXiv, 2024
Large Language Models (LLMs) come with usage rules that protect their providers' interests and prevent misuse. This study introduces Black-box Identity Verification (BBIV): determining, for compliance purposes, whether a service uses a specific LLM when the only access to that service is through its chat interface. The proposed method, Targeted Random Adversarial Prompt (TRAP), appends an adversarial suffix to a prompt so that the specific LLM returns a pre-defined answer, while other models return random answers. A service that returns the pre-defined answer is therefore likely running the target model. TRAP offers a novel approach to checking compliance with LLM usage policies.
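The verification step described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `verify_identity` function, the placeholder `SUFFIX` string, the pre-defined answer `"314"`, the number of trials, the decision threshold, and the two mock models that stand in for real chat services. How an adversarial suffix is actually obtained is outside the scope of this sketch.

```python
import random
from typing import Callable

# Hypothetical placeholders: a real suffix would be crafted for the target LLM.
PROMPT = "Write a random number between 100 and 999."
SUFFIX = "<adversarial-suffix-placeholder>"
TARGET_ANSWER = "314"

rng = random.Random(0)  # seeded so the mock models are reproducible


def target_model(message: str) -> str:
    """Mock of the specific LLM: the suffix steers it to the pre-defined answer."""
    if SUFFIX in message:
        return TARGET_ANSWER
    return str(rng.randint(100, 999))


def other_model(message: str) -> str:
    """Mock of any other LLM: the suffix has no effect, so the answer is random."""
    return str(rng.randint(100, 999))


def verify_identity(
    query: Callable[[str], str],
    trials: int = 20,
    threshold: float = 0.5,
) -> bool:
    """Return True if the service gives the pre-defined answer often enough.

    `query` sends one chat message to the black-box service and returns its reply.
    """
    hits = sum(
        TARGET_ANSWER in query(f"{PROMPT} {SUFFIX}") for _ in range(trials)
    )
    return hits / trials >= threshold


if __name__ == "__main__":
    print("target model identified:", verify_identity(target_model))
    print("other model identified:", verify_identity(other_model))
```

Repeating the query several times and thresholding the hit rate is a design choice of this sketch: a model unaffected by the suffix matches a three-digit answer only by chance, so a high hit rate is strong evidence for the target model.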