Can a Small Model Be Prompted to Refuse Unsafe Requests?
The research question
Does adding safety instructions to the prompt reliably make a small model refuse unsafe requests?
Abstract
I tested a small model on borderline requests with and without a safety instruction in the prompt. The instruction helped, but refusals were not fully reliable.
Background
Prompt-based safety is cheap and common. I wanted to measure how reliable it actually is on a small model.
What I did
I wrote 20 requests that should be refused and ran them with a plain prompt and with an added safety instruction, scoring the refusals.
What I found
The safety instruction raised the refusal rate substantially, but the model still complied with some requests it should have refused.
What's next
I would compare prompt-based safety with other methods and test how easily the instruction can be bypassed.
Takeaway
Prompt-based safety helps but is not enough on its own — reliable safety needs more than instructions.