# Tuning thresholds

How to pick `auto_threshold` and `review_threshold` using data, not vibes.

## Don't tune on day one

Ship with the defaults (`auto_threshold = 0.95`, `review_threshold = 0.70`). Let at least a few hundred items flow through. You cannot tune what you cannot measure.

## The calibration plot

Open `/analytics` → **Calibration**. The plot bins items by confidence and shows the override rate in each bin. A well-calibrated model produces something like:

```
override rate
  │
  │ ████
  │ ████ ███
  │ ████ ███ ██
  │ ████ ███ ██ ██
  │ ████ ███ ██ ██ █
  └────────────────────── confidence
   0.5  0.6  0.7  0.8  0.9
```

The override rate decreases monotonically as confidence rises. The bars at the right (high confidence) are short because those items are usually right.
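
If you want to reproduce the plot's numbers offline, the binning can be sketched in a few lines. This is a minimal sketch, not LoopDesk's implementation: the function name, the `(confidence, overridden)` input shape, and the ten equal-width bins are all assumptions made for illustration.

```python
def override_rate_by_bin(items, n_bins=10):
    """Group (confidence, overridden) pairs into equal-width confidence bins.

    `items` is a hypothetical export: confidence in [0, 1], overridden is a bool.
    Returns one dict per non-empty bin, ordered by increasing confidence.
    """
    totals = [0] * n_bins
    overrides = [0] * n_bins
    for confidence, overridden in items:
        # A confidence of exactly 1.0 is folded into the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        totals[idx] += 1
        overrides[idx] += int(overridden)
    return [
        {
            "low": i / n_bins,
            "high": (i + 1) / n_bins,
            "n": totals[i],
            "override_rate": overrides[i] / totals[i],
        }
        for i in range(n_bins)
        if totals[i] > 0
    ]


# Made-up sample data, for illustration only.
sample = [(0.55, True), (0.55, False), (0.92, True),
          (0.95, False), (0.97, False), (0.99, False)]
for row in override_rate_by_bin(sample):
    print(row)
```

Bins with only a handful of items produce noisy rates, so keep the `n` column next to each rate when reading the result.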

## What to look for

### Healthy: smooth decrease

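Lower `auto_threshold` one bin at a time, so that more items are auto-approved, and stop before the first bin whose override rate exceeds your tolerance. The lower edge of the last acceptable bin is your new `auto_threshold`.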
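
One way to read a threshold off a healthy plot is to start at the highest-confidence bin and extend auto-approval downward one bin at a time, stopping before the first bin whose override rate exceeds the tolerance. The sketch below assumes bins given as `(low_edge, n_items, override_rate)` tuples. The function name, the `min_items` guard, and the sample numbers are hypothetical.

```python
def pick_auto_threshold(bins, tolerance, min_items=30):
    """Return the lowest bin edge that keeps every auto-approved bin within tolerance.

    bins: list of (low_edge, n_items, override_rate) tuples.
    Returns None when even the top bin is too error-prone or too sparse.
    """
    threshold = None
    # Walk from the highest-confidence bin downward.
    for low_edge, n_items, override_rate in sorted(bins, reverse=True):
        if n_items < min_items or override_rate > tolerance:
            break  # this bin is not safe to auto-approve; stop extending
        threshold = low_edge
    return threshold


# Hypothetical bins from a healthy, monotonically decreasing plot.
bins = [
    (0.5, 100, 0.30),
    (0.6, 120, 0.18),
    (0.7, 150, 0.08),
    (0.8, 200, 0.02),
    (0.9, 400, 0.01),
]
print(pick_auto_threshold(bins, tolerance=0.03))  # 0.8
```

A `None` result means no bin qualifies. That is a signal to keep routing to humans rather than to relax the tolerance.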

### Broken: U-shape

High override rate at both low *and* high confidence. The model is confidently wrong on some pattern. Leave `auto_threshold` alone: no confidence cutoff can separate these errors, because they sit at the top of the confidence range. Find the pattern (look at `/learnings` filtered to high-confidence overrides), turn it into a risk flag, and route those items to humans regardless of confidence.

### Broken: flat

Override rate is roughly the same across all confidence bins. Your confidence signal is not actually informative. Options:

1. Recalibrate your scoring (temperature scaling, Platt scaling, isotonic regression)
2. Stop relying on confidence; route everything by risk flags only
3. Replace the scoring component
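
The plot shapes described in this section (smooth decrease, U-shape, flat) can also be told apart programmatically, for example in a scheduled check. The sketch below is a rough heuristic under stated assumptions: the function name and the `margin` of 5 percentage points are invented for illustration, and `rates` is the list of per-bin override rates ordered by increasing confidence.

```python
def classify_shape(rates, margin=0.05):
    """Label a calibration curve as 'flat', 'u_shape', 'healthy', or 'irregular'.

    rates: per-bin override rates, ordered from low to high confidence.
    margin: differences smaller than this are treated as noise (an assumption).
    """
    lowest = min(rates)
    if max(rates) - lowest < margin:
        return "flat"  # confidence carries almost no information
    if rates[0] - lowest >= margin and rates[-1] - lowest >= margin:
        return "u_shape"  # both ends sit well above the minimum
    if all(later <= earlier + margin for earlier, later in zip(rates, rates[1:])):
        return "healthy"  # non-increasing, up to noise
    return "irregular"


print(classify_shape([0.30, 0.18, 0.08, 0.02, 0.01]))  # healthy
print(classify_shape([0.30, 0.10, 0.03, 0.12, 0.25]))  # u_shape
print(classify_shape([0.11, 0.12, 0.10, 0.12, 0.11]))  # flat
```

Treat the label as a prompt to look at the plot, not as a replacement for it. Sparse bins can easily push a healthy curve into `irregular`.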

### Broken: bimodal flags

A risk flag's override rate is either \~0% or \~100% — never in between. If \~0%, the flag is doing nothing useful (remove it). If \~100%, the flag should be a `hard_block` rule, not a queue trigger.
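
The same rule can be applied across all flags at once. The sketch below is illustrative only: the function name, the 2% and 98% cutoffs standing in for "\~0%" and "\~100%", and the minimum sample size are assumptions, and the returned labels are not LoopDesk settings.

```python
def flag_recommendation(fired, overridden, low=0.02, high=0.98, min_fired=50):
    """Suggest what to do with a risk flag from its override history.

    fired: number of items the flag triggered on.
    overridden: how many of those items were overridden.
    """
    if fired < min_fired:
        return "insufficient_data"  # too few samples to judge the flag
    rate = overridden / fired
    if rate <= low:
        return "remove"  # the flag is not catching anything
    if rate >= high:
        return "promote_to_hard_block"  # items with this flag are always overridden
    return "keep"  # the flag is an informative queue trigger


# Hypothetical flag names and counts.
history = {"pii_detected": (200, 199), "long_reply": (200, 1), "refund_mentioned": (200, 60)}
for name, (fired, overridden) in history.items():
    print(name, flag_recommendation(fired, overridden))
```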

## Acceptable error rate

Your `auto_threshold` is implicitly an SLA: *"I accept that auto-approved items at this confidence have at most X% error rate."* Be explicit about X.

| Use case                             | Typical tolerance                                       |
| ------------------------------------ | ------------------------------------------------------- |
| Internal tagging                     | 5–10%                                                   |
| Customer-facing replies (low stakes) | 1–3%                                                    |
| Payments / refunds                   | <0.5% (often: human always)                             |
| Legal / medical                      | 0% (human always; LoopDesk gates with `escalate_flags`) |

If your tolerance is 0%, don't set `auto_threshold = 1.0` — set the project to route everything to humans with `force_review_flags` covering all cases.
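
When comparing a bin against a tolerance, remember that an observed rate from a few hundred items is only an estimate. One common way to account for that is to compare the tolerance against an upper confidence bound instead of the raw rate. The sketch below uses the Wilson score interval. This is a suggested technique, not something LoopDesk is known to compute, and the function name is hypothetical.

```python
import math


def wilson_upper_bound(errors, n, z=1.96):
    """Upper limit of the Wilson score interval for an error rate.

    errors: overridden items in the bin; n: total items in the bin.
    z=1.96 gives the upper end of a two-sided 95% interval.
    """
    if n == 0:
        return 1.0  # no data: assume the worst
    p = errors / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / denom


# Zero overrides in 300 items still leaves room for a true rate above 1%.
print(round(wilson_upper_bound(0, 300), 4))
# Three overrides in 300 items (1% observed) is compatible with nearly 3%.
print(round(wilson_upper_bound(3, 300), 4))
```

For tight tolerances such as under 0.5%, this shows why a few hundred items are not enough evidence to auto-approve.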

## When to retune

Retune after **any** of:

* Model swap (GPT-X to GPT-Y, or vendor change)
* Prompt change beyond a typo
* Major guideline update
* Material change in input distribution (new region, new product line, new customer tier)
* Quarter boundary, as a sanity check

Calibration is not a property of your model. It's a property of the (model, prompt, input distribution) tuple. If any one of them changes, calibration changes with it.

## Common mistake: tuning by queue size

If the queue is overwhelming, the *first* response is not to lower `auto_threshold` so that more items are auto-approved. It is to ask:

1. Is the calibration plot still monotonic? → if not, fix the model, not the threshold
2. Are reviewers spending too long per item? → fix the queue UI / guidelines
3. Are the right items being queued? → check that `force_review_flags` aren't over-firing
4. *Then*, if all of the above are healthy, consider lowering `auto_threshold`, and accept the implied increase in auto-error.


