llama31-8b-ade-sft-v2

A LoRA adapter for meta-llama/Llama-3.1-8B-Instruct that answers adverse drug event (ADE) questions on single-sentence clinical text and extracts the implicated drug and event as structured JSON. Distilled from a Vertex-hosted Llama 3.3 70B teacher; trained with QLoRA on ~3k teacher-labeled sentences from ade_corpus_v2.

⚠️ Not clinical grade. This is a research / educational artifact. Do not use for patient-care decisions.

Intended use

Given a short clinical vignette (one or a few sentences), produce a JSON object:

{
  "answer": "yes | no | abstain",
  "drug": "<drug name or empty>",
  "event": "<adverse event or empty>",
  "evidence": "<quoted or closely paraphrased text>",
  "short_justification": "<one short sentence>",
  "confidence": 0.0
}
  • answer is yes only when the text supports a causally plausible drug-event relationship.
  • abstain is reserved for cases where the text names no plausible drug or no plausible event. Temporal co-occurrence with a clear external cause (e.g., "on metformin, slipped and fractured ankle") should be no, not abstain.
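
Downstream code usually needs to check the reply against this schema before trusting it. The following is a minimal sketch, not part of the released pipeline: `parse_ade_output` is a hypothetical helper, and both the tolerance for stray text around the JSON object and the 0 to 1 range for `confidence` are assumptions.

```python
import json

ALLOWED_ANSWERS = {"yes", "no", "abstain"}
REQUIRED_KEYS = (
    "answer", "drug", "event", "evidence", "short_justification", "confidence",
)


def parse_ade_output(text: str) -> dict:
    """Parse a model reply and validate it against the schema shown above."""
    # Assumption: the reply may carry stray text around the JSON object,
    # so keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(text[start : end + 1])  # JSONDecodeError is a ValueError

    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")

    answer = str(obj["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {obj['answer']!r}")
    obj["answer"] = answer

    # Assumption: confidence is a probability-like score in [0, 1].
    confidence = float(obj["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    obj["confidence"] = confidence
    return obj
```

Raising on malformed replies, instead of silently defaulting to abstain, keeps parsing failures visible and separate from genuine abstentions.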

Evaluation

Held-out split (200 rows, balanced 100 positive / 100 negative) sampled from ade_corpus_v2 and never seen during training. Compared against a v1 baseline that did not use few-shots or hard negatives.

Metric                         v1      v2 (this model)
exact_match (yes/no/abstain)   0.555   0.715
abstain_rate                   0.315   0.135
positive_f1                    0.884   0.860
positive_precision             0.798   0.785
positive_recall                0.990   0.950
span_drug_exact_match (pos)    0.940   0.840
span_drug_token_f1 (pos)       0.952   0.883
span_event_exact_match (pos)   0.660   0.710
span_event_token_f1 (pos)      0.816   0.866
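
As a sanity check on how these rows relate, positive_f1 is the harmonic mean of positive_precision and positive_recall. The token-level F1 is sketched here with a SQuAD-style definition (lowercased whitespace tokens, multiset overlap). That tokenization is an assumption; the exact scoring code is in the repository linked under Reproducibility.

```python
from collections import Counter


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token F1 (assumed definition, see lead-in)."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    return f1(overlap / len(pred_tokens), overlap / len(gold_tokens))


print(round(f1(0.785, 0.950), 3))  # v2 row: 0.86
print(round(f1(0.798, 0.990), 3))  # v1 row: 0.884
print(round(token_f1("severe rash", "rash"), 3))  # 0.667
```

The last line shows why token F1 sits above exact match for event spans: a prediction that adds or drops a modifier fails exact match but still earns partial credit.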

Tradeoff to know. v2 adds 600 "hard negatives" (drug mentioned, answer=no) to teach calibrated abstention. This more than halved the abstain rate (0.315 to 0.135) and added 16 pts of exact_match, but cost 10 pts of drug-span exact match vs v1 (0.940 to 0.840): the model learned to be more cautious about emitting a drug name. If your use case needs drug extraction on positives above all else, the earlier v1 checkpoint may be preferable.

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "Ventali/llama31-8b-ade-sft-v2"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "system", "content": "You are a careful biomedical assistant. For each case, return a compact JSON answer grounded in the provided evidence. If the evidence is insufficient, abstain."},
    {"role": "user", "content": "Case: The patient developed diffuse urticaria three days after starting amoxicillin.\n\nIs this consistent with a possible adverse drug event? Identify the drug and event if so, or abstain if the evidence is insufficient."},
]
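# Render the chat as text, ending with the assistant header so the model starts its reply.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered template already starts with the BOS token, so do not add special tokens again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    # Greedy decoding keeps the JSON output deterministic.
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))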
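
To run the adapter over many sentences, the prompt construction can be factored into a small helper. This is a sketch: `build_messages` is a hypothetical name, and the system prompt and user template are copied from the snippet above on the assumption that they match the training prompt. The repository is the authoritative source for that.

```python
SYSTEM_PROMPT = (
    "You are a careful biomedical assistant. For each case, return a compact "
    "JSON answer grounded in the provided evidence. If the evidence is "
    "insufficient, abstain."
)
USER_TEMPLATE = (
    "Case: {case}\n\n"
    "Is this consistent with a possible adverse drug event? Identify the drug "
    "and event if so, or abstain if the evidence is insufficient."
)


def build_messages(case_text: str) -> list:
    """Return chat messages for one clinical sentence."""
    # Collapse runs of whitespace so line breaks in the input
    # do not change the prompt layout.
    case = " ".join(case_text.split())
    if not case:
        raise ValueError("case_text is empty")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_TEMPLATE.format(case=case)},
    ]
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as `messages` is in the snippet above.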

For Apple Silicon you can fuse the adapter into the base and run via mlx-lm:

pip install mlx-lm
mlx_lm.fuse --model meta-llama/Llama-3.1-8B-Instruct \
  --adapter-path <local-adapter-dir> \
  --save-path ~/models/llama31-ade-mlx
mlx_lm.generate --model ~/models/llama31-ade-mlx --prompt "..."

Training

  • Base: meta-llama/Llama-3.1-8B-Instruct, loaded in 4-bit (NF4, double-quant, bf16 compute).
  • LoRA: r=32, alpha=64, dropout=0.05, target modules {q,k,v,o,gate,up,down}_proj. 41.9M trainable params (0.52% of base).
  • Data: 2,999 (prompt, teacher JSON) pairs after filtering. The 3,000 prompts were drawn from ade_corpus_v2: 1,200 positive (from drug_ade_relation), 1,200 easy-negative, and 600 hard-negative (classification label=0 rows whose text mentions a drug from the positive-split vocabulary). Teacher: Vertex AI managed llama-3.3-70b-instruct-maas (temperature 0.2), seeded with 3 yes/no/abstain few-shots and prompted to reserve abstention for cases with no plausible drug or no plausible event.
  • Filter: required non-empty answer and evidence, confidence ≥ 0.65, evidence-source word overlap ≥ 0.6. 2,999/3,000 retained.
  • Optimizer: AdamW, lr=2e-4, warmup_ratio=0.03, weight_decay=0.01, bf16, gradient_checkpointing on.
  • 3 epochs with load_best_model_at_end=True on eval_loss; the epoch-1 checkpoint (eval_loss 0.506) was restored because epochs 2 and 3 overfit (eval_loss 0.547 and 0.676).
  • Hardware: single A100 40GB on GCP a2-highgpu-1g. Training wall time ~94 min.
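
The hard-negative selection and the teacher-label filter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names are hypothetical, "word overlap" is taken to mean the fraction of evidence words that also occur in the source sentence, and the drug vocabulary is treated as a set of single lowercase words.

```python
import re


def _words(text: str) -> list:
    """Lowercase alphanumeric word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_hard_negative(text: str, label: int, drug_vocab: set) -> bool:
    """A label=0 row whose text mentions a drug seen in the positive split."""
    # Simplification: multi-word drug names are not matched here.
    return label == 0 and any(word in drug_vocab for word in _words(text))


def evidence_overlap(evidence: str, source: str) -> float:
    """Fraction of evidence words that also appear in the source (assumed)."""
    evidence_words = _words(evidence)
    if not evidence_words:
        return 0.0
    source_words = set(_words(source))
    return sum(w in source_words for w in evidence_words) / len(evidence_words)


def keep_row(source: str, teacher: dict,
             min_conf: float = 0.65, min_overlap: float = 0.6) -> bool:
    """Apply the filter: non-empty answer and evidence, confidence, overlap."""
    answer = str(teacher.get("answer", "")).strip()
    evidence = str(teacher.get("evidence", "")).strip()
    if not answer or not evidence:
        return False
    try:
        confidence = float(teacher.get("confidence", 0.0))
    except (TypeError, ValueError):
        return False
    if confidence < min_conf:
        return False
    return evidence_overlap(evidence, source) >= min_overlap
```

The overlap check is a cheap guard against teacher evidence that is not grounded in the input sentence; it does not verify that the quoted span actually supports the answer.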

Limitations

  • Trained on single-sentence, literature-style clinical text. Longer narratives (discharge summaries, EHR free-text) are out of distribution and will likely perform worse.
  • Teacher labels are synthetic. A clinician-reviewed eval set was not used; regressions against human judgment have not been measured.
  • The model occasionally produces an empty drug or event field on positive cases. For the drug field this is a regression from v1 (see the tradeoff note above); event-span extraction improved over v1.
  • English only.

Reproducibility

Full pipeline (seed building, teacher generation config, filter, SFT prep, training, evaluation) lives at https://github.com/ventali/medical-distill. Commit 547629f records this adapter's metrics.

License

Inherits the Llama 3.1 Community License from the base model.
