llama31-8b-ade-sft-v2
A LoRA adapter for meta-llama/Llama-3.1-8B-Instruct that answers adverse drug event (ADE) questions on single-sentence clinical text and extracts the implicated drug and event as structured JSON. Distilled from a Vertex-hosted Llama 3.3 70B teacher; trained with QLoRA on ~3k teacher-labeled sentences from ade_corpus_v2.
⚠️ Not clinical grade. This is a research / educational artifact. Do not use for patient-care decisions.
Intended use
Given a short clinical vignette (one or a few sentences), produce a JSON object:
{
"answer": "yes | no | abstain",
"drug": "<drug name or empty>",
"event": "<adverse event or empty>",
"evidence": "<quoted or closely paraphrased text>",
"short_justification": "<one short sentence>",
"confidence": 0.0
}
`answer` is `yes` only when the text supports a causally plausible drug-event relationship. `abstain` is reserved for cases where the text names no plausible drug or no plausible event. Temporal co-occurrence with a clear external cause (e.g., "on metformin, slipped and fractured ankle") should be `no`, not `abstain`.
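Downstream code usually wants to check a reply against this schema before trusting it. The following is a minimal sketch, not part of the released pipeline: the function name `parse_ade_output`, the tolerance for text around the JSON object, and the assumption that `confidence` lies in [0, 1] are all illustrative choices.

```python
import json

ALLOWED_ANSWERS = {"yes", "no", "abstain"}
REQUIRED_KEYS = (
    "answer", "drug", "event", "evidence", "short_justification", "confidence",
)


def parse_ade_output(text):
    """Parse a model reply and check it against the schema described above.

    Returns the parsed dict, or raises ValueError if the reply does not conform.
    """
    # Assumption: the reply may carry stray text around the JSON object,
    # so only the outermost {...} span is parsed.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    missing = [key for key in REQUIRED_KEYS if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if obj["answer"] not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {obj['answer']!r}")

    conf = obj["confidence"]
    # Assumption: confidence is a number in [0, 1]; bool is rejected explicitly
    # because it is a subclass of int in Python.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= conf <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return obj
```

Raising on any deviation, rather than silently repairing the reply, keeps malformed generations visible, which matters for a model that is explicitly not clinical grade.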
Evaluation
Held-out split (200 rows, balanced 100 positive / 100 negative) sampled from ade_corpus_v2 and never seen during training. Compared against a v1 baseline that did not use few-shots or hard negatives.
| Metric | v1 | v2 (this model) |
|---|---|---|
| exact_match (yes/no/abstain) | 0.555 | 0.715 |
| abstain_rate | 0.315 | 0.135 |
| positive_f1 | 0.884 | 0.860 |
| positive_precision | 0.798 | 0.785 |
| positive_recall | 0.990 | 0.950 |
| span_drug_exact_match (pos) | 0.940 | 0.840 |
| span_drug_token_f1 (pos) | 0.952 | 0.883 |
| span_event_exact_match (pos) | 0.660 | 0.710 |
| span_event_token_f1 (pos) | 0.816 | 0.866 |
Tradeoff to know. v2 adds 600 "hard negatives" (drug mentioned, answer=no) to teach calibrated abstention. This more than halved the abstain rate (0.315 to 0.135) and added 16 pts of exact_match, but cost ~10 pts of drug-span exact match vs v1: the model learned to be more cautious about emitting a drug name. If your use case needs drug extraction on positives above all else, the earlier v1 checkpoint may be preferable.
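For readers who want to recompute numbers of this kind on their own predictions, here is a minimal sketch of the metric families in the table. The definitions are assumptions (for example, that an `abstain` prediction on a positive row counts as a false negative, and that token F1 uses lowercased whitespace tokens); the evaluation script in the repository may define them differently.

```python
from collections import Counter


def classification_metrics(gold, pred):
    """Compute answer-level metrics from parallel lists of yes/no/abstain labels."""
    n = len(gold)
    exact = sum(g == p for g, p in zip(gold, pred)) / n
    abstain_rate = sum(p == "abstain" for p in pred) / n

    # "yes" is treated as the positive class; anything else is non-positive.
    tp = sum(g == "yes" and p == "yes" for g, p in zip(gold, pred))
    fp = sum(g != "yes" and p == "yes" for g, p in zip(gold, pred))
    fn = sum(g == "yes" and p != "yes" for g, p in zip(gold, pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_match": exact,
        "abstain_rate": abstain_rate,
        "positive_precision": precision,
        "positive_recall": recall,
        "positive_f1": f1,
    }


def token_f1(gold_span, pred_span):
    """Token-level F1 between two spans (lowercased, whitespace-tokenized)."""
    gold_tokens = gold_span.lower().split()
    pred_tokens = pred_span.lower().split()
    if not gold_tokens or not pred_tokens:
        # Two empty spans agree; one empty span against a non-empty one does not.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

As a consistency check on the table, the reported v2 precision and recall give 2 × 0.785 × 0.950 / (0.785 + 0.950) ≈ 0.860, which matches the reported positive_f1.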
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "Ventali/llama31-8b-ade-sft-v2"

# 4-bit NF4 quantization, matching the training setup.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "system", "content": "You are a careful biomedical assistant. For each case, return a compact JSON answer grounded in the provided evidence. If the evidence is insufficient, abstain."},
    {"role": "user", "content": "Case: The patient developed diffuse urticaria three days after starting amoxicillin.\n\nIs this consistent with a possible adverse drug event? Identify the drug and event if so, or abstain if the evidence is insufficient."},
]

# Render the chat template to text, then tokenize it for generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; gradients are not needed at inference time.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
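To run the snippet over many cases, the prompt construction can be factored into a small helper. This is a sketch under the assumption that the system prompt and user template shown above are the ones the adapter expects; the name `build_messages` is hypothetical and not part of the released code.

```python
SYSTEM_PROMPT = (
    "You are a careful biomedical assistant. For each case, return a compact "
    "JSON answer grounded in the provided evidence. If the evidence is "
    "insufficient, abstain."
)

QUESTION = (
    "Is this consistent with a possible adverse drug event? Identify the drug "
    "and event if so, or abstain if the evidence is insufficient."
)


def build_messages(case_text):
    """Wrap one clinical sentence in the chat format used in the usage example."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Case: {case_text.strip()}\n\n{QUESTION}"},
    ]
```

The returned list can be passed directly to `tokenizer.apply_chat_template` in place of the hand-written `messages` list.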
For Apple Silicon you can fuse the adapter into the base and run via mlx-lm:
pip install mlx-lm
mlx_lm.fuse --model meta-llama/Llama-3.1-8B-Instruct \
    --adapter-path <local-adapter-dir> \
    --save-path ~/models/llama31-ade-mlx
mlx_lm.generate --model ~/models/llama31-ade-mlx --prompt "..."
Training
- Base: `meta-llama/Llama-3.1-8B-Instruct`, loaded in 4-bit (NF4, double-quant, bf16 compute).
- LoRA: r=32, alpha=64, dropout=0.05, target modules {q,k,v,o,gate,up,down}_proj. 41.9M trainable params (0.52% of base).
- Data: 2,999 (prompt, teacher JSON) pairs. Prompts drawn from `ade_corpus_v2` as 1,200 positive (from `drug_ade_relation`) + 1,200 easy-negative + 600 hard-negative (classification label=0 rows whose text mentions a drug from the positive-split vocabulary). Teacher: Vertex AI managed `llama-3.3-70b-instruct-maas` (temperature 0.2), seeded with 3 yes/no/abstain few-shots and prompted to reserve abstention for cases with no plausible drug or no plausible event.
- Filter: required non-empty `answer` and `evidence`, `confidence` ≥ 0.65, evidence-source word overlap ≥ 0.6. 2,999/3,000 retained.
- Optimizer: AdamW, lr=2e-4, warmup_ratio=0.03, weight_decay=0.01, bf16, gradient_checkpointing on.
- 3 epochs with `load_best_model_at_end=True` on `eval_loss`; the epoch-1 checkpoint (eval_loss 0.506) was restored, discarding the overfit epochs 2–3 (0.547, 0.676).
- Hardware: single A100 40GB on GCP `a2-highgpu-1g`. Training wall time ~94 min.
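The filter step in the list above can be sketched as follows. The thresholds come from this card, but the overlap definition (fraction of evidence words that also appear in the source sentence, after lowercasing) and the names `word_overlap` and `keep_example` are assumptions; the repository's filter may tokenize or score differently.

```python
import re


def word_overlap(evidence, source):
    """Fraction of evidence words that also occur in the source sentence."""
    evidence_words = re.findall(r"\w+", evidence.lower())
    source_words = set(re.findall(r"\w+", source.lower()))
    if not evidence_words:
        return 0.0
    return sum(word in source_words for word in evidence_words) / len(evidence_words)


def keep_example(source, label, min_conf=0.65, min_overlap=0.6):
    """Return True if a teacher label passes the quality filter.

    `label` is the teacher's parsed JSON dict; `source` is the input sentence.
    """
    # Require non-empty answer and evidence fields.
    if not label.get("answer") or not label.get("evidence"):
        return False
    # Require a numeric confidence at or above the threshold.
    conf = label.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        return False
    if conf < min_conf:
        return False
    # Require the evidence to be grounded in the source text.
    return word_overlap(label["evidence"], source) >= min_overlap
```

Checking evidence against the source is a cheap guard against teacher hallucination: a label whose quoted evidence shares few words with the input sentence is unlikely to be grounded in it.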
Limitations
- Trained on single-sentence, literature-style clinical text. Longer narratives (discharge summaries, EHR free-text) are out of distribution and will likely perform worse.
- Teacher labels are synthetic. A clinician-reviewed eval set was not used; regressions against human judgment have not been measured.
- The model occasionally produces an empty `drug` or `event` field on positive cases, which is a regression from v1 on drug-span extraction. See the tradeoff note above.
- English only.
Reproducibility
Full pipeline (seed building, teacher generation config, filter, SFT prep, training, evaluation) lives at https://github.com/ventali/medical-distill. Commit 547629f records this adapter's metrics.
License
Inherits the Llama 3.1 Community License from the base model.