base_model: Qwen/Qwen3-4B-Thinking-2507
tags:
  - uncensored
  - gabliteration
datasets:
  - mlabonne/harmless_alpaca
  - mlabonne/harmful_behaviors
library_name: gabliteration
arxiv: '2512.18901'
model-index:
  - name: Qwen_Qwen3-4B-Thinking-2507-gabliterated
    results:
      - task:
          type: text-generation
        dataset:
          type: harmless_alpaca
          name: Harmless Alpaca
        metrics:
          - name: KL Divergence
            type: pass@1
            value: 0.114
      - task:
          type: text-generation
        dataset:
          type: harmful_behaviors
          name: Harmful Behaviors
        metrics:
          - name: Refusal Rate
            type: pass@1
            value: 0.02

Gabliterated Model Series

Overview

With this model series, I introduce Gabliteration, a novel neural weight modification technique that goes beyond traditional abliteration methods by using adaptive multi-directional projections with regularized layer selection. Gabliteration addresses a fundamental limitation of existing abliteration methods: they compromise model quality while attempting to modify specific behavioral patterns.

Refusal: 2/100
KL Div: 0.1140
Config:
    Samples: 400
    Skip: [4, 3]
    Layer: 0.50 (selected: 18)
    Scale: 0.56
    λ: 0.10
    k: 1
    β: 0.56
    Adaptive: True
    τ: 0.73

UGI Leaderboard benchmarks:

Gabliterated:

#P: 4
UGI: 32.25
W/10: 9.5
Writing: 11.3
NatInt: 16.67
Political lean: -26.0%

Base:

#P: 4
UGI: 20.78
W/10: 2.8
Writing: 10.87
NatInt: 16.16
Political lean: -26.1%

The Gabliterated version is the world's first 4B model with a W/10 score of 9.5, demonstrating the effectiveness of Gabliteration.

Model Variants

This series includes models ranging from 0.6B to 32B parameters, demonstrating the scalability and effectiveness of the Gabliteration technique across different model sizes.

Quants

Ollama

ollama pull Gabliterated-Qwen3:latest
ollama pull Gabliterated-Qwen3:4b-thinking
ollama pull Gabliterated-Qwen3:4b-thinking-q4_k_m
ollama pull Gabliterated-Qwen3:4b-thinking-q5_k_m
ollama pull Gabliterated-Qwen3:4b-thinking-q6_k
ollama pull Gabliterated-Qwen3:4b-thinking-q8_0

Technical Background

Building on the foundational work of Arditi et al. (2024) on single-direction abliteration, Gabliteration extends the approach to a comprehensive multi-directional framework with theoretical guarantees. My method applies singular value decomposition to difference matrices between harmful and harmless prompt representations to extract multiple refusal directions.
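As a rough illustration of the idea described above, the following NumPy sketch extracts refusal directions from a difference matrix via SVD and removes a scaled share of them from a weight matrix. It is not the Gabliteration implementation. The function names, the row-wise pairing of harmful and harmless representations, the ridge-regularized projector, and the way `scale` and `lam` enter the update are all assumptions made for this example. The default values simply reuse k, Scale, and λ from the config listed earlier, and the data is synthetic.

```python
import numpy as np


def refusal_directions(harmful, harmless, k=1):
    """Return the top-k right singular vectors of the difference matrix.

    Assumption: rows of `harmful` and `harmless` are paired, so the
    difference matrix has shape (n_samples, d_model).
    """
    diff = harmful - harmless
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k].T  # shape (d_model, k), orthonormal columns


def regularized_projector(directions, lam=0.10):
    """Ridge-regularized projector R (R^T R + lam * I)^-1 R^T (assumed form)."""
    k = directions.shape[1]
    gram = directions.T @ directions + lam * np.eye(k)
    return directions @ np.linalg.solve(gram, directions.T)


def ablate(weight, directions, scale=0.56, lam=0.10):
    """Subtract a scaled projection onto the refusal subspace from `weight`."""
    return weight - scale * regularized_projector(directions, lam) @ weight


# Synthetic demo: plant a known direction and check that SVD recovers it.
rng = np.random.default_rng(0)
d_model, n_samples = 16, 400
true_dir = np.zeros(d_model)
true_dir[0] = 1.0

harmless = rng.normal(size=(n_samples, d_model))
noise = 0.1 * rng.normal(size=(n_samples, d_model))
harmful = harmless + 3.0 * true_dir + noise

dirs = refusal_directions(harmful, harmless, k=1)
alignment = abs(float(dirs[:, 0] @ true_dir))  # close to 1 if recovered

weight = rng.normal(size=(d_model, d_model))
new_weight = ablate(weight, dirs)
```

With orthonormal directions, this particular update shrinks the component of the weight output along each direction by a factor of 1 - scale / (1 + lam), so the direction is damped rather than removed outright.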

Dynamic Layer Selection

This model was created using fixed rather than dynamic layer selection: the layer fraction (0.50) was chosen through empirical tuning.

Selected layer: 18 (out of 36 total layers)
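The relationship between the layer fraction and the selected layer can be sketched as follows. The helper name `select_layer` and the truncation rule are assumptions for this example. The model card only states the fraction (0.50), the layer count (36), and the result (18), and does not say whether layers are indexed from 0 or 1.

```python
def select_layer(layer_fraction: float, num_layers: int) -> int:
    """Map a fraction of model depth to a layer number (assumed rule: truncate)."""
    return int(layer_fraction * num_layers)


selected = select_layer(0.50, 36)  # 18, matching the selected layer above
```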

Citation

If you use these models, please cite the original research:

Gülmez, G. (2025). Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models. https://arxiv.org/abs/2512.18901

Acknowledgments

This work builds upon the foundational research by Arditi et al. (2024) on refusal direction identification in large language models.

Bias, Risks, and Limitations

This model has reduced safety filtering and may generate sensitive or controversial outputs. Use responsibly and at your own risk.