Atom Classifier

A multilingual token classifier for semantic hypergraph parsing. It classifies each token in a sentence into one of 39 semantic atom types and subtypes, serving as the first (alpha) stage of the Alpha-Beta semantic hypergraph parser.

Model Details

  • Architecture: DistilBertForTokenClassification
  • Base model: distilbert-base-multilingual-cased
  • Labels: 39 semantic atom types
  • Max sequence length: 512

Label Taxonomy

Atoms are typed according to the Semantic Hypergraph (SH) notation system. The 7 main types and their subtypes are listed below:

Concepts (C)

Label Description
C Generic concept
Cc Common noun
Cp Proper noun
Ca Adjective (as concept)
Ci Pronoun
Cd Determiner (as concept)
Cm Nominal modifier
Cw Interrogative word
C# Number

Predicates (P)

Label Description
P Generic predicate
Pd Declarative predicate
P! Imperative predicate

Modifiers (M)

Label Description
M Generic modifier
Ma Adjective modifier
Mc Conceptual modifier
Md Determiner modifier
Me Adverbial modifier
Mi Infinitive particle
Mj Conjunctional modifier
Ml Particle
Mm Modal (auxiliary verb)
Mn Negation
Mp Possessive modifier
Ms Superlative modifier
Mt Prepositional modifier
Mv Verbal modifier
Mw Specifier
M# Number modifier
M= Comparative modifier
M^ Degree modifier

Builders (B)

Label Description
B Generic builder
Bp Possessive builder
Br Relational builder (preposition)

Triggers (T)

Label Description
T Generic trigger
Tt Temporal trigger
Tv Verbal trigger

Conjunctions (J)

Label Description
J Generic conjunction
Jr Relational conjunction

Special

Label Description
X Excluded token (punctuation, etc.)
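
To make the taxonomy concrete, here is a hand-assigned labeling of a short sentence as a Python sketch. The labeling is illustrative, not actual model output; the label inventory itself matches the tables above:

```python
# Illustrative (hand-assigned) atom labels for a simple sentence;
# actual model predictions may differ.
example = {
    "Berlin": "Cp",   # proper noun
    "is": "Pd",       # declarative predicate
    "the": "Md",      # determiner modifier
    "capital": "Cc",  # common noun
    "of": "Br",       # relational builder (preposition)
    "Germany": "Cp",  # proper noun
    ".": "X",         # excluded token
}

# The full 39-label inventory from the tables above.
LABELS = {
    "C", "Cc", "Cp", "Ca", "Ci", "Cd", "Cm", "Cw", "C#",
    "P", "Pd", "P!",
    "M", "Ma", "Mc", "Md", "Me", "Mi", "Mj", "Ml", "Mm",
    "Mn", "Mp", "Ms", "Mt", "Mv", "Mw", "M#", "M=", "M^",
    "B", "Bp", "Br",
    "T", "Tt", "Tv",
    "J", "Jr",
    "X",
}

assert len(LABELS) == 39
assert all(label in LABELS for label in example.values())
```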

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")
model.eval()

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**encoded)

predictions = outputs.logits.argmax(-1)[0].tolist()
word_ids = encoded.word_ids(0)

# Group sub-token offsets by word; label each word by its first sub-token,
# so words split into several sub-tokens are printed once, not repeatedly.
words = {}
for idx, word_id in enumerate(word_ids):
    if word_id is None:
        continue
    start, end = offset_mapping[0][idx].tolist()
    if word_id not in words:
        words[word_id] = [start, end, model.config.id2label[predictions[idx]]]
    else:
        words[word_id][1] = end

for start, end, label in words.values():
    print(f"{sentence[start:end]:15s} -> {label}")

Intended Use

This model is designed to be used as the first stage of the Alpha-Beta semantic hypergraph parser (hyperbase-parser-ab). It assigns atom types to tokens, which are then combined into nested hypergraph structures by a rule-based grammar in the beta stage.
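
The alpha-to-beta handoff can be sketched as follows. This assumes the alpha stage yields (token, label) pairs; the nested hyperedge shown in the final comment is a plausible target of the rule-based beta stage, written by hand for illustration, not the output of any actual parser call:

```python
# Hypothetical alpha-stage output for "Berlin is the capital of Germany."
# (hand-assigned labels; real predictions may differ).
alpha_output = [
    ("Berlin", "Cp"), ("is", "Pd"), ("the", "Md"),
    ("capital", "Cc"), ("of", "Br"), ("Germany", "Cp"),
]

# Atoms in SH notation take the form token/label.
atoms = [f"{token.lower()}/{label}" for token, label in alpha_output]
print(atoms)
# A nested hyperedge the beta stage could plausibly build from these atoms:
#   (is/Pd berlin/Cp (the/Md (of/Br capital/Cc germany/Cp)))
```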
