# SetFit with BAAI/bge-small-en-v1.5
This is a SetFit model trained on the davanstrien/hf-dataset-domain-labels-v0 dataset that can be used for Text Classification. This SetFit model uses BAAI/bge-small-en-v1.5 as the Sentence Transformer embedding model. A LogisticRegression instance is used for classification.
The model has been trained using an efficient few-shot learning technique that involves:
- Fine-tuning a Sentence Transformer with contrastive learning.
- Training a classification head with features from the fine-tuned Sentence Transformer.
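The contrastive step turns a small labeled set into many training pairs: texts sharing a label become positive pairs (target similarity 1.0), texts with different labels become negative pairs (target 0.0). The sketch below is a simplified illustration of that idea, not SetFit's actual pair sampler; the helper name and example texts are invented for demonstration.

```python
from itertools import combinations

def make_contrastive_pairs(examples):
    """Build (text_a, text_b, target) pairs from labeled texts:
    same-label pairs get target 1.0, different-label pairs 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

# Hypothetical few-shot examples in the style of this dataset's labels
examples = [
    ("BLAST sequence alignment guide", "biology"),
    ("CRISPR knockout protocols", "biology"),
    ("Uniswap V3 swap event logs", "finance"),
]
pairs = make_contrastive_pairs(examples)
```

The Sentence Transformer is then fine-tuned on these pairs so that embeddings of same-label texts move closer together, which is what makes the downstream LogisticRegression head effective with few examples.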
## Model Details

### Model Description

### Model Sources

### Model Labels
| Label | Examples |
|:------|:---------|
| biology | <ul><li>'# Blind Spots of google/gemma-3-1b-pt\n\n## Model Tested\nModel: google/gemma-3-1b-pt \nParameters: 1B \nType: Pre-trained base language model (not instruction-tuned) \nTested by: Toka-Tarek</li></ul> |
| chemistry | <ul><li>'# Concrete Compressive Strength Testing Dataset for West Africa\n\n## Abstract\n\nThis dataset provides synthetic but research-grounded data on concrete compressive strength testing across West Africa, with particular focus on Nigerian construction practices. The data encompasses laboratory testing parameters, mix design variables, and compressive strength outcomes based on ASTM C39 standards and published empirical studies from the region. Each record contains detailed information on mix proportions, water-cement ratios, curing conditions, aggregate sources, and corresponding compressive strength values at various test ages.\n\nKeywords: concrete compressive strength, ASTM C39, mix design, water-cement ratio, West Africa, Nigeria, structural concrete\n\n---\n\n## 1. Introduction\n\n### 1.1 Background\n\nConcrete remains the most widely used construction material in West Africa, with compressive strength serving as the primary indicator of structural quality and performance. Understanding the relationships between mix parameters and resulting strength is essential for quality control, structural design, and construction practice optimization in the region.\n\n### 1.2 Problem Statement\n\nTraditional concrete mix design in Nigeria and neighboring West African countries relies heavily on empirical practices and local material sources. However, significant variability exists in:\n- Cement brands and their performance characteristics\n- Aggregate sources and quality\n- Curing conditions (especially in tropical climates)\n- Testing laboratory standards and practices\n\n### 1.3 Research Objectives\n\nThis dataset aims to provide:\n1. Comprehensive data on concrete compressive strength testing parameters\n2. Realistic statistical distributions based on Nigerian and West African empirical studies\n3. Variable combinations reflecting actual construction site conditions\n4. Quality control benchmarks aligned with international standards\n\n---\n\n## 2. Methodology\n\n### 2.1 Data Generation Framework\n\nThe synthetic data generation follows a DAG-based (Directed Acyclic Graph) sampling approach, where parent variables are sampled before dependent child variables. This ensures realistic correlations between parameters.\n\n### 2.2 Parameter Evidence Table\n\n</li></ul> |
| climate | <ul><li>'\n\n# The ClimateCheck Dataset\n\nThis dataset is used for the ClimateCheck: Scientific Fact-checking of Social Media Posts on Climate Change Shared Task.\nThe 2025 iteration was hosted at the Scholarly Document Processing workshop at ACL 2025, and a new 2026 iteration will be hosted at the Natural Scientific Language Processing workshop at LREC 2026.\n\n## 2026 Update\n\nFor running the next iteration of the task, we added manually labelled training data, resulting in 3023 claim-abstract pairs overall. The claims used for testing are unchanged. \n\n## Dataset Development Process\n\nClaims\n\nThe claims used for this dataset were gathered from the following existing resources: ClimaConvo, DEBAGREEMENT, Climate-Fever, MultiFC, and ClimateFeedback. \nSome of which are extracted from social media (Twitter/X and Reddit) and some were created synthetically from news and media outlets using text style transfer techniques to resemble tweets. \nAll claims underwent a process of scientific check-worthiness detection and are formed as atomic claims (i.e. containing only one core claim).\n\nPublications Corpus\n\nTo retrieve relevant abstracts, a corpus of publications was gathered from OpenAlex and S2ORC, containing 394,269 abstracts. It can be accessed here: https://huggingface.co/datasets/rabuahmad/climatecheck_publications_corpus\n\n**Annotation Processes\n\nThe training and testing data for claim verification were annotated by five graduate students in the Climate and Environmental Sciences. \nUsing a TREC-like pooling approach, we retrieved the top 20 abstracts for each claim using BM25 followed by a neural cross-encoder trained on the MSMARCO data. \nThen we used 6 state-of-the-art models to classify claim-abstract pairs. If a pair resulted in at least 3 evidentiary predictions, it was added to the annotation corpus. \nEach claim-abstract pair was annotated by two students and resolved by a curator in cases of disagreements.\n\nThe training and testing data for narrative classification were annotated by four graduate students, all of whom annotated every unique claim in the dataset. \nThe final labels were chosen using a majority vote approach. When there was no majority, two curators annotated and discussed the final label choice. \n\nTraining Data**\n\nThe training data contains the following:\n\n- claim: a string value of a claim about climate change.\n- abstract: a string value of an abstract relating to the claim.\n- abstract_id: the ID of the connected abstract, which corresponds to the publications corpus (see above) and can be used to retrieve more metadata about the abstract.\n- annotation: a label of 'Supports', 'Refutes', or 'Not Enough Information', describing the relation of the connected abstract to the claim.\n- data_version: 2025 if the claim was released during the 1st iteration of the task and 2026 if it was added in the 2nd iteration.\n- narrative: a label according to the CARDS taxonomy denoting whether the claim is an example of a known climate disinformation narrative. Only the first two levels of the taxonomy were used. \n \nThe training data consists of 3023 instances with 782 unique claims. Each claim is connected to at least 1 and at most 5 abstracts. \n\nThe distribution of the labels for claim verification is as follows:\n\n</li></ul> |
| code | <ul><li>'# Dataset Card for python_code_instructions_18k_alpaca\n\nThe dataset contains problem descriptions and code in python language.\nThis dataset is taken from sahil2801/code_instructions_120k, which adds a prompt column in alpaca style. Refer to the source here.'</li><li>'# SERA — Consolidated & Rectified\n\n211,360 multi-turn SWE-agent coding trajectories from the SERA (Soft-Verified Efficient Repository Agents) project, consolidated from 4 source datasets into a single file with strict reasoning + tool-call format and validated FSM transitions.\n\n## Origin\n\nDerived from Allen AI's Open Coding Agents release:\n\n</li></ul> |
| cybersecurity | <ul><li>'# Bug Bounty & Pentest Methodologies\n\nMethodologies (OWASP, PTES), checklists by application type, attack techniques, platforms, report templates, and tools.\n\n## Links\n- English version\n- AYI NEDJIMI Consultants'</li><li>'# 🛡️ Security Reasoning Dataset for Prompt injection and PII (sensitive data) detection\n\nThis dataset contains 2,139 high-quality synthetic examples designed for training lightweight security models—specifically targeting the firewall-gemma-3-4b-it architecture—using the Distilling Step-by-Step methodology. \n\n## 📊 Dataset Analytics & Distribution\n\nThe dataset is engineered to handle real-world enterprise edge cases, specifically the "needle-in-a-haystack" problem where malicious payloads or sensitive data are buried deep within massive contexts.\n\n* Total Samples: 2,139 \n* Training Set: 2,000 examples\n* Test Set: 139 examples (Note: test sets can be adjusted to a strict 100 split based on the files used)\n* Length Distribution: Ranges from short 10-word direct triggers to complex payloads exceeding 1,500 characters. \n* Format: Multi-turn conversational formats, raw document text, and code blocks.\n\n### Category Breakdown & Domain Coverage\nThe dataset spans 50+ technical and business domains to ensure the firewall remains highly accurate across different enterprise environments.\n\n</li></ul> |
| finance | <ul><li>'# AMM-Events: A Multi-Protocol DeFi Event Dataset\n\n## Dataset Description\n\nAMM-Events is a high-fidelity, block-level dataset capturing 8.9 million on-chain events from the Ethereum mainnet, specifically designed for event-aware forecasting and market microstructure analysis in Decentralized Finance (DeFi). \n\nUnlike traditional financial datasets based on Limit Order Books (LOB), this dataset focuses on Automated Market Makers (AMMs), where price dynamics are triggered exclusively by discrete on-chain events (e.g., swaps, mints, burns) rather than continuous off-chain information.\n\n- Paper Title: Towards Event-Aware Forecasting in DeFi: Insights from On-chain Automated Market Maker Protocols\n- Total Events: 8,917,353\n- Time Span: Jan 1, 2024 – Sep 16, 2025\n- Block Range: 18,908,896 – 23,374,292\n- Protocols: Uniswap V3, Aave, Morpho, Pendle\n- Granularity: Block-level timestamps & transaction-level event types\n\n### Supported Tasks\n- Event Forecasting: Predicting the next event type (classification/TPP) and time-to-next-event (regression/TPP).\n- Market Microstructure Analysis: Analyzing causal synchronization between liquidity events and price shocks.\n- Anomaly Detection: Identifying "Black Swan" traffic surges or congestion events.\n\n---\n\n## Dataset Structure\n\nThe data is organized into a standardized JSON format. Each entry decouples complex smart contract logic into interpretable metrics.\n\n### Data Fields\n\n- block_number (int): The Ethereum block height where the event occurred.\n- timestamp (int): Unix timestamp of the block.\n- transaction_hash (string): Unique identifier for the transaction.\n- protocol (string): Origin protocol (Uniswap V3, Aave, Morpho, or Pendle).\n- event_type (string): The category of the event (Swap, Mint, Burn, UpdateImpliedRate, etc.).\n- payload (dict): Protocol-specific metrics (e.g., amount0, amount1, liquidity, tick for Uniswap).\n\n### Data Splits\n\nThe dataset covers 359 liquidity pools selected for high activity and representativeness:\n- Pendle: 296 pools (Yield Trading)\n- Aave: 53 pools (Lending)\n- Uniswap V3: 5 pools (Spot Trading)\n- Morpho: 5 pools (Lending Optimization)\n\n---\n\n## Usage\n\n### Loading the Data\nYou can load this dataset directly using the Hugging Face datasets library:\n\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset("Jackson668/AMM-Events")\n\n# Example: Accessing the first train example\nprint(dataset['train'])'</li><li>'Dataset: Mimi1782/kaggle\n\n'</li><li>'Dataset: Menoviar28/menov131\n\n'</li></ul> |
| legal | <ul><li>'# UAE Laws Q&A Dataset (IRAC Format)\n\nA high-quality dataset of 9,477 question-answer pairs about UAE laws, formatted in IRAC (Issue, Rule, Application, Conclusion) legal reasoning structure.\n\n## Dataset Creation\n\n### Source Documents\nThe dataset was built from a comprehensive collection of UAE legal documents, including:\n- Federal Decrees and Laws\n- Cabinet Resolutions\n- Ministerial Decisions\n- Civil and Commercial Codes\n- Labor Law\n- Traffic Law\n- And more\n\n### Creation Process\n\n1. Human-Written Seed Data: Manually crafted ~1,000 high-quality question-answer pairs directly from the source legal documents, ensuring accuracy and proper legal reasoning.\n\n2. Synthetic Generation: The remaining samples were synthetically generated using the source documents as context, following the patterns established by the human-written examples.\n\n3. Human Feedback & Validation: All synthetically generated samples underwent human review for:\n - Legal accuracy and correctness\n - Proper citation of articles and laws\n - Clarity and completeness of explanations\n - Adherence to IRAC format\n\n4. IRAC Formatting: All responses were standardized to follow the IRAC legal reasoning format for consistency and educational value.\n\n## Dataset Description\n\nAll responses follow the IRAC format:\n- Issue: Identifies the legal question or problem\n- Rule: States the relevant law, article, or regulation\n- Application: Explains how the rule applies to the situation\n- Conclusion: Provides the final answer\n\n## Dataset Statistics\n\n</li></ul> |
| math | <ul><li>'# FineProofs SFT\n\n## Dataset Description\n\nFineProofs SFT is a high-quality supervised fine-tuning dataset containing mathematical Olympiad problems paired with chain-of-thought reasoning and formal proofs distilled from DeepSeek-Math-V2. The dataset comprises 7,777 samples (4,300 unique problems) sourced from international Olympiad competitions and Art of Problem Solving (AoPS), each annotated with:\n\n- Detailed reasoning traces (thinking content) generated by deepseek-ai/DeepSeek-Math-V2\n- Formal mathematical proofs generated by deepseek-ai/DeepSeek-Math-V2\n- Expert grades from Gemini-3-Pro (0-7 point scale)\n- Model-based difficulty/reward scores from Qwen/Qwen3-4B-Thinking-2507\n- Problem categorization and competition metadata\n\nThis dataset is specifically designed for training reasoning models to generate proofs for competition-level mathematics.\n\n### Key Features\n\n- Chain-of-Thought Reasoning: Each solution includes explicit reasoning traces enclosed in `<think>` tags, distilled from deepseek-ai/DeepSeek-Math-V2\n- Formal Proofs: LaTeX-formatted mathematical proofs\n- Dual Quality Signals:\n - Expert grades from Gemini-3-Pro (continuous 0-7 scale)\n - Difficulty estimates from Qwen/Qwen3-4B-Thinking-2507 reward@128 scored by openai/gpt-oss-20b\n- Unfiltered Teacher Distillation: Includes both correct and incomplete solutions from teacher model to maximize learning signal for smaller models\n- Diverse Sources: Problems from IMO, APMO, USA(J)MO, and other prestigious competitions\n- Rich Metadata: Includes problem type categorization, competition information, and source attribution\n- Benchmark Decontaminated: No overlap with IMO-ProofBench or ProofBench evaluation sets\n\n## Dataset Structure\n\n### Data Fields\n\n</li></ul> |
| medical | <ul><li>"# Dataset Card for Dataset Name\n\n\n\nThis medium sized dataset 20K samples has been created with AOKVQA Train & Val split, Path-VQA Train & Val Split, TDIUC Val Split (Quantitative and Physical Reasoning Questions only). This is a multidomain dataset solely created to test the multidomain knowledge of VLM's, it can be used for inference or rapid prototyping. This is for educational and research purposes only. All the copyright belongs to the original owners of the datasets."</li><li>'# Medication-Specific QA Benchmark\n\n## Dataset Details\n\nTo construct a controlled evaluation benchmark, we selected 25 widely prescribed medications in Brazil across four therapeutic categories: antibiotics, analgesics and anti-inflammatory agents, antihypertensives, and antidiabetics. These categories were chosen to ensure clinical diversity across infectious, inflammatory, cardiovascular, and metabolic conditions.\n\nFor each selected medication, we verified the presence of its corresponding leaflet in the cleaned corpus. We then identified 10 standardized sections consistently present across documents, including indications, dosage, contraindications, warnings, adverse reactions, and drug interactions.\n\nQuestion generation was performed using a large language model conditioned on the medication name and structured leaflet content. To ensure balanced coverage and section-level control, we generated:\n\n- 4 questions per medication per section;\n- 40 questions per section across medications;\n- 400 total open-ended questions.\n\nThis benchmark, referred to as the Medication-Specific QA Benchmark, is explicitly designed to measure factual recall, section-level grounding, and evidence attribution when answers are directly localized within regulatory documents.\n\n## Citation\n\nThis work was accepted at The First Workshop on Language Technologies for Health (Lang4Health), a workshop dedicated to the development and application of Natural Language Processing (NLP) technologies in the healthcare field.'</li><li>'# What this repo does\n\nThis repository provides a Clarus v0.8 clinical five-node dataset for detecting and reasoning about multi-organ failure cascade boundary transitions.\n\nThe dataset models situations where a patient state is no longer contained within a single metabolic-organ failure basin but is shifting between competing regimes such as:\n\n- metabolic overload with early organ strain\n- renal-hepatic failure transition\n- perfusion-linked organ cascade\n- refractory multi-organ failure\n\nThis is the conceptual upgrade introduced in Clarus v0.8.\n\nEarlier ladder versions detect instability, forecast deterioration, estimate collapse boundaries, model recovery geometry, and reason about intervention.\n\nv0.8 introduces regime transition geometry.\n\nThe system now measures not only distance to the nearest failure boundary but also distance to the nearest competing regime boundary.\n\nThis allows models to detect:\n\n- regime switching\n- competing failure modes\n- unstable regime identity\n- transition-aware intervention reasoning\n\n# Core five-node cascade\n\nThe five core variables in this dataset are:\n\n- metabolic_stress\n- physiologic_buffer\n- response_lag\n- organ_coupling\n- perfusion_stability\n\nOperational definitions:\n\nmetabolic_stress\n\nTotal metabolic burden imposed by acidosis, catabolism, electrolyte instability, and impaired substrate handling.\n\nphysiologic_buffer\n\nRemaining reserve available to absorb metabolic insult without organ-level cascade.\n\nresponse_lag\n\nDelay in correcting metabolic derangement or restoring systemic stability.\n\norgan_coupling\n\nDegree to which metabolic instability is propagating into coordinated failure across organs.\n\nperfusion_stability\n\nCurrent ability to maintain pressure-flow coherence across tissues despite worsening metabolic and organ stress.\n\n# Clinical variable mapping\n\n</li></ul> |
## Evaluation

### Metrics
| Label | Accuracy |
|:------|:---------|
| all | 0.8333 |
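Accuracy here is simply the fraction of evaluation examples whose predicted label matches the reference label. A minimal sketch (the prediction and reference lists below are invented for illustration, not taken from the actual evaluation split):

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical predictions vs. gold labels: 5 of 6 match.
preds = ["code", "medical", "legal", "code", "math", "finance"]
refs = ["code", "medical", "legal", "code", "math", "climate"]
print(round(accuracy(preds, refs), 4))  # 0.8333
```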
## Uses

### Direct Use for Inference
First install the SetFit library:
```bash
pip install setfit
```
Then you can load this model and run inference.
```python
from setfit import SetFitModel

# Download the model from the Hugging Face Hub
model = SetFitModel.from_pretrained("davanstrien/setfit-hf-dataset-domain-v0")
# Run inference on a dataset card
preds = model("""# Dataset Card for The Wilds Bioacoustics Monitors
This dataset contains passive acoustic recordings collected at [The Wilds safari park](https://www.thewilds.org/) in Ohio during Summer 2025.
Recorders captured ambient soundscapes to support ecological monitoring, animal behavior analysis, and acoustic biodiversity modeling.
## Dataset Details
### Dataset Description
- **Curated by:** Tanishka Wani, Vedant Patil, Rugved Katole, Bharath Pillai, Anirudh Potlapally, Ally Bonney, and Jenna Kline
- **Repository:** [https://github.com/Imageomics/naturelab](https://github.com/Imageomics/naturelab)
- **Paper:** [SmartWilds: Multimodal Wildlife Monitoring Dataset](https://arxiv.org/abs/2509.18894)
This dataset was created to support multimodal wildlife monitoring research using passive acoustic monitoring. Bioacoustic data were collected using Wildlife Acoustics Song Meter devices deployed across four field sites at The Wilds. The recordings capture natural soundscapes including wildlife vocalizations, environmental sounds, and ambient audio that can be used for species detection, behavioral analysis, and biodiversity assessment.
### Supported Tasks and Leaderboards
- **Audio Classification:** Species identification from acoustic recordings
- **Sound Event Detection:** Detection and localization of animal vocalizations
- **Biodiversity Assessment:** Acoustic diversity indices and community analysis
- **Behavioral Analysis:** Temporal activity patterns and acoustic behavior studies
- **Soundscape Ecology:** Environmental audio analysis and habitat characterization
[No benchmarks currently available]
## Dataset Structure
The dataset is organized hierarchically by site and deployment session:
/dataset/
    bioacoustic.txt
    The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv
    The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv
    TW05-SM01/
        metadata.md
        SD01_20250630_20250703/
            SM001_20250630_195900.wav
            SM001_20250630_200402.wav
            SM001_20250630_200902.wav
            ...
            SM001_20250703_064902.wav
            SM001_20250703_065402.wav
            SM001_20250703_065902.wav
    TW06-SM03/
        metadata.md
        SD03_20250630_20250703/
            SM03_20250630_140000.wav
            SM03_20250630_150000.wav
            SM03_20250630_160000.wav
            SM03_20250630_170000.wav
            ...
            SM03_20250703_140000.wav
            SM03_20250703_150000.wav
            SM03_20250703_160000.wav
    TW07-SM02/
        metadata.md
        SD02_20250630_20250703/
            SM002_20250630_195900.wav
            SM002_20250630_205902.wav
            SM002_20250701_050300.wav
            ...
            SM002_20250702_205902.wav
            SM002_20250703_050400.wav
            SM002_20250703_060402.wav
    TW08-SM04/
        metadata.md
        SD04_20250630_20250703/
            SM04_20250630_120000.wav
            SM04_20250630_130000.wav
            SM04_20250630_140000.wav
            ...
            SM04_20250703_150000.wav
            SM04_20250703_160000.wav
            SM04_20250703_170000.wav
### Data Instances
Each bioacoustic deployment folder contains:
- **Audio files:** .wav format recordings captured by scheduled recording
- **Metadata file:** `metadata.md` with deployment information and recorder settings
**File Counts by Recorder:**
- **TW05-SM01:** 144 audio files (.wav recordings)
- **TW06-SM03:** 75 audio files (.wav recordings)
- **TW07-SM02:** 12 audio files (.wav recordings)
- **TW08-SM04:** 78 audio files (.wav recordings)
**Audio File Specifications:**
- **Format:** .wav (uncompressed)
- **Channels:** Mono
- **Bit depth:** 16-bit
- **Sample rate:** 48 kHz
- **Duration:** Variable based on recording schedule
**Filename Conventions:**
- **SM001/SM03/SM04 series:** SM0##_YYYYMMDD_HHMMSS.wav (TW05-SM01, TW06-SM03, TW08-SM04)
- **SM002 series:** SM002_YYYYMMDD_HHMMSS.wav (TW07-SM02)
**Total Dataset Size:** 311 audio files across all bioacoustic monitor deployments.
Each .wav file is a field recording captured according to programmed recording schedules. File names include timestamps indicating the start time of each recording session.
### Data Fields
**metadata.md** (found in each recorder deployment folder):
- **Recorder ID:** Unique device identifier (SM01, SM02, SM03, SM04)
- **Device Model:** Song Meter model name (e.g., Song Meter Micro 2)
- **Device Serial Number:** Manufacturer-assigned serial number
- **Site ID:** Location code where deployed (TW05, TW06, TW07, TW08)
- **Deployment Location Description:** Text description of exact location and surroundings
- **GPS Coordinates:** Latitude and longitude in decimal format
- **Deployment Date and Time:** Recorder deployment timestamp (YYYY-MM-DD HH:MM format)
- **Retrieval Date and Time:** Recorder retrieval timestamp (YYYY-MM-DD HH:MM format)
- **Orientation / Microphone Facing:** Direction and environmental considerations (e.g., "East, away from wind and road")
- **Mounting Height:** Approximate height of microphone from ground in meters
- **Recording Schedule Preset:** Schedule or settings used for recording (e.g., "1 hour at sunrise and sunset")
- **Time Zone Set on Device:** Local time zone configured (e.g., "USA Eastern (UTC-5)")
- **Maintenance Notes:** Issues, configuration changes, or deviations from standard settings
- **Observer:** Name or initials of person completing metadata
**CSV Log Files:**
- `The_Wilds_Bioacoustic_Log2025-06-30_21_54_59.csv`: Deployment log from June 30, 2025
- `The_Wilds_Bioacoustic_Log2025-07-04_20_18_38.csv`: Retrieval log from July 4, 2025
### Data Splits
This dataset has no predefined training/validation/test splits. Data are organized by site (TW05-TW08) and deployment session. Users may create their own splits based on:
- **Temporal splits:** Using recording timestamps across the deployment period
- **Spatial splits:** Using different site locations (TW05, TW06, TW07, TW08)
- **Recorder-based splits:** Using different Song Meter devices (SM01, SM02, SM03, SM04)
Recommended approach depends on modeling goals and research questions.
## Dataset Creation
### Curation Rationale
This dataset supports biodiversity monitoring, behavioral ecology research, and the development of automated species detection and classification models from passive acoustic recordings. Bioacoustic monitoring provides complementary data to camera trap surveys and enables detection of cryptic or nocturnal species that may be missed by visual methods.
### Source Data
#### Data Collection and Processing
Recordings were collected at The Wilds safari park during summer 2025 using Wildlife Acoustics Song Meter devices. Four recorders (SM01-SM04) were strategically deployed at sites TW05-TW08 from June 30 to July 3, 2025.
Devices were programmed for scheduled recordings with different sampling strategies across sites. Recorders were mounted on trees or posts at appropriate heights and orientations to minimize wind noise and maximize acoustic detection. Upon retrieval, audio files were organized by deployment session and basic metadata were recorded. No audio processing, filtering, or annotation was applied to preserve the raw acoustic data.
#### Who are the source data producers?
The dataset was collected and curated by researchers and students from the Imageomics Institute and Ohio State University in collaboration with conservation staff at The Wilds safari park in Ohio.
### Annotations
#### Annotation process
No species identification or acoustic annotations are currently provided with this initial dataset release. Manual and AI-assisted labeling efforts for species detection, vocalization classification, and acoustic event annotation are planned for future versions.
#### Who are the annotators?
N/A - annotations will be added in future releases
### Personal and Sensitive Information
The dataset includes GPS coordinates within The Wilds, a public conservation """)
```
## Training Details

### Training Set Metrics
| Training set | Min | Median | Max |
|:-------------|:----|:---------|:-----|
| Word count | 2 | 400.3986 | 4498 |
| Label | Training Sample Count |
|:--------------|:----|
| biology | 149 |
| chemistry | 89 |
| climate | 135 |
| code | 200 |
| cybersecurity | 200 |
| finance | 200 |
| legal | 200 |
| math | 185 |
| medical | 200 |
### Training Hyperparameters
- batch_size: (32, 32)
- num_epochs: (1, 1)
- max_steps: -1
- sampling_strategy: oversampling
- num_iterations: 5
- body_learning_rate: (2e-05, 1e-05)
- head_learning_rate: 0.01
- loss: CosineSimilarityLoss
- distance_metric: cosine_distance
- margin: 0.25
- end_to_end: False
- use_amp: False
- warmup_proportion: 0.1
- l2_weight: 0.01
- seed: 42
- eval_max_steps: -1
- load_best_model_at_end: False
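These settings can be sanity-checked against the training results. Assuming the classic SetFit pair-generation scheme, where `num_iterations: 5` draws one positive and one negative pair per sample per iteration (2 × 5 pairs per sample), the per-label counts above sum to 1,558 samples, which at batch size 32 gives 487 optimization steps per epoch. That matches the epoch fractions in the results (e.g. step 450 ≈ epoch 0.9240). The pair-count formula is an assumption about SetFit's sampling, not something stated in this card:

```python
import math

# Per-label training sample counts from the table above
label_counts = [149, 89, 135, 200, 200, 200, 200, 185, 200]
num_samples = sum(label_counts)  # 1558 training samples
num_iterations = 5               # from the hyperparameters above
batch_size = 32

# Assumed scheme: 1 positive + 1 negative pair per sample per iteration
num_pairs = num_samples * 2 * num_iterations
steps_per_epoch = math.ceil(num_pairs / batch_size)
print(steps_per_epoch)                  # 487
print(round(450 / steps_per_epoch, 4))  # 0.924
```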
### Training Results
| Epoch | Step | Training Loss | Validation Loss |
|:-------|:-----|:--------------|:----------------|
| 0.0021 | 1 | 0.2723 | - |
| 0.1027 | 50 | 0.2194 | - |
| 0.2053 | 100 | 0.1241 | - |
| 0.3080 | 150 | 0.0837 | - |
| 0.4107 | 200 | 0.0693 | - |
| 0.5133 | 250 | 0.0579 | - |
| 0.6160 | 300 | 0.0501 | - |
| 0.7187 | 350 | 0.0443 | - |
| 0.8214 | 400 | 0.0415 | - |
| 0.9240 | 450 | 0.0394 | - |
### Framework Versions
- Python: 3.12.12
- SetFit: 1.1.3
- Sentence Transformers: 5.3.0
- Transformers: 4.50.3
- PyTorch: 2.11.0+cu130
- Datasets: 4.8.4
- Tokenizers: 0.21.4
## Citation

### BibTeX

```bibtex
@article{https://doi.org/10.48550/arxiv.2209.11055,
  doi = {10.48550/ARXIV.2209.11055},
  url = {https://arxiv.org/abs/2209.11055},
  author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Efficient Few-Shot Learning Without Prompts},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```