clint-dataset
Dataset for labelling queries containing tasks in natural language, with a focus on command-line operations pertaining to popular CLI tools.
These queries were generated by prompting various commercially available LLMs and were pre-annotated using Gemini-2.5-flash-lite. They were then converted to a Label Studio-compatible format, and the annotations were manually revised.
Training
uv run mini_lm.py
uv run inference.py --top-k
Setup
uv sync
Set your Gemini API key:
export GEMINI_API_KEY="your-key"
Optional environment variables:
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
Pre-annotate raw datasets
Raw datasets live in datasets/raw. Each file contains a JSON array of objects with a text field:
[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
Run the pre-annotator:
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
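The --batch-size flag suggests that queries are sent to the model in groups. How main.py does this is not shown here, so the following is only a minimal sketch of one way to split a raw dataset into fixed-size batches. The helper name batched is hypothetical and not part of this repository.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long.

    Hypothetical helper for illustration; not taken from main.py.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 45 made-up queries in the raw dataset shape shown above.
queries = [{"text": f"query {i}"} for i in range(45)]
sizes = [len(batch) for batch in batched(queries, 20)]
print(sizes)  # [20, 20, 5]
```

With a batch size of 20, the last batch simply holds whatever is left over.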
Output format (per item):
{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
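Because the tags come from an LLM, it can be useful to check that every span actually occurs in its text before converting. The sketch below assumes the per-item format shown above. The function name check_tags and the FILE label in the sample are made up for illustration.

```python
def check_tags(item):
    """Return the tags whose span does not appear verbatim in the item's text."""
    text = item["text"]
    return [tag for tag in item.get("tags", []) if tag["span"] not in text]


item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
        # Deliberately wrong span with a hypothetical label, to show a failure.
        {"span": "clip.mov", "label": "FILE"},
    ],
}

bad = check_tags(item)
print(bad)  # [{'span': 'clip.mov', 'label': 'FILE'}]
```

An empty list means every span can be located in the text.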
Raw Gemini responses are logged to logs/gemini_raw.log (override with --raw-log-file or GEMINI_RAW_LOG_FILE).
Convert preannotated → annotated
Convert pre-annotated files to Label Studio–style annotated JSON:
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
The converter generates IDs in XXX-XXXXXX format for annotation results and sets annotations[].id to a sequential number.
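The conversion step can be pictured roughly as follows. This is a sketch, not the converter in main.py. It assumes an alphanumeric alphabet for the XXX-XXXXXX IDs, a labeling config whose control tags are named "label" and "text", and that each span is matched at its first occurrence in the text. None of these details are confirmed by this README.

```python
import random
import string

# Assumption: result IDs draw from ASCII letters and digits.
ALPHABET = string.ascii_letters + string.digits


def make_result_id(rng):
    """Build an ID shaped like XXX-XXXXXX (3 characters, dash, 6 characters)."""
    head = "".join(rng.choice(ALPHABET) for _ in range(3))
    tail = "".join(rng.choice(ALPHABET) for _ in range(6))
    return f"{head}-{tail}"


def to_label_studio(items, rng=None):
    """Convert pre-annotated items into Label Studio-style task dicts."""
    rng = rng or random.Random()
    tasks = []
    # Annotation IDs are numbered sequentially, starting at 1 (assumed start).
    for annotation_id, item in enumerate(items, start=1):
        text = item["text"]
        results = []
        for tag in item.get("tags", []):
            start = text.find(tag["span"])  # first occurrence only
            if start == -1:
                continue  # skip spans that are not present in the text
            results.append({
                "id": make_result_id(rng),
                "from_name": "label",  # assumed control tag name
                "to_name": "text",     # assumed object tag name
                "type": "labels",
                "value": {
                    "start": start,
                    "end": start + len(tag["span"]),
                    "text": tag["span"],
                    "labels": [tag["label"]],
                },
            })
        tasks.append({
            "data": {"text": text},
            "annotations": [{"id": annotation_id, "result": results}],
        })
    return tasks


sample = [{
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}]
tasks = to_label_studio(sample, random.Random(0))
print(tasks[0]["annotations"][0]["result"][1]["value"])
```

Matching on the first occurrence is a simplification: a span such as "15" that appears more than once in a query would need extra disambiguation.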
Analyze annotated datasets
dataset_analysis.parse_annotated(path) returns a dict of label counts:
from dataset_analysis import parse_annotated
counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
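For readers without the repository at hand, the idea behind counting labels can be sketched independently of dataset_analysis. The function below is not parse_annotated itself. It assumes the annotated file is a list of Label Studio-style tasks whose results store labels under value.labels, and the name count_labels is hypothetical.

```python
from collections import Counter


def count_labels(tasks):
    """Count how often each label appears across all annotation results."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


# Small inline sample in the assumed annotated shape.
tasks = [{
    "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
    "annotations": [{
        "id": 1,
        "result": [
            {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
            {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
        ],
    }],
}]

label_counts = count_labels(tasks)
print(label_counts)  # {'ACTION': 1, 'NUMBER': 1}
```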