clint-dataset/README.md
2026-04-08 17:27:11 +05:30

clint-dataset

A dataset for labelling natural-language task queries, with a focus on command-line operations for popular CLI tools.

These queries were generated by prompting various commercially available LLMs and pre-annotated using Gemini-2.5-flash-lite. They were then converted to a Label Studio-supported format, and the annotations were manually revised.

Training

uv run mini_lm.py
uv run inference.py --top-k

Setup

uv sync

Set your Gemini API key:

export GEMINI_API_KEY="your-key"

Optional environment variables:

export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"

Pre-annotate raw datasets

Raw datasets live in datasets/raw and contain:

[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]

Run the pre-annotator:

uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20

Output format (per item):

{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
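
A quick sanity check on pre-annotated items can catch tags whose span does not occur in the text. The following is a minimal sketch, assuming only the item shape shown above; the helper name find_bad_tags is hypothetical and not part of this repository.

```python
def find_bad_tags(item):
    """Return the tags whose span is not a literal substring of the item's text."""
    text = item["text"]
    return [tag for tag in item.get("tags", []) if tag["span"] not in text]


# Example item in the per-item output format described above.
item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}

# A deliberately broken copy: "30" never appears in the text.
broken = {
    "text": item["text"],
    "tags": item["tags"] + [{"span": "30", "label": "NUMBER"}],
}

print(find_bad_tags(item))
print(find_bad_tags(broken))
```

Items that return a non-empty list are worth reviewing by hand before conversion, since a span that cannot be located in the text cannot be given character offsets.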

Raw Gemini responses are logged to logs/gemini_raw.log (override with --raw-log-file or GEMINI_RAW_LOG_FILE).

Convert preannotated → annotated

Convert pre-annotated files to Label Studio-style annotated JSON:

uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated

The converter generates IDs in XXX-XXXXXX format for annotation results and sets annotations[].id to a sequential number.
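
To illustrate the shape of the conversion, here is a rough sketch that turns one pre-annotated item into a Label Studio-style task. It is not the converter in main.py. The ID alphabet (uppercase letters and digits), the from_name/to_name values, and the helper names make_result_id and to_label_studio are all assumptions made for illustration; the real values depend on the converter and on your Label Studio labeling config.

```python
import random
import string

# Assumption: the README only says IDs look like XXX-XXXXXX; the alphabet is a guess.
ID_ALPHABET = string.ascii_uppercase + string.digits


def make_result_id(rng):
    """Build a hypothetical XXX-XXXXXX identifier for one annotation result."""
    head = "".join(rng.choice(ID_ALPHABET) for _ in range(3))
    tail = "".join(rng.choice(ID_ALPHABET) for _ in range(6))
    return f"{head}-{tail}"


def to_label_studio(item, annotation_id, rng):
    """Convert one pre-annotated item into a Label Studio-style task dict."""
    text = item["text"]
    results = []
    for tag in item.get("tags", []):
        span = tag["span"]
        start = text.find(span)  # first occurrence only; enough for a sketch
        if start < 0:
            continue  # skip spans that do not occur in the text
        results.append({
            "id": make_result_id(rng),
            "from_name": "label",  # assumed control tag name
            "to_name": "text",     # assumed object tag name
            "type": "labels",
            "value": {
                "start": start,
                "end": start + len(span),
                "text": span,
                "labels": [tag["label"]],
            },
        })
    return {
        "data": {"text": text},
        "annotations": [{"id": annotation_id, "result": results}],
    }


item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}

# annotation_id would be a running counter across items (1, 2, 3, ...).
task = to_label_studio(item, annotation_id=1, rng=random.Random(0))
print(task["annotations"][0]["id"], len(task["annotations"][0]["result"]))
```

Using the first occurrence of each span is a simplification: a span that appears more than once in a query would need disambiguation, which is one reason manual revision in Label Studio is still useful.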

Analyze annotated datasets

dataset_analysis.parse_annotated(path) returns a dict of label counts:

from dataset_analysis import parse_annotated

counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
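
For intuition about what a label count looks like, the sketch below tallies labels over Label Studio-style tasks held in memory. It is not the implementation of parse_annotated; the function name count_labels and the sample data are hypothetical, and the task layout (annotations[].result[].value.labels) is assumed to follow the usual Label Studio export structure.

```python
from collections import Counter


def count_labels(tasks):
    """Count how often each label appears across all annotation results."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


# Hypothetical sample with one task and two labelled spans.
sample_tasks = [
    {
        "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
        "annotations": [
            {
                "id": 1,
                "result": [
                    {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
                    {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
                ],
            }
        ],
    }
]

print(count_labels(sample_tasks))
```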