# clint-dataset

A dataset for labelling natural-language task queries, with a focus on command-line operations for popular CLI tools. The queries were generated by prompting various commercially available LLMs and pre-annotated with Gemini-2.5-flash-lite. They were then converted to a Label Studio-supported format, and the annotations were revised manually.

## Training

```bash
uv run mini_lm.py
```

```bash
uv run inference.py --top-k
```

## Setup

```bash
uv sync
```

Set your Gemini API key:

```bash
export GEMINI_API_KEY="your-key"
```

Optional environment variables:

```bash
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
```

## Pre-annotate raw datasets

Raw datasets live in `datasets/raw`. Each file contains a JSON array of items:

```json
[
  { "text": "Trim the first 15 seconds from 'video.mp4'." }
]
```

Run the pre-annotator:

```bash
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
```

Output format (per item):

```json
{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
```

Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).

## Convert preannotated → annotated

Convert pre-annotated files to Label Studio-style annotated JSON:

```bash
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
```

The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.

## Analyze annotated datasets

`dataset_analysis.parse_annotated(path)` returns a dict of label counts:

```python
from dataset_analysis import parse_annotated

counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
```
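
To illustrate what counting labels over an annotated file might involve, here is a minimal sketch. It is not the implementation of `parse_annotated`. It assumes the annotated files follow the usual Label Studio export layout (a list of tasks, each with `annotations[].result[].value.labels`), which this README does not confirm. The function name `count_labels`, the sample task, and the result IDs are hypothetical.

```python
from collections import Counter


def count_labels(tasks):
    """Count labels across Label Studio-style tasks.

    Assumption: each task has `annotations`, each annotation has `result`,
    and each result stores its labels under `value.labels`.
    """
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                counts.update(result.get("value", {}).get("labels", []))
    return counts


# Hypothetical in-memory sample shaped like a single annotated task.
sample = [
    {
        "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
        "annotations": [
            {
                "id": 1,
                "result": [
                    {
                        "id": "abc-000001",  # hypothetical XXX-XXXXXX style ID
                        "type": "labels",
                        "value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]},
                    },
                    {
                        "id": "abc-000002",  # hypothetical XXX-XXXXXX style ID
                        "type": "labels",
                        "value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]},
                    },
                ],
            }
        ],
    }
]

print(dict(count_labels(sample)))
```

To run the same sketch on a file, load it with `json.load` and pass the resulting list to `count_labels`, provided the file matches the assumed layout.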