# clint-dataset

A dataset of natural-language task queries with span-level labels, focused on command-line operations for popular CLI tools.

The queries were generated by prompting several commercially available LLMs and pre-annotated with Gemini-2.5-flash-lite. The pre-annotations were then converted to a Label Studio-compatible format and manually revised.

## Training

Train the model:

```bash
uv run mini_lm.py
```

Run inference:

```bash
uv run inference.py --top-k
```

## Setup

```bash
uv sync
```

Set your Gemini API key:

```bash
export GEMINI_API_KEY="your-key"
```

Optional environment variables:

```bash
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
```
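For orientation, a script could read these variables roughly as follows. This is a minimal sketch, not the repository's actual code: the `Settings` dataclass and `load_settings` helper are hypothetical, and the fallback values are assumed to match the example values shown in this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    model: str
    raw_log_file: str


def load_settings(env=None) -> Settings:
    """Hypothetical helper: build settings from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        # The API key has no default, so fail early with a clear message.
        raise RuntimeError("GEMINI_API_KEY is not set")
    return Settings(
        api_key=api_key,
        # Assumed defaults, copied from the example values in this README.
        model=env.get("GEMINI_MODEL", "gemini-2.5-flash-lite"),
        raw_log_file=env.get("GEMINI_RAW_LOG_FILE", "logs/gemini_raw.log"),
    )
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.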

## Pre-annotate raw datasets

Raw datasets live in `datasets/raw`. Each file is a JSON array of objects with a single `text` field:

```json
[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
```
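Before spending API calls, it can help to check that a raw file has this shape. The sketch below is an illustration only: `load_raw_dataset` is a hypothetical helper, not part of `main.py`, and it assumes each file holds a JSON array of objects with a string `text` field, as in the example above.

```python
import json
from pathlib import Path


def load_raw_dataset(path):
    """Load a raw dataset file and check that every item has a string `text`."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError(f"{path}: expected a JSON array")
    for i, item in enumerate(items):
        if not isinstance(item, dict) or not isinstance(item.get("text"), str):
            raise ValueError(f"{path}: item {i} has no string 'text' field")
    return items
```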

Run the pre-annotator:

```bash
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
```

Output format (per item):

```json
{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
```
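The pre-annotated format stores each span as a string rather than as character offsets. A consumer that needs offsets has to locate each span in `text`. The sketch below shows one way to do that; `locate_spans` is a hypothetical helper, and the rule that a repeated span string maps to successive occurrences is an assumption, not documented behavior of this repository.

```python
def locate_spans(text, tags):
    """Return (start, end, label) for each tag, scanning left to right.

    Assumption: a repeated span string refers to successive occurrences.
    """
    located = []
    next_search = {}  # span string -> index to resume searching from
    for tag in tags:
        span, label = tag["span"], tag["label"]
        start = text.find(span, next_search.get(span, 0))
        if start == -1:
            raise ValueError(f"span {span!r} not found in text")
        end = start + len(span)
        next_search[span] = end
        located.append((start, end, label))
    return located


item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}
for start, end, label in locate_spans(item["text"], item["tags"]):
    print(label, repr(item["text"][start:end]), start, end)
```

Raising on a missing span surfaces pre-annotations where the model paraphrased the text instead of quoting it.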

Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).

## Convert preannotated → annotated

Convert pre-annotated files to Label Studio–style annotated JSON:

```bash
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
```

The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.
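As a rough illustration of the output shape, the sketch below builds one Label Studio-style task. It is not the converter's actual code: `make_result_id` and `to_task` are hypothetical names, the ID character set (uppercase letters and digits) is an assumption, and the `from_name`/`to_name` values are placeholders that would need to match the project's labeling configuration. Spans are passed in as `(start, end, label)` character-offset tuples.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # assumed character set


def make_result_id():
    """Generate an ID shaped like XXX-XXXXXX (3 characters, dash, 6 characters)."""
    head = "".join(secrets.choice(ALPHABET) for _ in range(3))
    tail = "".join(secrets.choice(ALPHABET) for _ in range(6))
    return f"{head}-{tail}"


def to_task(text, spans, annotation_id):
    """Build one Label Studio-style task from (start, end, label) spans."""
    results = [
        {
            "id": make_result_id(),
            "from_name": "label",  # assumed; must match the labeling config
            "to_name": "text",     # assumed; must match the labeling config
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        }
        for start, end, label in spans
    ]
    return {
        "data": {"text": text},
        "annotations": [{"id": annotation_id, "result": results}],
    }


texts = ["Trim the first 15 seconds from 'video.mp4'."]
# annotations[].id is a sequential number, starting at 1 in this sketch.
tasks = [to_task(t, [(0, 4, "ACTION")], i) for i, t in enumerate(texts, start=1)]
```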

## Analyze annotated datasets

`dataset_analysis.parse_annotated(path)` returns a dict of label counts:

```python
from dataset_analysis import parse_annotated

counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
```
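To make the idea of label counting concrete, here is a minimal sketch over already-loaded tasks. It is not the implementation of `parse_annotated`: `count_labels` is a hypothetical function, and it assumes each task follows the common Label Studio layout in which labels sit under `annotations[].result[].value.labels`.

```python
from collections import Counter


def count_labels(tasks):
    """Count label occurrences across all annotation results in a task list."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Each result may carry one or more labels for its span.
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


example_tasks = [
    {
        "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
        "annotations": [
            {
                "id": 1,
                "result": [
                    {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
                    {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
                ],
            }
        ],
    }
]
print(count_labels(example_tasks))
```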