# clint-dataset
Dataset for labelling queries containing tasks in natural language, with a focus on command-line operations pertaining to popular CLI tools.
These queries were generated by prompting various commercially available LLMs and pre-annotated using Gemini-2.5-flash-lite. They were then converted to a Label Studio-supported format, and the annotations were manually revised.
## Training
```bash
uv run mini_lm.py
```
```bash
uv run inference.py --top-k
```
## Setup
```bash
uv sync
```
Set your Gemini API key:
```bash
export GEMINI_API_KEY="your-key"
```
Optional environment variables:
```bash
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
```
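As a rough illustration of how these variables could be consumed, here is a minimal sketch. The helper name `load_gemini_settings` is hypothetical, and the fallback values are assumptions taken from the examples in this README, not necessarily what `main.py` does.

```python
import os


def load_gemini_settings(env=None):
    """Collect Gemini settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        # The API key is the only variable this README describes as required.
        raise RuntimeError("GEMINI_API_KEY is not set")

    return {
        "api_key": api_key,
        # Assumed fallbacks, copied from the example values in this README.
        "model": env.get("GEMINI_MODEL", "gemini-2.5-flash-lite"),
        "raw_log_file": env.get("GEMINI_RAW_LOG_FILE", "logs/gemini_raw.log"),
    }
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.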
## Pre-annotate raw datasets
Raw datasets live in `datasets/raw`. Each file is a JSON array of objects with a `text` field:
```json
[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
```
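Before spending API calls, it may help to check that a raw file matches this shape. The sketch below is one way to do that; `validate_raw_items` and `validate_raw_file` are hypothetical helpers, not part of this repository.

```python
import json
from pathlib import Path


def validate_raw_items(items):
    """Return a list of human-readable problems found in a parsed raw dataset."""
    if not isinstance(items, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object")
        elif not isinstance(item.get("text"), str) or not item["text"].strip():
            problems.append(f"item {index}: missing or empty 'text'")
    return problems


def validate_raw_file(path):
    """Load a raw dataset file and validate its contents."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return validate_raw_items(items)
```

An empty list means the file looks usable as pre-annotator input.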
Run the pre-annotator:
```bash
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
```
Output format (per item):
```json
{
"text": "Trim the first 15 seconds from 'video.mp4'.",
"tags": [
{ "span": "Trim", "label": "ACTION" },
{ "span": "15", "label": "NUMBER" }
]
}
```
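Tags in this format carry the span text but no character offsets, which span-based annotation tools generally need. A minimal sketch of recovering offsets by searching the text is shown below. `locate_spans` is a hypothetical helper, and the assumption that spans appear verbatim and in reading order may not hold for every item.

```python
def locate_spans(item):
    """Attach start/end character offsets to each tag by searching the text."""
    text = item["text"]
    located = []
    # Remember where the last match for each span ended, so a repeated span
    # maps to successive occurrences instead of always the first one.
    next_search_from = {}

    for tag in item["tags"]:
        span = tag["span"]
        start = text.find(span, next_search_from.get(span, 0))
        if start == -1:
            # Span does not occur verbatim; leave it for manual review.
            continue
        end = start + len(span)
        next_search_from[span] = end
        located.append(
            {"start": start, "end": end, "text": span, "label": tag["label"]}
        )
    return located
```

Offsets are half-open (`text[start:end] == span`), which matches Python slicing.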
Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).
## Convert preannotated → annotated
Convert pre-annotated files to Label Studio-style annotated JSON:
```bash
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
```
The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.
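The ID scheme can be sketched roughly as follows. This is not the converter's actual code: the alphanumeric character set, the numbering starting at 1, and the `from_name`/`to_name` values (`label`, `text`) are all assumptions, and the function names are hypothetical.

```python
import random
import string

# Assumption: the README only documents the XXX-XXXXXX shape, not the alphabet.
ALPHABET = string.ascii_letters + string.digits


def make_result_id(rng=random):
    """Build an ID shaped like XXX-XXXXXX for a single annotation result."""
    head = "".join(rng.choice(ALPHABET) for _ in range(3))
    tail = "".join(rng.choice(ALPHABET) for _ in range(6))
    return f"{head}-{tail}"


def to_task(item, annotation_id):
    """Turn one pre-annotated item into a Label Studio-style task dict."""
    text = item["text"]
    results = []
    for tag in item["tags"]:
        start = text.find(tag["span"])
        if start == -1:
            continue  # span not found verbatim; skipped in this sketch
        results.append({
            "id": make_result_id(),
            "from_name": "label",  # hypothetical control name
            "to_name": "text",     # hypothetical object name
            "type": "labels",
            "value": {
                "start": start,
                "end": start + len(tag["span"]),
                "text": tag["span"],
                "labels": [tag["label"]],
            },
        })
    return {
        "data": {"text": text},
        "annotations": [{"id": annotation_id, "result": results}],
    }


def convert(items):
    """Number annotations sequentially, starting at 1 (an assumption)."""
    return [to_task(item, n) for n, item in enumerate(items, start=1)]
```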
## Analyze annotated datasets
`dataset_analysis.parse_annotated(path)` returns a dict of label counts:
```python
from dataset_analysis import parse_annotated
counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
```
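For readers who want to see what such a count involves, here is a minimal stand-in that works on already-loaded tasks. It assumes the Label Studio-style layout (`annotations[].result[].value.labels`) and is not the implementation of `parse_annotated`; `count_labels` is a hypothetical name.

```python
from collections import Counter


def count_labels(tasks):
    """Count how often each label appears across all annotation results."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Each result may carry one or more labels for its span.
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)
```

Label counts like these are a quick way to spot class imbalance before training.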