# clint-dataset

A dataset for labelling natural-language task queries, with a focus on command-line operations for popular CLI tools.

The queries were generated by prompting various commercially available LLMs and pre-annotated with Gemini-2.5-flash-lite. They were then converted to a Label Studio-compatible format, and the annotations were manually revised.

## Setup

```bash
uv sync
```

Set your Gemini API key:

```bash
export GEMINI_API_KEY="your-key"
```

Optional environment variables:

```bash
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
```

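For orientation, here is a minimal sketch of how a script might read these variables. `load_settings` is a hypothetical helper, not a function from this repository, and the fallback values simply mirror the examples above; the actual defaults in `main.py` may differ.

```python
import os


def load_settings(env=None):
    """Collect Gemini settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {
        "api_key": api_key,
        # Assumed fallbacks, copied from the README examples.
        "model": env.get("GEMINI_MODEL", "gemini-2.5-flash-lite"),
        "raw_log_file": env.get("GEMINI_RAW_LOG_FILE", "logs/gemini_raw.log"),
    }
```

Accepting an optional mapping instead of always reading `os.environ` keeps the helper easy to exercise in tests.
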
## Pre-annotate raw datasets

Raw datasets live in `datasets/raw`. Each file contains a JSON array of objects with a `text` field:

```json
[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
```

Run the pre-annotator:

```bash
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
```

Output format (per item):

```json
{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
```

Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).

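The per-item format stores each span as a plain string, so any tool that needs character offsets has to locate the span in `text`. The sketch below shows one way to do that and doubles as a sanity check that every span actually occurs in its text. `locate_spans` and its matching strategy (first occurrence after the previous match, with a fallback to the start of the text) are assumptions for illustration, not the repository's own logic.

```python
def locate_spans(item):
    """Return (start, end, span, label) tuples for each tag in a pre-annotated item.

    Hypothetical helper: each span is matched to its first occurrence at or
    after the end of the previous match, falling back to a search from the
    start of the text when tags are not listed in reading order.
    """
    text = item["text"]
    located = []
    cursor = 0
    for tag in item.get("tags", []):
        span = tag["span"]
        start = text.find(span, cursor)
        if start == -1:
            start = text.find(span)
        if start == -1:
            raise ValueError(f"span {span!r} not found in text {text!r}")
        end = start + len(span)
        located.append((start, end, span, tag["label"]))
        cursor = end
    return located


item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}
print(locate_spans(item))
```

A span that the model paraphrased or hallucinated raises a `ValueError`, which makes such items easy to flag before manual review.
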
## Convert preannotated → annotated

Convert pre-annotated files to Label Studio–style annotated JSON:

```bash
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
```

The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.

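To make the ID scheme concrete, here is a rough sketch of what a converted task could look like. It is not the converter's actual code: the alphanumeric ID alphabet, the `from_name`/`to_name` values (which depend on the Label Studio labeling config), and both helper names are assumptions. Spans are passed in as `(start, end, span, label)` tuples.

```python
import random
import string

# Assumption: IDs are drawn from letters and digits; the README only fixes the shape.
ID_ALPHABET = string.ascii_letters + string.digits


def make_result_id(rng=random):
    """Build an ID shaped like XXX-XXXXXX (3 characters, a hyphen, 6 characters)."""
    head = "".join(rng.choice(ID_ALPHABET) for _ in range(3))
    tail = "".join(rng.choice(ID_ALPHABET) for _ in range(6))
    return f"{head}-{tail}"


def to_label_studio_task(text, located_spans, annotation_id):
    """Wrap located spans as one Label Studio-style task with a single annotation."""
    results = [
        {
            "id": make_result_id(),
            "from_name": "label",  # assumption: must match the labeling config
            "to_name": "text",     # assumption: must match the labeling config
            "type": "labels",
            "value": {"start": start, "end": end, "text": span, "labels": [label]},
        }
        for start, end, span, label in located_spans
    ]
    return {
        "data": {"text": text},
        "annotations": [{"id": annotation_id, "result": results}],
    }


task = to_label_studio_task(
    "Trim the first 15 seconds from 'video.mp4'.",
    [(0, 4, "Trim", "ACTION"), (15, 17, "15", "NUMBER")],
    annotation_id=1,  # sequential number, as described above
)
```
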
## Analyze annotated datasets

`dataset_analysis.parse_annotated(path)` returns a dict of label counts:

```python
from dataset_analysis import parse_annotated

counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
```
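
For readers without the repository at hand, label counting over Label Studio-style JSON can be sketched as follows. This is an illustration, not the implementation of `parse_annotated`: it assumes each file is a list of tasks whose labels sit under `annotations[].result[].value.labels`, and `count_labels` and `count_labels_in_file` are hypothetical names.

```python
import json
from collections import Counter


def count_labels(tasks):
    """Count label occurrences across a list of Label Studio-style tasks."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Each result may carry one or more labels for its span.
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


def count_labels_in_file(path):
    """Load an annotated JSON file and count its labels."""
    with open(path, encoding="utf-8") as handle:
        return count_labels(json.load(handle))


sample_tasks = [
    {
        "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
        "annotations": [
            {
                "id": 1,
                "result": [
                    {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
                    {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
                ],
            }
        ],
    }
]
print(count_labels(sample_tasks))
```
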