added base datasets

2026-04-07 22:00:40 +05:30
commit ec6fbe40e4
14 changed files with 9253 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,73 @@
# clint-dataset
A dataset for labelling natural-language task queries, with a focus on command-line operations for popular CLI tools.
The queries were generated by prompting several commercially available LLMs and pre-annotated with Gemini-2.5-flash-lite. They were then converted to a Label Studio-supported format, and the annotations were revised manually.
## Setup
```bash
uv sync
```
Set your Gemini API key:
```bash
export GEMINI_API_KEY="your-key"
```
Optional environment variables:
```bash
export GEMINI_MODEL="gemini-2.5-flash-lite"
export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
```
## Pre-annotate raw datasets
Raw datasets live in `datasets/raw` and contain:
```json
[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
```
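Before pre-annotating, it can help to check that a raw file really has this shape. The following is a minimal sketch, not part of the repository: `validate_raw_items` and `load_raw_items` are hypothetical helper names, and the only assumption about the data is the `[{"text": ...}]` layout shown above.

```python
import json
from pathlib import Path


def validate_raw_items(items):
    """Check that `items` is a list of objects with a non-empty "text" string."""
    if not isinstance(items, list):
        raise ValueError("expected a JSON array at the top level")
    for index, item in enumerate(items):
        text = item.get("text") if isinstance(item, dict) else None
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"item {index} has no usable 'text' field")
    return items


def load_raw_items(path):
    """Read one raw dataset file (hypothetical helper) and validate its contents."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return validate_raw_items(items)
```

For example, `load_raw_items("datasets/raw/some_file.json")` would return the list unchanged or raise `ValueError` on a malformed entry; the file name here is only a placeholder.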
Run the pre-annotator:
```bash
uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
```
Output format (per item):
```json
{
  "text": "Trim the first 15 seconds from 'video.mp4'.",
  "tags": [
    { "span": "Trim", "label": "ACTION" },
    { "span": "15", "label": "NUMBER" }
  ]
}
```
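Because each tag stores the span as text rather than as offsets, a downstream step has to locate the span inside `text`. The sketch below shows one way to do that, assuming the first occurrence of the span is the intended one; `locate_spans` is a hypothetical helper, and the repository may resolve repeated spans differently.

```python
def locate_spans(item):
    """Return (start, end, label) tuples for each tag in a pre-annotated item.

    Assumption: the first occurrence of the span text is the one that was tagged.
    """
    text = item["text"]
    located = []
    for tag in item["tags"]:
        start = text.find(tag["span"])
        if start == -1:
            # The model returned a span that does not appear verbatim in the text.
            raise ValueError(f"span {tag['span']!r} not found in text")
        located.append((start, start + len(tag["span"]), tag["label"]))
    return located


item = {
    "text": "Trim the first 15 seconds from 'video.mp4'.",
    "tags": [
        {"span": "Trim", "label": "ACTION"},
        {"span": "15", "label": "NUMBER"},
    ],
}
print(locate_spans(item))  # [(0, 4, 'ACTION'), (15, 17, 'NUMBER')]
```

A span that occurs more than once in the text is ambiguous under this approach, which is one reason manual revision of the annotations remains useful.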
Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).
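The README does not say which override wins when both are set. The sketch below assumes the command-line flag takes priority over the environment variable, which in turn takes priority over the default path; `resolve_raw_log_file` is a hypothetical helper and may not match how `main.py` resolves the value.

```python
import argparse
import os

DEFAULT_RAW_LOG = "logs/gemini_raw.log"


def resolve_raw_log_file(argv, environ=None):
    """Pick the raw log path: flag first, then env var, then default (assumed order)."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--raw-log-file", default=None)
    # parse_known_args ignores the other flags (--mode, --input-dir, ...).
    args, _unknown = parser.parse_known_args(argv)
    return args.raw_log_file or environ.get("GEMINI_RAW_LOG_FILE") or DEFAULT_RAW_LOG


print(resolve_raw_log_file(["--mode", "preannotate"], {}))  # logs/gemini_raw.log
```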
## Convert preannotated → annotated
Convert pre-annotated files to Label Studio-style annotated JSON:
```bash
uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
```
The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.
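To make the target shape concrete, here is a rough sketch of what such a conversion could look like. It is not the repository's converter: the ID alphabet (uppercase letters and digits), the `from_name`/`to_name` values, and the helper names `make_result_id` and `to_task` are all assumptions. The `data`/`annotations`/`result`/`value` nesting follows the usual Label Studio task layout for span labels.

```python
import secrets
import string

# Assumed character set; the README only specifies the XXX-XXXXXX shape.
ALPHABET = string.ascii_uppercase + string.digits


def make_result_id():
    """Build a random ID shaped like XXX-XXXXXX."""
    left = "".join(secrets.choice(ALPHABET) for _ in range(3))
    right = "".join(secrets.choice(ALPHABET) for _ in range(6))
    return f"{left}-{right}"


def to_task(item, annotation_id):
    """Convert one pre-annotated item into a Label Studio-style task dict."""
    text = item["text"]
    results = []
    for tag in item["tags"]:
        start = text.find(tag["span"])  # first occurrence, as a simplification
        if start == -1:
            continue  # skip spans that do not appear verbatim in the text
        results.append({
            "id": make_result_id(),
            "from_name": "label",  # assumed name from the labeling config
            "to_name": "text",     # assumed name from the labeling config
            "type": "labels",
            "value": {
                "start": start,
                "end": start + len(tag["span"]),
                "text": tag["span"],
                "labels": [tag["label"]],
            },
        })
    return {"data": {"text": text}, "annotations": [{"id": annotation_id, "result": results}]}


def convert(items):
    """Number annotations sequentially, starting at 1."""
    return [to_task(item, number) for number, item in enumerate(items, start=1)]
```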
## Analyze annotated datasets
`dataset_analysis.parse_annotated(path)` returns a dict of label counts:
```python
from dataset_analysis import parse_annotated
counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
print(counts)
```
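For readers who want to see roughly what counting labels involves, the sketch below tallies labels over already-loaded tasks. It assumes the annotated files use the common Label Studio layout (`annotations[].result[].value.labels`); `count_labels` is a hypothetical stand-in and is not the implementation of `parse_annotated`.

```python
from collections import Counter


def count_labels(tasks):
    """Count how often each label appears across all annotation results."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Each result may carry one or more labels for its span.
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


tasks = [
    {"annotations": [{"id": 1, "result": [
        {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
        {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
    ]}]},
    {"annotations": [{"id": 2, "result": [
        {"value": {"start": 0, "end": 4, "text": "List", "labels": ["ACTION"]}},
    ]}]},
]
print(count_labels(tasks))  # {'ACTION': 2, 'NUMBER': 1}
```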