added base datasets

2026-04-07 22:00:40 +05:30
commit ec6fbe40e4
14 changed files with 9253 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,73 @@
+# clint-dataset
+
+Dataset for labelling queries containing tasks in natural language, with a focus on command-line operations pertaining to popular CLI tools.
+
+These queries were generated by prompting various commercially available LLMs and were pre-annotated using Gemini-2.5-flash-lite. The pre-annotations were then converted to a Label Studio-compatible format and manually revised.
+
+## Setup
+
+```bash
+uv sync
+```
+
+Set your Gemini API key:
+
+```bash
+export GEMINI_API_KEY="your-key"
+```
+
+Optional environment variables:
+
+```bash
+export GEMINI_MODEL="gemini-2.5-flash-lite"
+export GEMINI_RAW_LOG_FILE="logs/gemini_raw.log"
+```
+
+## Pre-annotate raw datasets
+
+Raw datasets live in `datasets/raw`. Each file is a JSON array of objects with a `text` field:
+
+```json
+[{ "text": "Trim the first 15 seconds from 'video.mp4'." }]
+```
+
+Run the pre-annotator:
+
+```bash
+uv run python main.py --mode preannotate --input-dir datasets/raw --output-dir datasets/preannotated --batch-size 20
+```
+
+Output format (per item):
+
+```json
+{
+  "text": "Trim the first 15 seconds from 'video.mp4'.",
+  "tags": [
+    { "span": "Trim", "label": "ACTION" },
+    { "span": "15", "label": "NUMBER" }
+  ]
+}
+```
+
+Raw Gemini responses are logged to `logs/gemini_raw.log` (override with `--raw-log-file` or `GEMINI_RAW_LOG_FILE`).
+
+## Convert preannotated → annotated
+
+Convert pre-annotated files to Label Studio–style annotated JSON:
+
+```bash
+uv run python main.py --mode convert --input-dir datasets/preannotated --output-dir datasets/annotated
+```
+
+The converter generates IDs in `XXX-XXXXXX` format for annotation results and sets `annotations[].id` to a sequential number.
+
+## Analyze annotated datasets
+
+`dataset_analysis.parse_annotated(path)` returns a dict of label counts:
+
+```python
+from dataset_analysis import parse_annotated
+
+counts = parse_annotated("datasets/annotated/ffmpeg_gpt_v1.json")
+print(counts)
+```
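The README does not show how `parse_annotated` arrives at its label counts. The following is a minimal sketch of that kind of counting, assuming the annotated files follow Label Studio's usual export layout, in which each task holds `annotations[].result[].value.labels`. The function name `count_labels` and the sample task are hypothetical and are not taken from the repository.

```python
from collections import Counter


def count_labels(tasks):
    """Count label occurrences across Label Studio-style tasks (assumed layout)."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for result in annotation.get("result", []):
                # Assumption: each result stores its labels under value.labels.
                counts.update(result.get("value", {}).get("labels", []))
    return dict(counts)


# Hypothetical sample task mirroring the README's ffmpeg example.
sample = [
    {
        "data": {"text": "Trim the first 15 seconds from 'video.mp4'."},
        "annotations": [
            {
                "id": 1,
                "result": [
                    {"value": {"start": 0, "end": 4, "text": "Trim", "labels": ["ACTION"]}},
                    {"value": {"start": 15, "end": 17, "text": "15", "labels": ["NUMBER"]}},
                ],
            }
        ],
    }
]

print(count_labels(sample))  # {'ACTION': 1, 'NUMBER': 1}
```

To run this against a file on disk, load it with `json.load` first and pass in the resulting list. Whether the real files match this layout should be checked against the converter's output.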