From cf97e935f211e3c36fbbe4230f3b546a68662c6b Mon Sep 17 00:00:00 2001
From: Caleb Dean <realcalebdean@gmail.com>
Date: Sun, 28 Jun 2026 00:32:15 -0500
Subject: [PATCH 1/2] Add Extract Embedded Subtitles plugin

Extracts embedded text subtitle tracks from video files into external
.srt sidecar files that Stash recognises as captions (stashapp/stash#3875).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
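Reviewer note: the sidecar naming rule this patch relies on can be sketched in
isolation. The snippet below is a simplified illustration, not the plugin code
itself. `sidecar_name` is a hypothetical helper that folds `normalize_lang` and
`caption_path` into one function, and the file names are made up.

```python
import os

# Placeholder language tags treated as "no language" (mirrors UNKNOWN_LANGS in the patch).
UNKNOWN_LANGS = {"", "und", "unknown", "none", "mis", "mul", "zxx"}


def sidecar_name(video_path, raw_lang):
    """Hypothetical helper: build the .srt sidecar path for one subtitle track."""
    lang = str(raw_lang or "").strip().lower()
    # Keep only 2- or 3-letter alphabetic codes; everything else counts as unknown.
    if lang in UNKNOWN_LANGS or len(lang) not in (2, 3) or not lang.isalpha():
        lang = ""
    base, _ext = os.path.splitext(video_path)
    return base + ("." + lang if lang else "") + ".srt"


print(sidecar_name("movie.mkv", "ENG"))    # movie.eng.srt
print(sidecar_name("movie.mkv", "und"))    # movie.srt
print(sidecar_name("movie.mkv", "en-US"))  # movie.srt
```

Anything that is not a plain 2- or 3-letter code falls back to the untagged
`<basename>.srt` form, so an unusual tag never produces a file name that the
caption matcher would ignore.
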
 plugins/extractSubtitles/README.md            |  73 ++++
 plugins/extractSubtitles/extractSubtitles.py  | 335 ++++++++++++++++++
 plugins/extractSubtitles/extractSubtitles.yml |  27 ++
 plugins/extractSubtitles/requirements.txt     |   1 +
 4 files changed, 436 insertions(+)
 create mode 100644 plugins/extractSubtitles/README.md
 create mode 100644 plugins/extractSubtitles/extractSubtitles.py
 create mode 100644 plugins/extractSubtitles/extractSubtitles.yml
 create mode 100644 plugins/extractSubtitles/requirements.txt
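
Reviewer note: a minimal sketch of the blocklist check applied to ffprobe
output, assuming a JSON payload shaped like the one returned by
`ffprobe -print_format json -show_streams -select_streams s`. The sample
payload and the `split_streams` helper are hypothetical and exist only for
illustration.

```python
import json

# Codecs the patch treats as image-based and skips (same names as IMAGE_SUB_CODECS).
IMAGE_SUB_CODECS = {
    "hdmv_pgs_subtitle", "pgssub", "dvd_subtitle", "dvdsub",
    "dvb_subtitle", "dvbsub", "dvb_teletext", "xsub",
}

# Made-up sample with one SubRip, one PGS, and one untagged ASS stream.
SAMPLE = """
{"streams": [
  {"index": 2, "codec_name": "subrip", "tags": {"language": "eng"}},
  {"index": 3, "codec_name": "hdmv_pgs_subtitle", "tags": {"language": "eng"}},
  {"index": 4, "codec_name": "ass"}
]}
"""


def split_streams(ffprobe_json):
    """Hypothetical helper: return (convertible, skipped) stream indexes."""
    convertible, skipped = [], []
    for stream in json.loads(ffprobe_json).get("streams", []):
        codec = (stream.get("codec_name") or "").lower()
        # Anything not on the blocklist is handed to ffmpeg for conversion.
        (skipped if codec in IMAGE_SUB_CODECS else convertible).append(stream["index"])
    return convertible, skipped


print(split_streams(SAMPLE))  # ([2, 4], [3])
```

Because the check is a blocklist, a text codec that is not named anywhere
still lands in the convertible list and is attempted rather than dropped.
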
diff --git a/plugins/extractSubtitles/README.md b/plugins/extractSubtitles/README.md
new file mode 100644
index 00000000..58be0686
--- /dev/null
+++ b/plugins/extractSubtitles/README.md
@@ -0,0 +1,73 @@
+# Extract Embedded Subtitles
+
+A Stash plugin that pulls **embedded text subtitle tracks** out of your video
+files and writes them as external `.srt` sidecar files. Stash already displays
+external captions, so once the files exist they show up in the video player's
+caption menu — this just automates creating them.
+
+Addresses [stashapp/stash#3875](https://github.com/stashapp/stash/issues/3875).
+
+## Why a plugin?
+
+Video.js (Stash's player) cannot render *embedded* subtitle tracks directly, and
+the maintainers' preferred path for this feature is a plugin that generates
+sidecar `.srt` files (see the issue discussion). This plugin does exactly that,
+naming the files so Stash's built-in caption matcher attaches them automatically.
+
+## What it does
+
+- Uses `ffprobe` to find subtitle streams in each scene's file(s).
+- Extracts each **text** track (SubRip, ASS/SSA, mov_text, WebVTT, …) to
+  `videoname.<lang>.srt` next to the video, matching Stash's caption naming
+  convention (`pkg/file/video/caption.go`).
+- Triggers a scan of the affected folders so the captions appear without you
+  having to manually rescan the whole library.
+
+### Limitations
+
+- **Image-based subtitles** (PGS / Blu-ray, VobSub / DVD, DVB) are *skipped* —
+  they're bitmaps and can't be converted to text without OCR.
+- **ASS/SSA styling** (fonts, positioning, karaoke) is lost when converting to
+  SRT. This is a deliberate, predictable trade-off; the original file is never
+  modified.
+- Stash stores one caption per language, so if a file has two tracks in the same
+  language only the `default` one is extracted.
+
+## Install
+
+1. Copy the `extractSubtitles` folder into your Stash `plugins` directory.
+2. Make sure `stashapp-tools` is available to the Python that Stash uses:
+   `pip install stashapp-tools` (or use the **PythonDepManager** community plugin
+   with the bundled `requirements.txt`).
+3. `ffmpeg`/`ffprobe` must be on `PATH`, or set their paths in Stash
+   **Settings → System → FFMPEG**.
+4. In Stash, **Settings → Plugins → Reload Plugins**.
+
+## Use
+
+- **One-off / whole library:** **Settings → Tasks → Plugin Tasks →
+  Extract Embedded Subtitles → run.** This walks every scene, extracts subs,
+  then rescans the touched folders.
+- **Ongoing additions:** enable **Extract On Scan** in the plugin settings. New
+  scenes get their subtitles extracted on creation and their folder rescanned.
+  For a large *initial* import, prefer the one-off task instead of the hook to
+  avoid many small rescans.
+
+### Settings
+
+| Setting | Default | Description |
+| --- | --- | --- |
+| Overwrite Existing | off | Re-extract even if a matching `.srt` already exists. |
+| Extract On Scan | off | Auto-extract when a new scene is created. |
+
+## Notes
+
+This is a community plugin and not part of Stash core; it complements, rather
+than replaces, in-player rendering of embedded tracks.
+
+## Development
+
+LLM-assisted, human-reviewed, and tested. Subtitle extraction was verified
+end-to-end against a real `ffmpeg`/`ffprobe` using a multi-track MKV (SubRip +
+ASS→SRT conversion + an untagged track), and the output filenames were checked
+against Stash's caption-matching rules in `pkg/file/video/caption.go`.
diff --git a/plugins/extractSubtitles/extractSubtitles.py b/plugins/extractSubtitles/extractSubtitles.py
new file mode 100644
index 00000000..e7247e8a
--- /dev/null
+++ b/plugins/extractSubtitles/extractSubtitles.py
@@ -0,0 +1,335 @@
+"""Extract Embedded Subtitles - Stash plugin.
+
+Extracts embedded *text* subtitle tracks (subrip/ass/ssa/mov_text/webvtt/...)
+from video files into external `.srt` sidecar files named the way Stash expects,
+then triggers a scan so Stash attaches them as captions.
+
+Stash's caption matching (pkg/file/video/caption.go) recognises files named:
+    <video_basename>.<lang>.srt   (e.g. movie.eng.srt)
+    <video_basename>.srt          (when the language is unknown)
+where <lang> must be a valid ISO-639 code. We reproduce that naming exactly so
+the existing external-caption support picks the files up on the next scan.
+
+Image-based subtitle codecs (PGS, VobSub, DVB, etc.) cannot be converted to text
+and are skipped. Styled formats (ASS/SSA) are converted to SRT, which drops
+positioning/styling - this is the same trade-off as most players' SRT export.
+"""
+
+import json
+import os
+import subprocess
+import sys
+
+import stashapi.log as log
+from stashapi.stashapp import StashInterface
+
+# --- subtitle codec classification -----------------------------------------
+
+# Image-based subtitle codecs are bitmaps and cannot be converted to text SRT
+# without OCR, so we skip them. Every other subtitle stream is handed to ffmpeg,
+# which converts text formats (SubRip/ASS/SSA/mov_text/WebVTT/...) to SRT; if a
+# stream genuinely can't be converted, the ffmpeg call fails and is counted as
+# an error rather than silently dropped. (A blocklist beats an allowlist here:
+# we won't miss a valid-but-unlisted text format.)
+IMAGE_SUB_CODECS = {
+    "hdmv_pgs_subtitle",
+    "pgssub",
+    "dvd_subtitle",
+    "dvdsub",
+    "dvb_subtitle",
+    "dvbsub",
+    "dvb_teletext",
+    "xsub",
+}
+
+# ISO 639-2 "und" (undetermined) and similar placeholder tags - treated as "no language".
+UNKNOWN_LANGS = {"", "und", "unknown", "none", "mis", "mul", "zxx"}
+
+# Stash's CaptionExts (vtt preferred, but we emit srt as ffmpeg converts to it).
+EXISTING_CAPTION_EXTS = ("srt", "vtt")
+
+
+def normalize_lang(raw):
+    """Return a filename language segment, or '' if unknown.
+
+    Stash validates the segment with Go's language.ParseBase, which accepts
+    2- or 3-letter ISO-639 codes. ffprobe usually emits 3-letter ISO-639-2
+    (e.g. 'eng'), which is valid. Anything we don't trust maps to '' so the
+    file becomes <basename>.srt (unknown language).
+    """
+    if raw is None:
+        return ""
+    lang = str(raw).strip().lower()
+    if lang in UNKNOWN_LANGS:
+        return ""
+    # Only 2- or 3-letter alpha codes are valid ISO-639 bases.
+    if len(lang) in (2, 3) and lang.isalpha():
+        return lang
+    return ""
+
+
+def caption_path(video_path, lang):
+    """Reproduce Stash's GetCaptionPath(path, lang, 'srt')."""
+    base, _ext = os.path.splitext(video_path)
+    if lang == "":
+        return base + ".srt"
+    return base + "." + lang + ".srt"
+
+
+def has_existing_external_caption(video_path, lang):
+    """True if Stash would already see an external caption for this language."""
+    base, _ext = os.path.splitext(video_path)
+    for ext in EXISTING_CAPTION_EXTS:
+        if lang == "":
+            candidate = base + "." + ext
+        else:
+            candidate = base + "." + lang + "." + ext
+        if os.path.isfile(candidate):
+            return True
+    return False
+
+
+# --- ffprobe / ffmpeg --------------------------------------------------------
+
+def probe_subtitle_streams(ffprobe, video_path):
+    """Return the list of subtitle stream dicts from ffprobe, or [] on error."""
+    cmd = [
+        ffprobe,
+        "-v", "error",
+        "-print_format", "json",
+        "-show_streams",
+        "-select_streams", "s",
+        video_path,
+    ]
+    try:
+        out = subprocess.run(
+            cmd, capture_output=True, text=True, check=False
+        )
+    except FileNotFoundError:
+        log.error("ffprobe not found at '%s' - set the path or add it to PATH" % ffprobe)
+        raise
+    if out.returncode != 0:
+        log.warning("ffprobe failed for %s: %s" % (video_path, out.stderr.strip()))
+        return []
+    try:
+        data = json.loads(out.stdout or "{}")
+    except json.JSONDecodeError:
+        log.warning("Could not parse ffprobe output for %s" % video_path)
+        return []
+    return data.get("streams", [])
+
+
+def extract_stream(ffmpeg, video_path, stream_index, dest, overwrite):
+    """Run ffmpeg to extract one subtitle stream to dest as SRT.
+
+    Returns True on success.
+    """
+    cmd = [
+        ffmpeg,
+        "-loglevel", "error",
+        "-y" if overwrite else "-n",
+        "-i", video_path,
+        "-map", "0:%d" % stream_index,
+        "-c:s", "srt",
+        dest,
+    ]
+    try:
+        out = subprocess.run(cmd, capture_output=True, text=True, check=False)
+    except FileNotFoundError:
+        log.error("ffmpeg not found at '%s' - set the path or add it to PATH" % ffmpeg)
+        raise
+    if out.returncode != 0:
+        # ffmpeg can leave an empty file behind on some failures; clean it up.
+        if os.path.isfile(dest) and os.path.getsize(dest) == 0:
+            try:
+                os.remove(dest)
+            except OSError:
+                pass
+        log.warning("ffmpeg failed extracting stream %d of %s: %s"
+                    % (stream_index, video_path, out.stderr.strip()))
+        return False
+    if not os.path.isfile(dest) or os.path.getsize(dest) == 0:
+        log.warning("ffmpeg produced no usable subtitle for stream %d of %s"
+                    % (stream_index, video_path))
+        if os.path.isfile(dest):
+            try:
+                os.remove(dest)
+            except OSError:
+                pass
+        return False
+    return True
+
+
+def process_file(ffprobe, ffmpeg, video_path, overwrite, stats):
+    """Extract every distinct-language text subtitle from one video file.
+
+    Stash stores at most one caption per (language, type), so we extract one
+    SRT per language, preferring the stream marked 'default'. Returns the set
+    of directories that gained new subtitle files (for later rescanning).
+    """
+    if not os.path.isfile(video_path):
+        log.debug("Skipping missing file %s" % video_path)
+        return set()
+
+    streams = probe_subtitle_streams(ffprobe, video_path)
+    if not streams:
+        return set()
+
+    # Prefer default-disposition streams first so the "best" track wins a tie.
+    def sort_key(s):
+        disp = s.get("disposition", {}) or {}
+        return (
+            0 if disp.get("default") else 1,
+            1 if disp.get("forced") else 0,  # forced tracks are usually partial; rank them last
+            s.get("index", 0),
+        )
+
+    touched_dirs = set()
+    used_targets = set()
+
+    for s in sorted(streams, key=sort_key):
+        codec = (s.get("codec_name") or "").lower()
+        index = s.get("index")
+        if index is None:
+            continue
+
+        if codec in IMAGE_SUB_CODECS:
+            log.debug("Skipping image-based subtitle (%s) in %s" % (codec, video_path))
+            stats["image_skipped"] += 1
+            continue
+
+        tags = s.get("tags", {}) or {}
+        lang = normalize_lang(tags.get("language"))
+        dest = caption_path(video_path, lang)
+
+        # One caption per language: don't write a second file to the same name.
+        if dest in used_targets:
+            continue
+        used_targets.add(dest)
+
+        if not overwrite and os.path.isfile(dest):
+            log.debug("Subtitle already exists, skipping: %s" % dest)
+            stats["already_exists"] += 1
+            continue
+
+        # If the user already provided an external caption in this language
+        # (e.g. a .vtt), don't clobber/duplicate it.
+        if not overwrite and has_existing_external_caption(video_path, lang):
+            log.debug("External caption already present for lang '%s': %s" % (lang or "und", video_path))
+            stats["already_exists"] += 1
+            continue
+
+        if extract_stream(ffmpeg, video_path, index, dest, overwrite):
+            log.info("Extracted %s subtitle -> %s" % (lang or "und", os.path.basename(dest)))
+            stats["extracted"] += 1
+            touched_dirs.add(os.path.dirname(dest))
+        else:
+            stats["errors"] += 1
+
+    return touched_dirs
+
+
+# --- scene helpers -----------------------------------------------------------
+
+SCENE_FRAGMENT = "id title files { path }"
+
+
+def process_scene(stash, ffprobe, ffmpeg, scene, overwrite, stats):
+    touched = set()
+    for f in scene.get("files", []) or []:
+        path = f.get("path")
+        if path:
+            touched |= process_file(ffprobe, ffmpeg, path, overwrite, stats)
+    return touched
+
+
+def rescan_dirs(stash, dirs):
+    """Trigger a Stash scan limited to the given directories so the new
+    sidecar files are associated as captions."""
+    paths = sorted(d for d in dirs if d)
+    if not paths:
+        return
+    log.info("Triggering scan of %d folder(s) to attach captions" % len(paths))
+    try:
+        stash.metadata_scan(paths=paths)
+    except Exception as e:  # noqa: BLE001 - surface but don't crash the plugin
+        log.error("Failed to trigger scan of %s - run a Library scan manually to "
+                  "attach the new subtitles. (%s)" % (paths, e))
+
+
+# --- entry points ------------------------------------------------------------
+
+def extract_all(stash, ffprobe, ffmpeg, overwrite):
+    scenes = stash.find_scenes(fragment=SCENE_FRAGMENT)
+    total = len(scenes)
+    log.info("Checking %d scene(s) for embedded subtitles" % total)
+
+    stats = {
+        "extracted": 0,
+        "already_exists": 0,
+        "image_skipped": 0,
+        "errors": 0,
+    }
+    touched = set()
+    for i, scene in enumerate(scenes):
+        touched |= process_scene(stash, ffprobe, ffmpeg, scene, overwrite, stats)
+        if total:
+            log.progress((i + 1) / total)
+
+    rescan_dirs(stash, touched)
+
+    log.info(
+        "Done. Extracted %d, skipped %d existing, %d image-based, %d errors."
+        % (stats["extracted"], stats["already_exists"], stats["image_skipped"],
+           stats["errors"])
+    )
+
+
+def main():
+    raw = sys.stdin.read()
+    json_input = json.loads(raw)
+
+    stash = StashInterface(json_input["server_connection"])
+    config = stash.get_configuration()
+    general = config.get("general", {}) or {}
+    plugin_settings = (config.get("plugins", {}) or {}).get("extractSubtitles", {}) or {}
+
+    # ffmpeg/ffprobe: prefer Stash's configured binaries, else rely on PATH.
+    ffmpeg = general.get("ffmpegPath") or "ffmpeg"
+    ffprobe = general.get("ffprobePath") or "ffprobe"
+
+    overwrite = bool(plugin_settings.get("overwrite", False))
+    extract_on_scan = bool(plugin_settings.get("extractOnScan", False))
+
+    args = json_input.get("args", {}) or {}
+
+    # Allow a task to override overwrite via defaultArgs if desired.
+    if "overwrite" in args:
+        overwrite = bool(args["overwrite"])
+
+    mode = args.get("mode")
+    if mode == "extractAll":
+        extract_all(stash, ffprobe, ffmpeg, overwrite)
+        return
+
+    # Hook path: Scene.Create.Post
+    hook = args.get("hookContext")
+    if hook and hook.get("type") == "Scene.Create.Post":
+        if not extract_on_scan:
+            log.debug("Extract On Scan disabled; ignoring scene-create hook")
+            return
+        scene = stash.find_scene(hook.get("id"), fragment=SCENE_FRAGMENT)
+        if not scene:
+            return
+        stats = {
+            "extracted": 0, "already_exists": 0, "image_skipped": 0, "errors": 0,
+        }
+        touched = process_scene(stash, ffprobe, ffmpeg, scene, overwrite, stats)
+        rescan_dirs(stash, touched)
+        return
+
+    log.debug("No actionable mode/hook in input; nothing to do")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/plugins/extractSubtitles/extractSubtitles.yml b/plugins/extractSubtitles/extractSubtitles.yml
new file mode 100644
index 00000000..4dcce79a
--- /dev/null
+++ b/plugins/extractSubtitles/extractSubtitles.yml
@@ -0,0 +1,27 @@
+name: Extract Embedded Subtitles
+description: Extracts embedded text subtitle tracks from video files into external .srt files that Stash recognises and displays as captions.
+version: 0.1.0
+url: https://github.com/stashapp/stash/issues/3875
+exec:
+  - python
+  - "{pluginDir}/extractSubtitles.py"
+interface: raw
+settings:
+  overwrite:
+    displayName: Overwrite Existing
+    description: Re-extract and overwrite subtitle files even if a matching .srt already exists next to the video.
+    type: BOOLEAN
+  extractOnScan:
+    displayName: Extract On Scan
+    description: Automatically extract embedded subtitles when a new scene is created, then rescan its folder so captions appear. Best for incremental additions, not large initial imports.
+    type: BOOLEAN
+tasks:
+  - name: Extract Embedded Subtitles
+    description: Scan all scenes, extract embedded text subtitles to external .srt files, then trigger a scan so Stash attaches them as captions.
+    defaultArgs:
+      mode: extractAll
+hooks:
+  - name: Extract on scene create
+    description: Extract embedded subtitles after a scene is created (only runs if "Extract On Scan" is enabled in settings).
+    triggeredBy:
+      - Scene.Create.Post
diff --git a/plugins/extractSubtitles/requirements.txt b/plugins/extractSubtitles/requirements.txt
new file mode 100644
index 00000000..257e80a0
--- /dev/null
+++ b/plugins/extractSubtitles/requirements.txt
@@ -0,0 +1 @@
+stashapp-tools>=0.2.50

From 619ef9e1f8edc72daa9c616db9b95c153afca80e Mon Sep 17 00:00:00 2001
From: DogmaDragon <103123951+DogmaDragon@users.noreply.github.com>
Date: Sun, 28 Jun 2026 09:14:10 +0300
Subject: [PATCH 2/2] Refactor README for clarity

---
 plugins/extractSubtitles/README.md | 51 ++++++++----------------------
 1 file changed, 13 insertions(+), 38 deletions(-)

diff --git a/plugins/extractSubtitles/README.md b/plugins/extractSubtitles/README.md
index 58be0686..fca651f2 100644
--- a/plugins/extractSubtitles/README.md
+++ b/plugins/extractSubtitles/README.md
@@ -1,57 +1,36 @@
 # Extract Embedded Subtitles
 
-A Stash plugin that pulls **embedded text subtitle tracks** out of your video
-files and writes them as external `.srt` sidecar files. Stash already displays
-external captions, so once the files exist they show up in the video player's
-caption menu — this just automates creating them.
+A Stash plugin that pulls **embedded text subtitle tracks** out of your video files and writes them as external `.srt` sidecar files. Stash already displays external captions, so once the files exist they show up in the video player's caption menu—this just automates creating them.
 
 Addresses [stashapp/stash#3875](https://github.com/stashapp/stash/issues/3875).
 
 ## Why a plugin?
 
-Video.js (Stash's player) cannot render *embedded* subtitle tracks directly, and
-the maintainers' preferred path for this feature is a plugin that generates
-sidecar `.srt` files (see the issue discussion). This plugin does exactly that,
-naming the files so Stash's built-in caption matcher attaches them automatically.
+Video.js (Stash's player) cannot render *embedded* subtitle tracks directly, and the maintainers' preferred path for this feature is a plugin that generates sidecar `.srt` files (see the issue discussion). This plugin does exactly that, naming the files so Stash's built-in caption matcher attaches them automatically.
 
 ## What it does
 
 - Uses `ffprobe` to find subtitle streams in each scene's file(s).
-- Extracts each **text** track (SubRip, ASS/SSA, mov_text, WebVTT, …) to
-  `videoname.<lang>.srt` next to the video, matching Stash's caption naming
-  convention (`pkg/file/video/caption.go`).
-- Triggers a scan of the affected folders so the captions appear without you
-  having to manually rescan the whole library.
+- Extracts each **text** track (SubRip, ASS/SSA, mov_text, WebVTT, …) to `videoname.<lang>.srt` next to the video, matching Stash's caption naming convention (`pkg/file/video/caption.go`).
+- Triggers a scan of the affected folders so the captions appear without you having to manually rescan the whole library.
 
 ### Limitations
 
-- **Image-based subtitles** (PGS / Blu-ray, VobSub / DVD, DVB) are *skipped* —
-  they're bitmaps and can't be converted to text without OCR.
-- **ASS/SSA styling** (fonts, positioning, karaoke) is lost when converting to
-  SRT. This is a deliberate, predictable trade-off; the original file is never
-  modified.
-- Stash stores one caption per language, so if a file has two tracks in the same
-  language only the `default` one is extracted.
+- **Image-based subtitles** (PGS / Blu-ray, VobSub / DVD, DVB) are *skipped*—they're bitmaps and can't be converted to text without OCR.
+- **ASS/SSA styling** (fonts, positioning, karaoke) is lost when converting to SRT. This is a deliberate, predictable trade-off; the original file is never modified.
+- Stash stores one caption per language, so if a file has two tracks in the same language only one is extracted, preferring the track flagged `default`.
 
 ## Install
 
 1. Copy the `extractSubtitles` folder into your Stash `plugins` directory.
-2. Make sure `stashapp-tools` is available to the Python that Stash uses:
-   `pip install stashapp-tools` (or use the **PythonDepManager** community plugin
-   with the bundled `requirements.txt`).
-3. `ffmpeg`/`ffprobe` must be on `PATH`, or set their paths in Stash
-   **Settings → System → FFMPEG**.
+2. Make sure `stashapp-tools` is available to the Python that Stash uses: `pip install stashapp-tools` (or use the **PythonDepManager** community plugin with the bundled `requirements.txt`).
+3. `ffmpeg`/`ffprobe` must be on `PATH`, or set their paths in Stash **Settings → System → FFMPEG**.
 4. In Stash, **Settings → Plugins → Reload Plugins**.
 
 ## Use
 
-- **One-off / whole library:** **Settings → Tasks → Plugin Tasks →
-  Extract Embedded Subtitles → run.** This walks every scene, extracts subs,
-  then rescans the touched folders.
-- **Ongoing additions:** enable **Extract On Scan** in the plugin settings. New
-  scenes get their subtitles extracted on creation and their folder rescanned.
-  For a large *initial* import, prefer the one-off task instead of the hook to
-  avoid many small rescans.
+- **One-off / whole library:** **Settings → Tasks → Plugin Tasks → Extract Embedded Subtitles → run.** This walks every scene, extracts subs, then rescans the touched folders.
+- **Ongoing additions:** enable **Extract On Scan** in the plugin settings. New scenes get their subtitles extracted on creation and their folder rescanned. For a large *initial* import, prefer the one-off task instead of the hook to avoid many small rescans.
 
 ### Settings
 
@@ -62,12 +41,8 @@ naming the files so Stash's built-in caption matcher attaches them automatically
 
 ## Notes
 
-This is a community plugin and not part of Stash core; it complements, rather
-than replaces, in-player rendering of embedded tracks.
+This is a community plugin and not part of Stash core; it complements, rather than replaces, in-player rendering of embedded tracks.
 
 ## Development
 
-LLM-assisted, human-reviewed, and tested. Subtitle extraction was verified
-end-to-end against a real `ffmpeg`/`ffprobe` using a multi-track MKV (SubRip +
-ASS→SRT conversion + an untagged track), and the output filenames were checked
-against Stash's caption-matching rules in `pkg/file/video/caption.go`.
+LLM-assisted, human-reviewed, and tested. Subtitle extraction was verified end-to-end against a real `ffmpeg`/`ffprobe` using a multi-track MKV (SubRip + ASS→SRT conversion + an untagged track), and the output filenames were checked against Stash's caption-matching rules in `pkg/file/video/caption.go`.