Eight pieces of automation that replace the timeline editor, the captioning tool, the loudness pass, and the cross-poster. Configured once per streamer, then it runs every broadcast.
The recorder watches every channel registered in your dashboard and captures broadcasts in 12-minute MPEG-TS segments. We use streamlink under the hood, so quality follows your Twitch source โ 720p or 1080p, 30 or 60 fps, whatever you stream.
MPEG-TS is the format we picked deliberately: even a partial file (Twitch hiccup, server reboot, hard kill mid-segment) is still readable by ffprobe, so a single dropped segment never costs you the whole stream's worth of clips. A robust 3-level resolution probe handles edge cases like 720p sources where some metadata fields are missing.
Each recorded segment is transcribed by faster-whisper (small model, runs locally on the server โ no transcript ever leaves the box). Then a sliding-window scoring engine ranks every possible 15โ55s clip on a combination of signals: emotion-keyword density (RU + EN dictionaries), peak audio loudness, hook-phrase presence, anti-music penalty, and short-burst bonuses.
The scoring is absolute, not relative. A boring segment produces zero clips โ we never publish filler just to hit a quota. The threshold (MIN_CLIP_SCORE = 10, virality 0โ100) was tuned over four weeks of real broadcasts to match what creators actually wanted to share.
Before any GPU rendering happens, the top transcript is sent to DeepSeek for a 1โ10 quality gate. Anything that scores below 6 is skipped โ saves render time on segments where the speech is technically present but not interesting.
Word-level timing comes straight from Whisper. The renderer builds an ASS subtitle file where each word becomes a separate Dialogue line โ exactly one word is yellow at any instant, the rest are white, and the active word scales up by 25% for emphasis. No smooth fill (\kf), no sliding gradient โ just the snappy pop creators on TikTok already recognize.
Liberation Sans 72pt with hard outline is rendered directly into the video by ffmpeg, so the subtitles survive any re-encode the destination platform applies. The torn-box border style and disabled ScaledBorderAndShadow keep the text crisp on every device.
If profanity is detected mid-word, the word is masked with stars in the subtitle and silenced in the audio at the same timestamp โ both edits happen in the same render pass.
16:9 source video gets converted to 9:16 in one of three modes, configured per streamer:
Gameplay-only โ straight 9:16 center-crop of the full frame. Default for streamers whose webcam is small / cornered, or who don't use one at all.
Webcam-only โ for just-chatting and IRL. The largest possible 9:16 window centered on the streamer's face, with normalized region coordinates configured in the dashboard.
Webcam top-split โ webcam on top (40% / 768px), gameplay below (60% / 1152px). The webcam panel includes a face-zoom step (smart crop biased to the top 30%) so the streamer's face takes most of the panel rather than disappearing in negative space.
The first 1โ3 seconds of any short-form clip determine whether the viewer keeps watching. The pipeline picks each clip's start so that the peak emotion / loudness moment lands inside the first 20% of the clip duration, and any detected hook phrase ("look at this", "no way", "wait what") is positioned at the very beginning.
This means clips don't open with 8 seconds of "uh, so, basically" โ they open at the moment that already won the audience on the original stream.
Once the highlight is selected, its transcript is sent to DeepSeek with a tuned system prompt that asks for a short hook-style title (max 50 characters, no emoji), a 2โ3 sentence caption, and 4โ6 thematic hashtags drawn from the actual content of the clip โ not generic #viral #fyp spam.
The title is also burned into the video as a fade-out hook overlay (5 seconds or 40% of the clip length, whichever is shorter), so it's visible even on platforms that hide the caption by default.
Twitch source audio varies wildly โ a quiet voice-over followed by a loud explosion routinely jumps 20+ dB. The renderer runs every clip through EBU R128 loudness normalization (target โ16 LUFS) plus a soft-knee compressor and short audio fade-in / fade-out, so every clip lands in the band TikTok and YouTube Shorts expect.
Combined with CRF 23 video and 160 kbps audio, the output sits comfortably below the 50 MB ceiling reviewers care about, with automatic fallback to CRF 28 if the file would otherwise exceed 45 MB.
You connect TikTok, YouTube, and Instagram once via OAuth on the Postiz dashboard (postiz.webscenen8n.xyz). Every rendered clip is pushed to all of them through their official content APIs โ TikTok Content Posting API, YouTube Data API v3, Instagram Graph API.
Each integration owns only the credentials for the connected creator account. The pipeline never aggregates third-party content, never republishes clips between unrelated accounts. API quotas and rate limits are tracked per-platform; clips are queued and retried, not silently dropped.
Every rendered clip also goes to a Telegram channel of your choice โ handy for previewing the queue before it auto-publishes, or for archiving everything in a chat history.
Operational alerts (low disk, ffmpeg failure on a render, exception in the publishing step, recorder restart) are throttled to one per 10 minutes and pushed to the operator's Telegram so you find out within seconds when something needs attention, not at the end of the stream.
The how-it-works page walks through the same stages from a system-architecture angle.