Building a video editing pipeline for my YouTube channel
I recently started a YouTube channel where I review ice cream and tiramisù in different places I visit around the world. From the start, I wanted to automate all the boring editing, partly because I’m not used to it and learning proper editing tools would take me significantly longer.
As a software developer, I found it easier to just build the pipeline myself with Claude’s help.
Audio processing
On the audio side I wanted to apply the following:
- Noise attenuation and voice enhancement, because for now I record the videos on the spot with just my iPhone and no professional microphone
- Automatically replace swear words with bleeps. This was more of a fun exercise to test how good ML models are at doing this than a real need.
This is the final script I produced:
uv run process_video.py input.mov -s 0.8 --bleep-word shit
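The -s 0.8 parameter sets the denoising strength, meaning how much of the enhanced signal is blended with the original (0 keeps the original audio, 1 uses only the enhanced audio). The --bleep-word parameter sets which words to bleep automatically.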
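To make the command line above concrete, here is a minimal sketch of how these two flags could be parsed with argparse. Only `-s` and `--bleep-word` come from the command shown; the long name `--strength`, the default value, the repeatable flag and the function names are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="process_video.py")
    parser.add_argument("input", type=Path, help="video file to process")
    # Assumption: the long option name and the default are hypothetical.
    parser.add_argument("-s", "--strength", type=float, default=1.0,
                        help="denoise blend, 0.0 (original) to 1.0 (enhanced)")
    # Assumption: the flag can be repeated, one word per occurrence.
    parser.add_argument("--bleep-word", action="append", dest="bleep_words",
                        default=None, help="word to bleep")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if not 0.0 <= args.strength <= 1.0:
        raise SystemExit("strength must be between 0.0 and 1.0")
    # Lowercase so matching against the transcript is case-insensitive.
    args.bleep_words = {w.lower() for w in (args.bleep_words or [])}
    return args
```

Calling `parse_args(["input.mov", "-s", "0.8", "--bleep-word", "shit"])` yields a strength of 0.8 and a one-element set of words to bleep.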
Noise attenuation and voice enhancement
I looked for a local ML model that could leverage MLX to take advantage of Apple Silicon, so I wrote this part as a Python script.
mlx-audio is a suite of audio processing tools that can take care of all of this while performing well thanks to hardware acceleration.
I always rely on uv as a Python package manager and virtual environment manager.
Here is the full function that performs the following pipeline:
- Extract the audio track with ffmpeg
- Run the ML model to apply the enhancements
- Normalize the audio track to keep a good peak volume
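from pathlib import Path

import numpy as np

# DeepFilterNetModel and audio_io come from mlx-audio; _run is a small helper
# defined elsewhere in the script that executes a command.


def denoise_audio(
    input_path: Path, strength: float, audio_in: str, audio_out: str,
    length: float | None = None,
) -> str:
    # 1. Extract a mono 48 kHz audio track, optionally limited to `length` seconds.
    print("[denoise] Extracting audio...")
    cmd = ["ffmpeg", "-y", "-loglevel", "error", "-i", str(input_path),
           "-vn", "-ar", "48000", "-ac", "1"]
    if length is not None:
        cmd += ["-t", str(length)]
    cmd.append(audio_in)
    _run(cmd)

    # 2. Enhance the audio, then blend it with the original according to `strength`.
    print("[denoise] Running model...")
    model = DeepFilterNetModel.from_pretrained()
    original, sr = audio_io.read(audio_in, always_2d=False, dtype="float32")
    enhanced = model.enhance_array(original)
    blended = np.clip(strength * enhanced + (1.0 - strength) * original, -1.0, 1.0)

    # 3. Peak-normalize just below full scale (skipped for pure silence).
    peak = np.max(np.abs(blended))
    if peak > 0:
        blended = blended / peak * 0.99
        print("[denoise] Normalized.")

    audio_io.write(audio_out, blended, sr)
    print("[denoise] Done.")
    return audio_out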
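The blend and normalization steps in the function above are easier to follow on a handful of samples. This is a dependency-free sketch of the same arithmetic on plain Python lists; `blend_and_normalize` and `target_peak` are hypothetical names, and the real function works on NumPy arrays.

```python
def blend_and_normalize(
    original: list[float],
    enhanced: list[float],
    strength: float,
    target_peak: float = 0.99,
) -> list[float]:
    """Mix original and enhanced samples, then rescale to the target peak."""
    if len(original) != len(enhanced):
        raise ValueError("both signals must have the same length")
    # Weighted mix: strength=0 keeps the original, strength=1 keeps the enhanced signal.
    blended = [strength * e + (1.0 - strength) * o
               for o, e in zip(original, enhanced)]
    # Clamp to the valid sample range, mirroring np.clip in the function above.
    blended = [max(-1.0, min(1.0, x)) for x in blended]
    peak = max((abs(x) for x in blended), default=0.0)
    if peak == 0.0:
        return blended  # silence: nothing to scale
    return [x / peak * target_peak for x in blended]
```

Because the two weights add up to 1, the mix stays within [-1, 1] whenever both inputs do, so the clamp only matters if the model output overshoots.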
Automatic bleep of swear words
To find the words to bleep I’m running Whisper, which transcribes the audio with word-level timestamps.
def _find_word_segments(
    audio_path: str,
    target_words: set[str],
    model_name: str,
    pad: float = 0.05,
) -> list[tuple[float, float]]:
    """Return padded (start, end) times, in seconds, of every target word."""
    from mlx_audio.stt import load as stt_load
    from mlx_audio.stt.utils import load_audio

    print(f"[bleep] Loading Whisper model: {model_name}")
    model = stt_load(model_name)

    # MLX community models often omit processor files; fall back to the matching openai/ model.
    if getattr(model, "_processor", None) is None:
        from transformers import WhisperProcessor

        base = model_name.split("/")[-1]  # e.g. "whisper-large-v3-turbo"
        fallback = f"openai/{base}"
        print(f"[bleep] Processor missing, loading from {fallback}...")
        model._processor = WhisperProcessor.from_pretrained(fallback)

    print("[bleep] Running transcription with word timestamps...")
    audio_mx = load_audio(audio_path)
    result = model.generate(audio_mx, word_timestamps=True, language="en", verbose=False)

    segments: list[tuple[float, float]] = []
    for seg in (result.segments or []):
        for w in seg.get("words", []):
            # Normalize the token: trim whitespace, lowercase, drop surrounding punctuation.
            word_clean = w["word"].strip().lower().strip("\"'.,!?;:-")
            if word_clean in target_words:
                # Widen the span by `pad` on both sides, never starting before 0.
                segments.append((max(0.0, w["start"] - pad), w["end"] + pad))
    return segments
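The function above only returns the time ranges; the step that actually writes the bleep is not shown. One possible way to do it, sketched here under assumptions, is to merge overlapping ranges (padding can make neighbouring words overlap) and overwrite those samples with a sine tone. `merge_segments`, `apply_bleeps`, the 1 kHz frequency and the gain are all hypothetical choices, not taken from the original script.

```python
import math


def merge_segments(segments: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Merge overlapping or touching (start, end) ranges."""
    merged: list[tuple[float, float]] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def apply_bleeps(
    samples: list[float],
    sample_rate: int,
    segments: list[tuple[float, float]],
    freq: float = 1000.0,  # assumption: classic 1 kHz censor tone
    gain: float = 0.3,     # assumption: tone amplitude
) -> list[float]:
    """Return a copy of `samples` with every segment replaced by a sine tone."""
    out = list(samples)
    for start, end in merge_segments(segments):
        first = max(0, int(start * sample_rate))
        last = min(len(out), int(end * sample_rate))
        for i in range(first, last):
            out[i] = gain * math.sin(2.0 * math.pi * freq * i / sample_rate)
    return out
```

Replacing the samples outright, rather than mixing the tone on top, guarantees the word itself is no longer audible.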
Video processing
On the video side, I wanted to add an overlay showing:
- the rating, from 1 to 5, with steps of 0.5
- a map pin icon
- the name of the place
- the city and country
- an icon to determine if the video is about ice cream or tiramisù
I chose to use the Super Mario stars as the rating, so a rating of 4.5 equals 4 full stars and 1 half star. For the location I’m using the classic map pin, and for the icons I asked Gemini to generate them based on the initial avatar I created for the website.
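The mapping from a rating to stars can be sketched as a small helper. `split_rating` is a hypothetical name, and returning a count of empty stars assumes the overlay always reserves room for five stars, which the post does not state.

```python
def split_rating(rating: float, max_stars: int = 5) -> tuple[int, int, int]:
    """Split a rating in steps of 0.5 into (full, half, empty) star counts."""
    doubled = rating * 2
    if not 1 <= rating <= max_stars or doubled != int(doubled):
        raise ValueError("rating must be between 1 and 5 in steps of 0.5")
    full, half = divmod(int(doubled), 2)
    # Assumption: unused slots are drawn as empty stars.
    return full, half, max_stars - full - half
```

For example, `split_rating(4.5)` gives 4 full stars, 1 half star and no empty ones.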
I created a simple UI to facilitate the process rather than calling the CLI manually, with the advantage of seeing the rendered overlay instantly.
Behind the scenes, I use canvas (node-canvas) to render the various elements of the overlay and ffmpeg to combine the rendered overlay with the video, frame by frame.
import { spawn, execSync } from "node:child_process";
import { resolve } from "node:path";
import { createCanvas, loadImage } from "canvas";

// W, H, FPS, BLUE, ROOT, inputVideo, ffArgs, containerY, containerH, frame0Buf
// and overlayOpacity() are defined elsewhere in the script.

// canvas rendering of overlay elements
function renderFrame(f) {
  const opacity = overlayOpacity(f);
  const canvas = createCanvas(W, H);
  // fully transparent frame: return the empty canvas without drawing
  if (opacity <= 0) return canvas.toBuffer("raw");
  const ctx = canvas.getContext("2d");
  ctx.save();
  ctx.globalAlpha = opacity;
  // background
  ctx.fillStyle = BLUE;
  ctx.fillRect(0, containerY, W, containerH);
  // render all the other elements ...
  return canvas.toBuffer("raw");
}

// spawn ffmpeg and pipe the overlay canvas
const ff = spawn("ffmpeg", ffArgs, { stdio: ["pipe", "inherit", "pipe"] });

// find video duration
const probeJson = execSync(
  `ffprobe -v quiet -print_format json -show_streams "${resolve(ROOT, inputVideo)}"`,
  { encoding: "utf8" },
);
const videoStream = JSON.parse(probeJson).streams.find(
  (s) => s.codec_type === "video",
);
const TOTAL_FRAMES = Math.ceil(parseFloat(videoStream.duration) * FPS);

// write one raw frame per video frame, waiting whenever ffmpeg's stdin is full
for (let f = 0; f < TOTAL_FRAMES; f++) {
  const buf = f === 0 && frame0Buf ? frame0Buf : renderFrame(f);
  const ok = ff.stdin.write(buf);
  if (!ok) await new Promise((r) => ff.stdin.once("drain", r));
}
ff.stdin.end(); // signal end of input so ffmpeg can finalize the output
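The frame count above depends on the video stream reporting a duration, which not every container does. As a hedged sketch, written in Python like the audio half of the pipeline, the same calculation can fall back to the container-level duration. `total_frames` is a hypothetical helper, and the fallback assumes ffprobe is also called with `-show_format`, which the command above does not do.

```python
import json
import math


def total_frames(probe_json: str, fps: float) -> int:
    """Return how many overlay frames are needed to cover the whole video."""
    probe = json.loads(probe_json)
    video = next(
        (s for s in probe.get("streams", []) if s.get("codec_type") == "video"),
        None,
    )
    if video is None:
        raise ValueError("no video stream found")
    # Assumption: the "format" section is only present with -show_format.
    duration = video.get("duration") or probe.get("format", {}).get("duration")
    if duration is None:
        raise ValueError("no duration reported by ffprobe")
    # Round up so the last partial frame interval is still covered.
    return math.ceil(float(duration) * fps)
```

Failing loudly when no duration is found avoids silently rendering zero frames.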
This is the final result:
This pipeline saves me a lot of time on every video and I have plans to further integrate it with a website where I will collect all the reviews.