The seam our tiled upscaler left on every 4K product render

#machinelearning #computervision #mlops #pytorch

TL;DR: We tile high-res images through our upscaler because a full 4096×4096 pass blows past 24GB of VRAM. For months every render had a faint cross down the middle. The fix was not a bigger GPU. It was admitting that hard tile boundaries break any model with a receptive field, and feathering the overlap with a raised-cosine weight instead of averaging it.

At Photoroom I work on the generative side, mostly diffusion for product photography. One of our smaller models is a convolutional upscaler that takes a 1024px cutout and pushes it to print resolution. Nothing exotic. A residual-in-residual dense block network, the kind of thing that has been around since ESRGAN in 2018.

It worked fine in the notebook. In production, on large images, it left a seam.

What a seam actually is

You cannot run a 4096×4096 image through this model on a single 24GB card. So you tile. Cut the image into 512px squares, upscale each, stitch them back. The naive version of this is three lines of code and it is wrong.

The reason is the receptive field. Every output pixel near a tile edge is computed from a partial neighborhood: the convolutions on the right edge of the left tile never saw the pixels that lived in the right tile. So the two halves disagreed by a small amount, maybe 2-3 grey levels, and the human eye is very good at finding a straight vertical line of consistent 2-3 level error. On a flat grey studio background it was obvious. On busy texture it hid.

We measured it. Sampling 200 renders, the mean absolute difference across the stitch line was 4.1 on an 8-bit scale, versus 0.9 for an adjacent non-seam column. Small number, very visible artifact.
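For reference, here is a minimal sketch of that kind of measurement. It assumes "difference across the stitch line" means the mean absolute difference between the two pixel columns on either side of it, on a greyscale image. `column_step_mad` is a hypothetical helper name, and the image is synthetic.

```python
import numpy as np

def column_step_mad(img: np.ndarray, x: int) -> float:
    """Mean absolute difference between columns x-1 and x of a 2D greyscale image."""
    # Cast first: subtracting uint8 arrays directly would wrap around.
    left = img[:, x - 1].astype(np.float64)
    right = img[:, x].astype(np.float64)
    return float(np.abs(right - left).mean())

# Synthetic example: flat grey with a 3-level step at column 8.
img = np.full((16, 16), 128, dtype=np.uint8)
img[:, 8:] += 3

seam = column_step_mad(img, 8)     # across the stitch line
control = column_step_mad(img, 4)  # an ordinary pair of columns
```

Comparing the seam column against a nearby control column matters, because real images have a nonzero gradient everywhere.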

Overlap is necessary but not sufficient

The first fix everyone reaches for is overlapping tiles. Take 512px tiles but step by 448, so each pair shares a 64px strip. Then in the shared region you have two predictions and you blend them.
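The tile positions along one axis can be sketched as below. `tile_origins` is a hypothetical helper, and pinning the last tile flush against the far edge (so it may overlap its neighbor by more than 64px) is an assumption about edge handling, not a description of our pipeline.

```python
def tile_origins(length: int, tile: int = 512, stride: int = 448) -> list[int]:
    """Start offsets along one axis so that tiles of size `tile` cover `length`."""
    if length <= tile:
        return [0]  # a single tile already covers the axis
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the far edge
    return origins

xs = tile_origins(4096)
```

The same list is used for rows and columns, and each (y, x) pair is one tile.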

The nuance here is how you blend. If you average the overlap with a flat 0.5/0.5 weight, you have moved the discontinuity, not removed it. The blend region now has a soft step at each of its two edges where the weighting suddenly kicks in. Better than before. Still a seam, just blurrier.

What works is a weight that goes smoothly to zero at the tile border, so a pixel contributes nothing exactly where its receptive field ran out. A raised-cosine (Hann) window does this. Each tile is multiplied by its window, the windows are accumulated, and you divide by the summed weight.

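import torch

def hann_2d(size: int, overlap: int) -> torch.Tensor:
    """2D blend window: ramp up over the overlap, flat in the middle, ramp down."""
    w = torch.ones(size)
    # Drop the zero endpoint of the symmetric Hann window so the outermost
    # pixel keeps a small positive weight. With an exact zero, rows and columns
    # on the image border (covered by one tile only) would end up with zero
    # total weight and come out black after the final division.
    ramp = torch.hann_window(2 * overlap + 2, periodic=False)[1:overlap + 1]
    w[:overlap] = ramp
    w[-overlap:] = ramp.flip(0)
    return w[:, None] * w[None, :]   # outer product -> 2D

def blend_tile(canvas, weight, tile, win, y, x):
    # tile is the upscaled output, so win, y and x are in output pixels
    h, w = tile.shape[-2:]
    canvas[..., y:y+h, x:x+w] += tile * win   # weighted prediction
    weight[..., y:y+h, x:x+w] += win          # running sum of weights
    # caller does canvas / weight.clamp_min(1e-8) at the end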
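
To show how the pieces fit together, here is a self-contained sketch of the full loop. Everything named here is an assumption for illustration: `tiled_upscale`, `ramp_window` and `starts` are hypothetical helpers, the window is rebuilt inline with a ramp that stays strictly positive so border pixels keep a nonzero weight, and the "model" in the usage lines is a nearest-neighbor stand-in rather than the real upscaler. It also assumes the image is at least one tile wide and tall.

```python
import torch
import torch.nn.functional as F

def ramp_window(size: int, overlap: int) -> torch.Tensor:
    # Raised-cosine ramps at both ends, flat in the middle, strictly positive.
    w = torch.ones(size)
    ramp = torch.hann_window(2 * overlap + 2, periodic=False)[1:overlap + 1]
    w[:overlap] = ramp
    w[-overlap:] = ramp.flip(0)
    return w[:, None] * w[None, :]

def starts(length: int, tile: int, stride: int) -> list[int]:
    # Tile offsets along one axis; the last tile is flush with the far edge.
    if length <= tile:
        return [0]
    return list(range(0, length - tile, stride)) + [length - tile]

@torch.no_grad()
def tiled_upscale(model, img, scale: int, tile: int = 512, overlap: int = 64):
    # img: (C, H, W) float tensor. model: (1, C, h, w) -> (1, C, h*scale, w*scale).
    c, h, w = img.shape
    canvas = torch.zeros(c, h * scale, w * scale, device=img.device)
    weight = torch.zeros(1, h * scale, w * scale, device=img.device)
    # The window lives in output pixels, so tile size and overlap are scaled.
    win = ramp_window(tile * scale, overlap * scale).to(img.device)
    for y in starts(h, tile, tile - overlap):
        for x in starts(w, tile, tile - overlap):
            out = model(img[None, :, y:y + tile, x:x + tile])[0]
            oy, ox = y * scale, x * scale
            th, tw = out.shape[-2:]
            canvas[:, oy:oy + th, ox:ox + tw] += out * win[:th, :tw]
            weight[:, oy:oy + th, ox:ox + tw] += win[:th, :tw]
    # Normalize by the accumulated weight so overlaps average out correctly.
    return canvas / weight.clamp_min(1e-8)

# Usage with a stand-in model (nearest-neighbor 2x) on a small random image.
standin = lambda t: F.interpolate(t, scale_factor=2, mode="nearest")
small = torch.rand(3, 40, 56)
stitched = tiled_upscale(standin, small, scale=2, tile=16, overlap=4)
```

A purely pointwise stand-in like this has no receptive field, so the stitched result should match a single full pass. That makes it a handy sanity check for the bookkeeping before plugging in a real network.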

After switching to this, the seam difference dropped from 4.1 to 1.0, statistically indistinguishable from the 0.9 of a normal column. Same model weights. Same GPU. Just honest about where each tile's information ends.

Catching it before customers do

The annoying part was that nobody noticed the seam for a while because our eval set was mostly 1024px crops that never tiled. The artifact only existed at the resolution we did not test.

So we built a regression check on full-size output. For each render we compute the per-column mean absolute gradient and flag any column whose value spikes above its neighbors by more than 3x at a known tile boundary. Cheap, deterministic, runs on CPU.
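A rough sketch of that check, under stated assumptions: the image is a 2D greyscale array, "neighbors" means the median gradient of the few columns on either side of the boundary, and `seam_flags`, `radius` and `eps` are hypothetical names and parameters chosen for illustration. Horizontal seams can be checked the same way by passing the transposed image.

```python
import numpy as np

def seam_flags(img, boundaries, ratio=3.0, radius=4, eps=1e-6):
    """Return the boundary columns whose gradient exceeds ratio x the local median."""
    # g[i] is the mean absolute step between column i and column i + 1.
    g = np.abs(np.diff(img.astype(np.float64), axis=1)).mean(axis=0)
    flagged = []
    for b in boundaries:
        i = b - 1  # the step from column b-1 to column b
        lo, hi = max(0, i - radius), min(len(g), i + radius + 1)
        neighbors = np.delete(g[lo:hi], i - lo)  # local window without the seam itself
        if g[i] > ratio * (np.median(neighbors) + eps):
            flagged.append(b)
    return flagged

# Synthetic example: noisy grey image with a 10-level step at column 32.
rng = np.random.default_rng(0)
demo = rng.normal(128.0, 2.0, size=(64, 64))
demo[:, 32:] += 10.0
hits = seam_flags(demo, boundaries=[16, 32, 48])
```

The `eps` term keeps perfectly flat regions, where every gradient is zero, from triggering false positives.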

For the fuzzier cases (texture seams, slight color drift) we run a vision-language model over a sample of outputs and ask it to describe any visible discontinuity. Those calls go through a gateway, Bifrost, which is one of a few ways we keep provider config and rate limits in one place instead of scattered across scripts. The numeric check catches the obvious ones; the VLM catches the ones a metric misses.

Comparison

Strategy	Seam MAD (8-bit)	VRAM (4K)	Extra compute
Single pass	0	~31 GB (OOM on 24GB)	baseline
Hard tiles, no overlap	4.1	6 GB	none
Overlap + flat average	2.3	7 GB	+14%
Overlap + Hann window	1.0	7 GB	+16%

Trade-offs and Limitations

Overlap is not free. A 64px overlap on 512px tiles means the shared strips are processed more than once; in the comparison above that shows up as roughly 16% extra compute, so throughput drops by about that much. Wider overlap blends better and costs more, and past ~96px we saw no further quality gain, only the bill.

Hann windowing assumes the two predictions in the overlap are both reasonable and close. They usually are for this upscaler. For a diffusion model with stochastic sampling per tile they can diverge enough that blending produces a ghost, and you need a shared noise seed or latent-space tiling instead.
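
One reading of "shared noise" can be sketched as follows: draw a single noise tensor for the whole canvas from one seeded generator and slice it per tile, so overlapping tiles start from identical noise in the region they share. This is an assumption about how you might wire it up, not a description of a specific sampler, and `tile_noise` is a hypothetical helper.

```python
import torch

def tile_noise(full_noise: torch.Tensor, y: int, x: int, tile: int) -> torch.Tensor:
    # Slice the tile's noise out of the canvas-sized tensor instead of resampling.
    return full_noise[..., y:y + tile, x:x + tile]

gen = torch.Generator().manual_seed(0)
full = torch.randn(4, 128, 128, generator=gen)  # one draw for the whole canvas

left = tile_noise(full, 0, 0, 64)    # covers columns 0..63
right = tile_noise(full, 0, 48, 64)  # covers columns 48..111
shared_left = left[..., 48:]         # canvas columns 48..63, seen from the left tile
shared_right = right[..., :16]       # the same columns, seen from the right tile
```

Identical starting noise does not guarantee identical outputs, since each tile still sees different context, but it removes one source of divergence in the overlap.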

This also does nothing for semantic seams, where two tiles hallucinate different details. Window blending fixes geometry and color continuity, not content disagreement. That is a harder problem and the honest answer is you tile in latent space or you do not tile at all.