What does a depth map ControlNet control?

It controls the spatial layout and three-dimensional composition of the scene — what is near, what is far, and the overall sense of volume and perspective. Unlike Canny, it does not constrain fine edges, so the model has freedom in surface detail while keeping your scene structure.

MiDaS vs LeReS vs ZoeDepth — which estimator?

MiDaS is the fast, reliable default for general scenes. LeReS captures finer foreground detail and sharper depth boundaries, useful for portraits and objects. ZoeDepth produces metric (real-scale) depth, which helps when true distances matter, not just which object is in front. Start with MiDaS and switch only if depth separation looks wrong.

How is a depth map read?

It is a grayscale image where brightness encodes distance — by convention lighter pixels are closer to the camera and darker pixels are farther away. Checking the map before generating tells you whether the foreground and background have separated correctly.

Can I combine depth with other ControlNets?

Yes, and it is common. Depth handles overall scene volume while another module adds specifics — pair it with Canny to keep both composition and edges, or with OpenPose to place a posed figure correctly within a 3D scene. Lower each module's strength when stacking so they do not over-constrain the result.

What is the Depth Map ControlNet Guide?

It is a guide to depth map ControlNet. It compares the MiDaS, LeReS, and ZoeDepth estimators, explains depth map visualization and strength settings, and shows how to combine depth control with other ControlNet modules. It runs free in your browser on Gera Tools, with nothing uploaded.

Depth Map ControlNet Guide

Name: Depth Map ControlNet Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

Depth map ControlNet guide

A depth map ControlNet controls the spatial composition of an image — the sense of near and far, of volume and perspective — without locking down fine edges. It is the right tool when you want to preserve a scene’s three-dimensional layout but give the model freedom over surface detail. The two choices that drive quality are the depth estimator you use to build the map and the control strength you apply.

How it works

A depth estimator analyzes the source image and outputs a grayscale depth map: lighter pixels are nearer the camera, darker pixels are farther away. MiDaS is the fast, general-purpose default; LeReS resolves finer foreground detail and crisper depth boundaries; ZoeDepth produces metric-scale depth for cases where true distances matter. ControlNet feeds this map to the diffusion model, which builds a new image whose spatial structure matches the map. Control strength scales how strictly that structure is enforced.
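The grayscale convention described above can be sketched in a few lines. This is a minimal illustration, not the code any particular preprocessor runs: it assumes the estimator's raw output is a 2-D grid of numbers, and the function name `to_control_map` and the `larger_is_nearer` flag are hypothetical. Relative estimators typically report larger values for nearer surfaces, while metric depth reports distance (larger means farther), so check which applies to your estimator. In practice you would do this with NumPy or Pillow rather than nested lists.

```python
def to_control_map(raw, larger_is_nearer=True):
    """Rescale a 2-D grid of raw depth values to 0-255, with lighter = nearer.

    raw: list of rows of numbers (hypothetical estimator output).
    larger_is_nearer: set to False for distance-style output (larger = farther).
    """
    flat = [v for row in raw for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A perfectly flat input carries no depth information: return mid-gray.
        return [[128 for _ in row] for row in raw]

    out = []
    for row in raw:
        scaled = [(v - lo) / span for v in row]        # 0.0 .. 1.0
        if not larger_is_nearer:
            scaled = [1.0 - s for s in scaled]         # flip so near is bright
        out.append([round(s * 255) for s in scaled])
    return out


# Distances in metres: the 1 m pixel should come out white, the 3 m pixel black.
print(to_control_map([[1.0, 3.0]], larger_is_nearer=False))
```

Because the map is rescaled to the 0-255 range, only the ordering and relative spacing of depths survive; absolute scale does not.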

Comparing the three depth estimators

MiDaS

MiDaS is the most widely used default for depth ControlNet. It produces smooth, reliable relative depth maps across a broad range of scene types — portraits, landscapes, interiors, product shots. The depth values are relative (no real-world scale), meaning a near object and a far object are separated correctly in the grayscale, but the absolute distances are not calibrated.

Best for: general scenes, quick testing, any case where you just need foreground and background to separate cleanly.

Weakness: tends to smooth out depth boundaries between objects that are at similar distances. A person standing only a few metres in front of a background wall may not separate well from it.

LeReS (Learning to Recover 3D Scene Shape)

LeReS is designed to recover fine relative depth and sharper boundaries between nearby objects. It tends to produce crisper transitions at object edges and better separation of subjects from backgrounds, which is particularly useful for portraits, product photography references, and architectural interior scenes.

Best for: portraits, objects with defined depth against backgrounds, architectural interiors.

Weakness: slower than MiDaS and sometimes introduces artefacts at complex depth boundaries.

ZoeDepth

ZoeDepth is a metric depth estimator: it produces depth values calibrated to real-world scale in metres. This is only meaningful if your generation task cares about true distances rather than just depth ordering (for example, placing objects at physically plausible positions in a room). For most creative image generation, metric accuracy is irrelevant.

Best for: technical visualisations, scene reconstruction, cases where true scale matters.

Weakness: slowest of the three; overkill for most creative depth control tasks.
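The comparison above boils down to a short decision rule. The sketch below only encodes that rule; the function name and its two flags are hypothetical and do not correspond to any setting in a real UI.

```python
def pick_estimator(needs_true_scale=False, needs_crisp_boundaries=False):
    """Return the estimator this guide suggests for the given requirements."""
    if needs_true_scale:
        # Technical visualisation or scene reconstruction: metric depth.
        return "ZoeDepth"
    if needs_crisp_boundaries:
        # Portraits, products, interiors: sharper subject/background separation.
        return "LeReS"
    # General scenes and quick tests: the fast default.
    return "MiDaS"


print(pick_estimator())                             # general scene
print(pick_estimator(needs_crisp_boundaries=True))  # portrait reference
```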

Reading and validating your depth map

Before using a depth map as ControlNet input, inspect it visually:

Good separation: the foreground subject appears clearly lighter than the background, with a visible gradient between them. Different depth layers have distinguishable gray values.
Depth inversion: occasionally estimators misread the scene and place the background lighter than the foreground (inverted depth). You will see this as a flat or confused layout in the generated image. Solution: try a different estimator or adjust the image crop.
Flat maps: if everything in the scene is the same approximate distance from the camera (a flat wall, for example), the depth map will be uniformly gray and will provide almost no useful composition control. In these cases, consider whether depth control is actually what you need.
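The visual checks above can be roughed out numerically. The sketch below is a heuristic under stated assumptions, not a substitute for looking at the map: it assumes the map is a grid of 0-255 values, that the subject sits roughly in the centre of the frame, and that a spread below 30 gray levels counts as flat. The function name, the threshold, and the centre-versus-border test are all hypothetical choices.

```python
def inspect_depth_map(gray, flat_threshold=30):
    """Return simple diagnostics for a depth map given as rows of 0-255 values."""
    values = sorted(v for row in gray for v in row)
    n = len(values)
    # Spread between the 5th and 95th percentile, so stray pixels are ignored.
    p5 = values[int(0.05 * (n - 1))]
    p95 = values[int(0.95 * (n - 1))]
    spread = p95 - p5

    # Split pixels into the central half of the frame and the surrounding border.
    h, w = len(gray), len(gray[0])
    top, bottom = h // 4, h - h // 4
    left, right = w // 4, w - w // 4
    centre, border = [], []
    for y, row in enumerate(gray):
        for x, v in enumerate(row):
            inside = top <= y < bottom and left <= x < right
            (centre if inside else border).append(v)

    centre_mean = sum(centre) / len(centre)
    border_mean = sum(border) / len(border) if border else centre_mean
    return {
        "spread": spread,
        "is_flat": spread < flat_threshold,
        # A centred subject that is darker than its surroundings hints at inversion.
        "centre_darker_than_border": centre_mean < border_mean,
    }


# 8x8 map: bright 4x4 subject (220) on a dark background (40).
good = [[220 if 2 <= y < 6 and 2 <= x < 6 else 40 for x in range(8)] for y in range(8)]
print(inspect_depth_map(good))
```

A flagged map is only a prompt to look more closely; an off-centre subject or a scene with a near foreground at the frame edge will trip the centre test without being inverted.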

Control strength: a practical guide

Depth ControlNet strength is typically set somewhere between 0.0 and 1.5; depending on the UI, the slider tops out at 1.0 or 2.0:

Strength range	Effect
0.3–0.5	Loose guidance; composition hints at depth but varies freely
0.6–0.8	Moderate enforcement; recommended starting range for most uses
0.9–1.0	Strong enforcement; composition is closely maintained
Above 1.0	Very strict; can introduce stiffness and artefacts

Start at 0.7 for most depth control applications and adjust based on whether you want more creative freedom (lower) or stricter composition preservation (higher).
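As a quick reference, the table can be expressed as a lookup. This is a sketch of the bands listed above and nothing more: the function name is hypothetical, and how it treats values the table does not list (below 0.3, or in the gaps such as 0.55) is an assumption, with gap values assigned to the next band up.

```python
def strength_band(strength):
    """Map a depth ControlNet strength to the band names used in the table."""
    if strength < 0:
        raise ValueError("strength must be non-negative")
    if strength <= 0.5:
        return "loose"        # composition hints at depth but varies freely
    if strength <= 0.8:
        return "moderate"     # recommended starting range
    if strength <= 1.0:
        return "strong"       # composition is closely maintained
    return "very strict"      # can introduce stiffness and artefacts


for s in (0.4, 0.7, 1.0, 1.3):
    print(s, strength_band(s))
```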

Stacking depth with other ControlNet modules

Depth and other modules are often combined. Common pairings:

Depth + Canny: depth controls overall scene volume and spatial layout; Canny adds fine edge structure. Lower both to 0.5–0.6 when stacking to avoid over-constraining the model.
Depth + OpenPose: places a posed figure correctly within a 3D scene. Depth handles the spatial context; OpenPose handles the body.
Depth + tile: useful for upscaling with preserved composition and consistent depth.

When combining multiple ControlNets, reduce each module’s strength. The constraints compound: every module adds its own guidance on top of the others, so modules that each seem reasonable at 0.8 individually can produce stiff, over-controlled results when stacked.
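One way to apply the "lower each strength when stacking" advice consistently is to scale the single-module strengths down to a shared budget. The sketch below is a rule of thumb, not something ControlNet itself prescribes: the function name and the default budget of 1.2 are hypothetical, chosen only because they turn two modules at 0.8 into two at 0.6, which matches the range suggested above.

```python
def stacked_strengths(single, budget=1.2):
    """Scale per-module strengths so their total does not exceed `budget`.

    single: dict of module name -> strength you would use for it on its own.
    budget: hypothetical cap on the summed strength when stacking.
    """
    total = sum(single.values())
    if len(single) < 2 or total <= budget:
        # Nothing to rebalance for a single module or an already modest stack.
        return dict(single)
    factor = budget / total
    return {name: round(s * factor, 2) for name, s in single.items()}


print(stacked_strengths({"depth": 0.8, "canny": 0.8}))
```

Treat the result as a starting point and adjust each module by eye; if one module matters more for the shot, give it the larger share of the budget.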

Tips for spatial control

Start with MiDaS. It is reliable for most scenes; only switch to LeReS or ZoeDepth when foreground detail or true scale matters.
Inspect the map first. If foreground and background blur into the same gray, the model will not separate them either — try a different estimator or re-crop the source.
Use moderate strength for natural depth. High strength enforces layout rigidly; mid strength keeps the composition while letting the scene breathe.
Stack with other modules at reduced strength. Combine depth with Canny for composition plus edges, or with OpenPose to seat a figure correctly in 3D — lower each strength when stacking so they do not over-constrain the result.