What does IP-Adapter actually do?

IP-Adapter conditions a Stable Diffusion generation on a reference image rather than only text. It encodes the reference and injects those features so the output borrows the reference's style, subject, or facial identity depending on the model variant and weight.

Which IP-Adapter model should I use?

Use the Base or Plus model for general style and content, Plus-Face for transferring a specific face, and the Light model when you want a gentle stylistic nudge without overpowering your prompt. Plus variants capture more detail than the base.

What weight should I set?

Around 0.4 to 0.6 transfers style while leaving room for your text prompt. 0.7 to 1.0 makes the reference dominate, useful for faces or strong content transfer. Below 0.4 is a subtle hint. Always tune to your specific images.

Can I use IP-Adapter with ControlNet?

Yes, and it is a common combination. IP-Adapter handles style or identity from a reference image while ControlNet locks pose, depth, or composition from a separate control image. Lower the IP-Adapter weight slightly when stacking so the two don't fight.

What is the IP-Adapter Style Reference Guide?

It is a guide to IP-Adapter models for Stable Diffusion image-prompt conditioning: choosing between Base, Plus, Plus-Face, and Light variants, picking a weight for style versus content transfer, and combining IP-Adapter with ControlNet. It runs free in your browser on Gera Tools, with nothing uploaded.

IP-Adapter Style Reference Guide

Name: IP-Adapter Style Reference Guide
Creator: Gera Tools
License: https://creativecommons.org/licenses/by/4.0/

IP-Adapter lets Stable Diffusion take an image as a prompt alongside your text, borrowing the reference’s style, content, or facial identity. The results hinge on two choices: which model variant you load and what weight you set. This guide recommends both based on your transfer goal, and explains how to combine IP-Adapter with ControlNet for the most control.

How IP-Adapter works

Traditional text-to-image generation conditions entirely on text embeddings. IP-Adapter adds a parallel image conditioning path: your reference image is passed through a CLIP image encoder (not the text encoder) to produce image embeddings, which are then injected into the cross-attention layers of the diffusion U-Net alongside the text embeddings.

The key advantage over earlier approaches is the decoupled architecture: the image features get their own key and value projections in cross-attention instead of sharing the ones used for text. The image branch’s output is scaled by the weight and added to the text branch’s output, so text and image prompts can each operate at their intended strength. Because only the adapter is trained and the base model stays frozen, swapping the IP-Adapter model doesn’t require retraining the base model.
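The decoupling described above can be sketched in plain Python. This is a toy illustration, not IP-Adapter's actual implementation: the function names are hypothetical, the vectors are tiny hand-written lists, and the keys and values are assumed to be already projected (in the real model, text and image features each pass through their own learned projections first). It assumes the two attention outputs are summed, with the image branch multiplied by the weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def decoupled_cross_attention(queries, text_kv, image_kv, image_scale):
    """Text attention plus a separately weighted image attention branch."""
    text_out = attention(queries, *text_kv)    # text keys/values
    image_out = attention(queries, *image_kv)  # image keys/values (separate branch)
    return [[t + image_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]

# Toy inputs: two query positions, two text tokens, one image token.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
image_kv = ([[0.5, 0.5]], [[10.0, 20.0]])

text_only = decoupled_cross_attention(queries, text_kv, image_kv, 0.0)
blended = decoupled_cross_attention(queries, text_kv, image_kv, 0.5)
```

With `image_scale` set to 0.0 the result is exactly the text-only attention output; raising it adds more of the image branch without touching the text branch, which is why the weight behaves like a volume knob for the reference.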

Model variants: which to choose

Variant	Best for	Notes
Base	General style and content	Good starting point; medium detail capture
Plus	Detailed style transfer	Higher fidelity; more faithful to reference
Plus-Face	Facial identity transfer	Tuned specifically for face features
Light	Gentle stylistic nudge	Weaker image influence; leaves text prompt in control
SDXL variants	Any goal on SDXL	Match the variant to your goal; SDXL generates at a higher native resolution (1024×1024 versus 512×512 for SD 1.5)

For portraits where the person’s face needs to be recognizable, use Plus-Face. For aesthetic style transfer — “make my image look like this painting” — use Plus. For everything else, Base is a solid default.

Weight settings and what they do

The weight controls how loudly the reference image speaks relative to your text prompt:

0.1 – 0.3   Subtle hint; text prompt still dominates
0.4 – 0.6   Balanced; style borrowed without overriding content
0.7 – 0.9   Reference-dominant; strong style or identity transfer
1.0+        Reference takes over; useful for faces, risky for style

Start at 0.5 and adjust. If the output doesn’t resemble the reference enough, raise by 0.1. If the output ignores your text prompt, lower by 0.1.
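The bands and the adjust-by-0.1 rule above can be written down as a small helper. This is a minimal sketch: the function names and the feedback labels (`"too_weak"`, `"ignores_prompt"`, `"ok"`) are hypothetical, the band boundaries follow the list above with in-between values assigned to the lower band, and the floor at 0.0 is an assumption.

```python
def describe_weight(weight):
    """Name the band a weight falls into, following the list above."""
    if weight < 0.4:
        return "subtle hint"
    if weight < 0.7:
        return "balanced"
    if weight < 1.0:
        return "reference-dominant"
    return "reference controls"

def adjust_weight(weight, feedback, step=0.1):
    """Nudge the weight one step based on how the last output looked.

    feedback is one of (hypothetical labels):
      "too_weak"       - output does not resemble the reference enough
      "ignores_prompt" - output ignores the text prompt
      "ok"             - keep the current weight
    """
    if feedback == "too_weak":
        weight += step
    elif feedback == "ignores_prompt":
        weight -= step
    elif feedback != "ok":
        raise ValueError(f"unknown feedback: {feedback!r}")
    # Assumption: never go below zero; no upper cap, since the list allows 1.0+.
    return round(max(weight, 0.0), 2)

weight = 0.5                                # suggested starting point
weight = adjust_weight(weight, "too_weak")  # one step up
band = describe_weight(weight)
```

Rounding to two decimals keeps repeated 0.1 steps from accumulating floating-point noise in the value you type into your UI.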

Settings by transfer goal

General style transfer (palette, texture, mood):

Model: Plus
Weight: 0.4–0.6
Reference: the image whose aesthetic you want to borrow
Text prompt: describe your actual subject and scene

Facial identity (portrait consistency):

Model: Plus-Face
Weight: 0.7–0.85
Tip: use a clean, front-facing reference with good lighting; avoid group photos
Text prompt: describe the scene, clothing, expression — not the person’s appearance

Subtle creative inspiration:

Model: Light
Weight: 0.2–0.35
The reference acts as a mood board without constraining the generation
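The three presets above fit naturally into a lookup table. The sketch below is illustrative only: the goal keys and the `starting_settings` helper are hypothetical names, and starting at the midpoint of each range is an assumption, not a rule from this guide.

```python
# Presets copied from the three goals above; the dictionary keys are hypothetical.
PRESETS = {
    "style": {"model": "Plus", "weight_range": (0.4, 0.6)},
    "face": {"model": "Plus-Face", "weight_range": (0.7, 0.85)},
    "inspiration": {"model": "Light", "weight_range": (0.2, 0.35)},
}

def starting_settings(goal):
    """Return the model variant and a starting weight for a transfer goal.

    Assumption: the midpoint of the recommended range is a reasonable
    first value to try before tuning by eye.
    """
    if goal not in PRESETS:
        raise KeyError(f"unknown goal {goal!r}; expected one of {sorted(PRESETS)}")
    preset = PRESETS[goal]
    low, high = preset["weight_range"]
    return {"model": preset["model"], "weight": round((low + high) / 2, 3)}

settings = starting_settings("face")
```

Keeping the ranges as data rather than hard-coded branches makes it easy to add your own goals once you have found values that work for your images.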

Combining IP-Adapter with ControlNet

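This is the most powerful combination. IP-Adapter handles look (style, color, identity) while ControlNet handles structure (pose, depth, edges, composition). They act on different parts of the model (IP-Adapter through cross-attention, ControlNet through features added into the U-Net), so their effects stack rather than replace each other.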

Practical tips for stacking them:

Reduce IP-Adapter weight by about 0.1 when using ControlNet (to avoid competing conditioning).
Feed ControlNet its own control image (a pose extract or depth map) separate from the IP-Adapter reference.
If the ControlNet structure is fighting the IP-Adapter style, lower the ControlNet conditioning scale slightly.

A typical two-image workflow: reference photo for style → IP-Adapter Plus at 0.5; pose skeleton of your desired pose → ControlNet OpenPose at 0.7 (if your control image is a depth map, use a depth ControlNet instead). Text prompt adds scene context. The output adopts the reference’s aesthetic while following the ControlNet structure.
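The stacking tips can be captured as a small planning helper that produces the numbers to enter in whichever UI or pipeline you use. This is a sketch under stated assumptions: the function names and dictionary keys are hypothetical, the 0.1 reduction comes from the tip above, and the 0.05 step for relaxing ControlNet is an assumed value for "slightly".

```python
def stack_with_controlnet(ip_weight, controlnet_scale=0.7, reduction=0.1):
    """Lower the standalone IP-Adapter weight when ControlNet is also active."""
    return {
        "ip_adapter_weight": round(max(ip_weight - reduction, 0.0), 2),
        "controlnet_scale": controlnet_scale,
    }

def relax_controlnet(settings, step=0.05):
    """Ease the ControlNet scale if structure is fighting the reference style.

    Assumption: 0.05 is an illustrative step size, not a recommended value.
    Returns a new dict and leaves the input untouched.
    """
    relaxed = dict(settings)
    relaxed["controlnet_scale"] = round(max(settings["controlnet_scale"] - step, 0.0), 2)
    return relaxed

# A style weight of 0.6 on its own becomes 0.5 once ControlNet is stacked.
plan = stack_with_controlnet(0.6)
softer = relax_controlnet(plan)
```

Treat the output as a starting point: the right balance still depends on how strongly your control image and reference image disagree.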