PromptCount.ai
Image · 4 min read

Midjourney prompt length — the 77-token rule and what it really means

Image models have a hard token cap most users never see. Here's how to write image prompts that respect it.

by PromptCount Team

If you've ever written a beautifully detailed Midjourney or Stable Diffusion prompt and wondered why the result ignores half of it, the answer is almost always the same: CLIP truncation at 77 tokens.

Stable Diffusion and many other image generators use CLIP as their text encoder. CLIP has a fixed input size of 77 tokens, period. Anything longer is either truncated or, depending on the implementation, split into chunks that are encoded separately and then combined. Either way, the second half of a long prompt rarely lands the way you intended.
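The two strategies can be sketched in a few lines of Python. This is a simplified illustration, not the code of any particular tool: the `<start>`/`<end>` strings stand in for CLIP's start-of-text and end-of-text tokens, padding is omitted, and how a tool merges the per-chunk embeddings afterwards (concatenating, averaging, or something else) varies by implementation.

```python
# Simplified sketch of how a 77-token text encoder copes with long prompts.
# "<start>"/"<end>" stand in for CLIP's special tokens; padding is omitted.
START, END = "<start>", "<end>"
CONTEXT_LENGTH = 77                  # CLIP's fixed input size
CONTENT_LENGTH = CONTEXT_LENGTH - 2  # 75 slots left for prompt tokens


def truncate(tokens: list[str]) -> list[str]:
    """Strategy 1: keep the first 75 prompt tokens and drop the rest."""
    return [START, *tokens[:CONTENT_LENGTH], END]


def chunk(tokens: list[str]) -> list[list[str]]:
    """Strategy 2: wrap every 75-token slice in its own 77-token window.

    Each window is encoded on its own, so a later chunk is encoded without
    the context of earlier ones; the tool then merges the chunk embeddings.
    """
    return [
        [START, *tokens[i:i + CONTENT_LENGTH], END]
        for i in range(0, max(len(tokens), 1), CONTENT_LENGTH)
    ]


if __name__ == "__main__":
    prompt_tokens = [f"tok{i}" for i in range(100)]  # a 100-token prompt
    print(len(truncate(prompt_tokens)))              # 77: 25 tokens lost
    print([len(w) for w in chunk(prompt_tokens)])    # [77, 27]
```

Truncation loses the tail outright; chunking keeps it, but the second window never "sees" the first, which is why late details tend to land weakly.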

This isn't a Midjourney bug. It's a CLIP architectural fact. Once you understand it, you write better image prompts.

What 77 tokens looks like

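In practice, 77 tokens is about 55–60 English words of descriptive prompt text, and two of those 77 slots are always taken by the start-of-text and end-of-text markers. Prompts heavy on comma-separated adjectives, numbers, or hyphenated terms fit fewer words, because every comma, hyphen, and digit costs a token of its own.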
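A rough way to see why words and tokens diverge is to mimic CLIP's pre-tokenization step, which splits text into letter runs, single digits, and punctuation before byte-pair encoding. The regex below is an approximation of that step (an assumption, not the real tokenizer), and `estimate_clip_tokens` is a hypothetical helper; because real BPE can break rare words into several pieces, actual counts are often higher.

```python
import re

# Approximation of CLIP's pre-tokenizer (assumption, not the real BPE):
# letter runs, single digits, and runs of other symbols each count as one
# piece. Rare words may split further in the real tokenizer.
_PIECE = re.compile(r"[^\W\d_]+|\d|[^\w\s]+")

CONTEXT_LENGTH = 77                  # CLIP's fixed input size
CONTENT_BUDGET = CONTEXT_LENGTH - 2  # minus start/end markers = 75


def estimate_clip_tokens(prompt: str) -> int:
    """Estimate prompt tokens, excluding the two special markers."""
    return len(_PIECE.findall(prompt))


def words_vs_tokens(prompt: str) -> tuple[int, int]:
    """Return (word count, estimated token count) for a prompt."""
    return len(prompt.split()), estimate_clip_tokens(prompt)


if __name__ == "__main__":
    sample = "Cinematic portrait, shot on 85mm lens, 9:16 vertical."
    words, tokens = words_vs_tokens(sample)
    print(f"{words} words -> ~{tokens} of {CONTENT_BUDGET} tokens")
```

Eight words come out as roughly 16 tokens here, because the commas, the period, and every digit in "85mm" and "9:16" count separately.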

Here's a 34-word prompt (roughly 50 tokens) that fits comfortably under the limit:

Cinematic portrait of a futuristic fashion model in a neon-lit Tokyo street, shot on 85mm lens, soft rim lighting, editorial photography style, moody atmosphere, sharp focus on eyes, shallow depth of field, 9:16 vertical.

And here's a prompt that almost certainly exceeds 77 tokens:

Cinematic portrait of a futuristic fashion model walking through a neon-lit Tokyo street at night, surrounded by holographic billboards and floating drone advertisements, dressed in a metallic silver outfit with intricate cybernetic accessories, shot on a Sony A7R IV with an 85mm prime lens, soft rim lighting from purple and cyan neon signs, editorial photography style influenced by Petra Collins and Wong Kar-wai, moody atmosphere, sharp focus on the eyes with shallow depth of field, slight film grain, 9:16 vertical aspect ratio, ultra-detailed, photorealistic, professional photography, masterpiece quality.

This second version runs well over 100 tokens. Only the first 75 or so are encoded reliably; everything past the cutoff is either dropped or pushed into a separate chunk where its influence is diluted.
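To see roughly where the cutoff falls in a specific prompt, you can walk the token pieces and split the text after the last one that fits. This sketch uses a regex approximation of CLIP's pre-tokenization (an assumption, not CLIP's actual BPE), and `split_at_cutoff` is a hypothetical helper, so treat the split point as an estimate; with a real tokenizer the cut tends to land a little earlier.

```python
import re

# Approximate CLIP pre-tokenization (assumption, not the real tokenizer).
_PIECE = re.compile(r"[^\W\d_]+|\d|[^\w\s]+")


def split_at_cutoff(prompt: str, budget: int = 75) -> tuple[str, str]:
    """Return (text the encoder keeps, text past the cutoff)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    pieces = list(_PIECE.finditer(prompt))
    if len(pieces) <= budget:
        return prompt, ""
    cut = pieces[budget - 1].end()  # end of the last piece that still fits
    return prompt[:cut], prompt[cut:].strip()


if __name__ == "__main__":
    kept, lost = split_at_cutoff("moody, cinematic, 85mm lens, film grain", budget=7)
    print("kept:", kept)  # kept: moody, cinematic, 85mm
    print("lost:", lost)  # lost: lens, film grain
```

The tiny `budget=7` just makes the effect visible on a short string; with the default of 75, pasting in the long prompt above shows which of its details never reach the encoder's first window.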

How to know where you stand

The AI Prompt Counter with the Stable Diffusion preset shows you the percentage of the 77-token budget you've used. Green under 50%, amber under 85%, red after that.
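The color bands are simple arithmetic on the share of the budget used. The function below is a hypothetical sketch that applies the thresholds quoted above; it is not the AI Prompt Counter's actual code, and `budget_band` is a name made up for illustration.

```python
def budget_band(tokens_used: int, budget: int = 77) -> tuple[float, str]:
    """Return (percent of budget used, color band).

    Thresholds follow the bands described in the text: green under 50%,
    amber under 85%, red at 85% and above. Hypothetical helper.
    """
    percent = 100 * tokens_used / budget
    if percent < 50:
        band = "green"
    elif percent < 85:
        band = "amber"
    else:
        band = "red"
    return round(percent, 1), band


if __name__ == "__main__":
    for used in (30, 60, 70):
        print(used, budget_band(used))
```

With a 77-token budget, 60 tokens sits in amber (about 78%) while 70 tokens is already red (about 91%).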

For chat prompts, counting tokens is optional; for image prompts, it isn't. With chat models, going from 200 to 400 tokens costs you a bit of latency and money. With CLIP, going from 70 to 80 tokens silently changes what the model sees.

What makes image prompts long

Three habits push image prompts past the limit:

1. Stacking style words

...moody, atmospheric, dramatic, cinematic, professional, photorealistic, ultra-detailed, hyperrealistic, masterpiece, 8K, sharp focus...

Each adjective costs at least one token, plus another for the comma after it. Pick three or four that actually point at different things. "Cinematic", "dramatic", and "moody" largely overlap, so pick one.
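Pruning overlapping style words can be automated once you decide which words overlap. In this sketch the `OVERLAPS` groups are a hypothetical example (the first follows the "cinematic / dramatic / moody" point above; the rest are judgment calls), and `prune_style_words` keeps the first word it meets from each group while leaving everything else alone.

```python
# Hypothetical overlap groups; which words "point at the same thing" is a
# judgment call, so adjust these for your own prompts.
OVERLAPS = [
    {"cinematic", "dramatic", "moody", "atmospheric"},
    {"photorealistic", "hyperrealistic"},
    {"ultra-detailed", "8k"},
]


def prune_style_words(words: list[str]) -> list[str]:
    """Keep the first word from each overlap group; drop its near-synonyms."""
    used_groups: set[int] = set()
    kept: list[str] = []
    for word in (w.strip() for w in words):
        if not word:
            continue  # skip empty entries, e.g. from a trailing comma
        group = next(
            (i for i, g in enumerate(OVERLAPS) if word.lower() in g), None
        )
        if group is None:
            kept.append(word)  # not in any group: keep it
        elif group not in used_groups:
            used_groups.add(group)
            kept.append(word)  # first member of its group wins
    return kept


if __name__ == "__main__":
    stack = ("moody, atmospheric, dramatic, cinematic, professional, "
             "photorealistic, ultra-detailed, hyperrealistic, masterpiece, "
             "8K, sharp focus")
    print(", ".join(prune_style_words(stack.split(","))))
```

On the stacked list above, eleven style words shrink to six; the remaining quality words are the subject of habit 3.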

2. Naming influences in long form

...in the style of Petra Collins meets Wong Kar-wai meets Ridley Scott meets Blade Runner...

Replace with a single chosen influence. The model can only blend so many references in 77 tokens.

3. Adding meaningless quality words

...masterpiece, best quality, ultra-detailed, professional photography, award-winning...

These were useful when older Stable Diffusion checkpoints rewarded them. Modern models trained on better data largely don't need them, and they burn tokens that could carry actual visual information.
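Stripping that boilerplate is a simple filter over comma-separated phrases. The `FILLER` set below is a hypothetical list built from the examples above, not an authoritative one; which words a given model ignores is something to test rather than assume.

```python
# Hypothetical filler list drawn from the examples above; tune it for the
# model you actually use.
FILLER = {
    "masterpiece", "best quality", "ultra-detailed",
    "professional photography", "award-winning",
}


def strip_filler(prompt: str) -> str:
    """Drop comma-separated phrases that exactly match a filler phrase."""
    phrases = [p.strip() for p in prompt.split(",")]
    kept = [p for p in phrases if p and p.lower() not in FILLER]
    return ", ".join(kept)


if __name__ == "__main__":
    print(strip_filler(
        "portrait of a chef, masterpiece, best quality, "
        "soft window light, award-winning"
    ))  # portrait of a chef, soft window light
```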

The frame that works

A reusable structure for image prompts under 77 tokens:

| Slot | Tokens | Example |
|---|---|---|
| Subject | 5–10 | A futuristic fashion model |
| Setting | 5–10 | in a neon-lit Tokyo street |
| Lighting / camera | 10–15 | 85mm lens, soft rim lighting, shallow depth of field |
| Style | 5–10 | editorial photography style |
| Composition / aspect | 3–6 | 9:16 vertical, sharp focus on eyes |

Total: ~40–60 tokens. Room to spare for one or two specifics.
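The frame can be turned into a small helper that joins the slots in order and checks the result against the budget. Slot names follow the frame above; `build_prompt` is a hypothetical name, and the token count uses a rough regex approximation of CLIP's pre-tokenization (an assumption, so leave some headroom).

```python
import re

# Rough approximation of CLIP's pre-tokenizer (assumption, not real BPE).
_PIECE = re.compile(r"[^\W\d_]+|\d|[^\w\s]+")

# Slot order from the frame above.
SLOT_ORDER = ("subject", "setting", "lighting_camera", "style", "composition")


def build_prompt(slots: dict[str, str], budget: int = 75) -> tuple[str, int, bool]:
    """Join filled slots in order; return (prompt, estimated tokens, fits)."""
    prompt = ", ".join(slots[name] for name in SLOT_ORDER if slots.get(name))
    tokens = len(_PIECE.findall(prompt))
    return prompt, tokens, tokens <= budget


if __name__ == "__main__":
    prompt, tokens, fits = build_prompt({
        "subject": "A futuristic fashion model",
        "setting": "in a neon-lit Tokyo street",
        "lighting_camera": "85mm lens, soft rim lighting, shallow depth of field",
        "style": "editorial photography style",
        "composition": "9:16 vertical, sharp focus on eyes",
    })
    print(prompt)
    print(f"~{tokens} tokens, fits: {fits}")
```

By this estimate the example slots come to about 41 tokens, which leaves room for a detail or two before the budget gets tight.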

Negative prompts count too — but separately

Most image tools handle negative prompts (--no in Midjourney) in a separate CLIP pass, so a 30-token positive prompt and a 30-token negative prompt don't compete for the same 77 tokens. Don't pile every concern into the positive prompt when a short negative prompt would do the same job.

A good negative prompt is short and aimed:

blurry, distorted hands, text, watermark, plastic skin, low resolution

That's about 14 tokens (nine words plus five commas), and it heads off many of the most common failure modes.
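Because the positive and negative prompts are encoded separately, each can be checked against its own budget rather than a shared one. A minimal sketch, using a regex approximation of CLIP's pre-tokenization (an assumption; real counts can run higher) and a hypothetical `check_prompts` helper:

```python
import re

# Rough approximation of CLIP's pre-tokenizer (assumption, not real BPE).
_PIECE = re.compile(r"[^\W\d_]+|\d|[^\w\s]+")
CONTENT_BUDGET = 75  # 77-token window minus start/end markers


def check_prompts(positive: str, negative: str) -> dict[str, tuple[int, bool]]:
    """Count each prompt against its own budget: {name: (tokens, fits)}."""
    return {
        name: (n, n <= CONTENT_BUDGET)
        for name, n in (
            ("positive", len(_PIECE.findall(positive))),
            ("negative", len(_PIECE.findall(negative))),
        )
    }


if __name__ == "__main__":
    print(check_prompts(
        "portrait of a chef, soft window light",
        "blurry, distorted hands, text, watermark, plastic skin, low resolution",
    ))
```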

Two short prompts often beat one long one

If you have lots of specifics you can't fit, split into two prompts:

  1. First pass: composition, subject, setting
  2. Second pass: variations on lighting, style, mood

Or use an image-to-image refinement step. Stuffing every detail into one CLIP-77 budget usually loses information.

A 30-second test

Take your last Midjourney or Stable Diffusion prompt. Paste it into the AI Prompt Counter with Stable Diffusion selected as the model. If you're red, trim. If you're at 70–80% of the limit, you're using the budget well.

The goal isn't to write the shortest possible prompt. It's to fit the most useful information into the limited window the model actually reads.
