Why analyse open-text feedback at all if it is hard?

Because the most actionable insight lives there. Rating scales tell you that something scored low; only the free-text answers tell you why. Teams skip open text because reading hundreds of responses by hand is impractical, and that is exactly the bottleneck AI removes by clustering and summarising at scale.

How does theme clustering actually work in production?

You convert each response into an embedding — a numeric vector capturing its meaning — then group vectors that sit close together, so responses about the same topic cluster even when they use different words. The demo here approximates this with shared keywords to make the concept tangible in the browser.
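To make "group vectors that sit close together" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up by hand to stand in for real embeddings, and the greedy single-pass grouping and the 0.8 threshold are illustrative assumptions, not the method of any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster_by_similarity(vectors, threshold=0.8):
    """Greedy single pass: a vector joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster.
    Returns clusters as lists of indices into `vectors`."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hand-made toy vectors (assumption): a real embedding model would
# return hundreds of dimensions per response.
toy_embeddings = [
    [0.9, 0.1, 0.0],   # "the app is too slow"
    [0.8, 0.2, 0.1],   # "loading takes forever"
    [0.0, 0.1, 0.9],   # "pricing is confusing"
    [0.1, 0.0, 0.95],  # "I don't understand the plans"
]
groups = cluster_by_similarity(toy_embeddings, threshold=0.8)
print(groups)  # [[0, 1], [2, 3]]
```

The two speed complaints land together and the two pricing complaints land together, even though the grouping never looks at the words themselves. With real data you would likely swap the greedy pass for an established clustering algorithm.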

Where does the LLM fit in?

After clustering. Once responses are grouped, you send each cluster to a language model and ask for a short, plain-language summary of what that group is saying, plus a suggested theme label. The LLM turns a pile of similar quotes into a one-line insight a stakeholder can act on.
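As a rough sketch of that step, the code below builds one prompt per cluster and hands it to a model. The prompt wording, the `call_llm` parameter, and the `fake_llm` stand-in are all hypothetical; replace `call_llm` with whatever client function your provider offers.

```python
def build_cluster_prompt(responses):
    """Build one prompt that contains every response in the cluster verbatim."""
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(responses, start=1)
    )
    return (
        "The survey responses below were grouped into one cluster.\n"
        "Write a one-sentence plain-language summary of what this group is "
        "saying, and suggest a short theme label.\n\n"
        f"Responses:\n{numbered}\n\n"
        "Reply as:\nLabel: <theme label>\nSummary: <one sentence>"
    )

def summarise_cluster(responses, call_llm):
    """Send one cluster to the model. `call_llm` is any function that
    takes a prompt string and returns the model's text reply."""
    return call_llm(build_cluster_prompt(responses))

def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "Label: Slow loading\nSummary: Users report long load times."

cluster = ["the app is too slow", "loading takes forever"]
prompt = build_cluster_prompt(cluster)
reply = summarise_cluster(cluster, fake_llm)
print(reply)
```

Passing the model call in as a parameter keeps the prompt logic testable without network access.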

How do I avoid the AI inventing themes that are not there?

Ground every summary in the actual responses by passing the real text to the model and asking it to summarise only what is present, not to speculate. Always keep representative quotes alongside each theme so a human can verify the summary against what people actually said.
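One way to put both safeguards into code is sketched below: a grounding rule appended to the prompt, and a check that every quote shown next to a theme matches a real response. The rule's wording and the helper names (`with_grounding`, `pick_quotes`, `unverified_quotes`) are illustrative assumptions.

```python
GROUNDING_RULE = (
    "Summarise only what is present in the responses above. "
    "Do not speculate or add causes, numbers, or themes that no response mentions."
)

def with_grounding(prompt):
    """Append the grounding rule to any cluster prompt."""
    return f"{prompt}\n\n{GROUNDING_RULE}"

def pick_quotes(responses, max_quotes=2):
    """Keep the first few responses verbatim as representative quotes."""
    return list(responses[:max_quotes])

def unverified_quotes(quotes, responses):
    """Return quotes that do not match any real response (case-insensitive)."""
    originals = {r.strip().lower() for r in responses}
    return [q for q in quotes if q.strip().lower() not in originals]

responses = ["the app is too slow", "loading takes forever", "pages hang on login"]
quotes = pick_quotes(responses)

# A quote that came back from a model but was never written by a respondent
suspect = unverified_quotes(quotes + ["the app crashes daily"], responses)
print(suspect)  # ['the app crashes daily']
```

Picking quotes straight from the cluster, rather than asking the model to produce them, means they are verbatim by construction; the check matters when quotes pass through a model on the way.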

Can this scale to thousands of responses?

Yes. Embedding and clustering are cheap and fast, and you call the LLM once per cluster rather than once per response, so cost stays low even with thousands of answers. The summarisation step scales with the number of themes, which usually stays small, not with the number of responses.
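The difference in call counts is easy to see with a small calculation. The figures below (5,000 responses spread over 12 themes) are invented purely for illustration.

```python
def count_llm_calls(cluster_ids):
    """Compare LLM calls when summarising per response versus per cluster.
    `cluster_ids` holds one cluster id for each response."""
    per_response = len(cluster_ids)
    per_cluster = len(set(cluster_ids))
    return per_response, per_cluster

# Illustrative numbers only: 5,000 responses assigned to 12 themes
cluster_ids = [i % 12 for i in range(5000)]
per_response, per_cluster = count_llm_calls(cluster_ids)
print(per_response, per_cluster)  # 5000 12
```

Each per-cluster call carries more text than a per-response call would, so you may still need to sample or truncate very large clusters to fit a model's input limit.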

How to Build an AI-Powered Feedback Collection Tool

What you are building

Most surveys collect a rating and a free-text comment, and most teams analyse the rating and ignore the comments — because reading hundreds of open responses by hand is impractical. Yet the free text is where the why lives. This tutorial builds a feedback analyser that fixes that: it takes open-text responses, groups them into themes by similarity, summarises each theme, and ranks them by how many people raised them. The result is a one-page view of what your users are actually saying, generated in seconds. The demo below runs a working version of the clustering step right in your browser so you can paste responses and watch themes emerge.

How it works

The pipeline has three stages. First, collect the open-text responses into a list.

Second, cluster them so answers about the same topic land together. In production you turn each response into an embedding (a numeric vector that captures meaning) and group vectors that sit close, which clusters “the app is too slow” with “loading takes forever” even though they share no words. The in-browser demo approximates this with shared significant keywords, which is enough to show the concept live without any model.

Third, summarise each cluster by counting its size, surfacing its defining terms and a representative quote, and, in the full version, sending the cluster to an LLM for a one-line plain-language summary. You call the model once per theme, not once per response, so cost stays low even at thousands of responses.
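The keyword approximation can be sketched in a few lines. This is one plausible reading of "shared significant keywords", not the demo's actual source: the stopword list, the `min_shared` parameter, and the greedy grouping are assumptions.

```python
import re

# Small illustrative stopword list (assumption); a real one would be longer
STOPWORDS = {
    "the", "a", "an", "is", "are", "it", "to", "too", "and", "of",
    "in", "on", "i", "my", "for", "very", "so", "this", "that", "with",
}

def keywords(text):
    """Lowercase the text and keep words that are not stopwords or very short."""
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS and len(w) > 2}

def cluster_by_keywords(responses, min_shared=1):
    """Greedy grouping: a response joins the first cluster it shares at
    least `min_shared` keywords with, otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"keywords": set, "responses": list}
    for text in responses:
        kws = keywords(text)
        for cluster in clusters:
            if len(kws & cluster["keywords"]) >= min_shared:
                cluster["responses"].append(text)
                cluster["keywords"] |= kws
                break
        else:
            clusters.append({"keywords": set(kws), "responses": [text]})
    return clusters

responses = [
    "The app is too slow",
    "Checkout is slow on mobile",
    "Pricing is confusing",
    "Loading takes forever",
    "The pricing page is unclear",
]
clusters = cluster_by_keywords(responses)
for c in clusters:
    print(len(c["responses"]), c["responses"])
```

Note that "Loading takes forever" ends up alone because it shares no keyword with the other speed complaints. That is the gap embedding similarity closes. Because each cluster's keyword set grows as responses join, a long run of loosely related answers can also drift together; raising `min_shared` is one way to limit that.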

Tips and notes

Paste one response per line into the tool below to try it. Notice how the biggest clusters surface first. That ranking by volume is the whole point, because a theme raised by sixty people usually matters more than one raised by two. In production, the keyword approach becomes embedding similarity, which catches paraphrases and synonyms the demo misses. Keep representative quotes attached to every theme so a human can always verify a summary against what people actually wrote, and instruct any LLM summariser to describe only what is present rather than speculate. That instruction reduces the risk of invented themes, and the attached quotes let a reviewer catch any that slip through. Build the cluster-and-count step first; the LLM summary is a thin, optional layer on top of a pipeline that already delivers value on its own.
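The cluster-and-count step needs no model at all. Below is a minimal sketch that takes already-clustered responses and returns each theme's size, most frequent terms, and a quote, ranked by volume. The sample clusters, the stopword list, and the choice of the first response as the quote are assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption)
STOPWORDS = {"the", "a", "an", "is", "are", "it", "to", "too", "and", "of", "in", "on"}

def describe_cluster(responses, top_n=3):
    """Summarise one cluster without an LLM: size, frequent terms, one quote."""
    counts = Counter(
        word
        for text in responses
        for word in re.findall(r"[a-z']+", text.lower())
        if word not in STOPWORDS and len(word) > 2
    )
    return {
        "size": len(responses),
        "terms": [term for term, _ in counts.most_common(top_n)],
        "quote": responses[0],  # first response kept verbatim
    }

def rank_themes(clusters):
    """Describe every cluster and put the most-raised themes first."""
    themes = [describe_cluster(c) for c in clusters]
    return sorted(themes, key=lambda t: t["size"], reverse=True)

# Hypothetical clusters, as produced by an earlier clustering step
clusters = [
    ["Pricing is confusing"],
    ["The app is too slow", "Checkout is slow on mobile", "Search is slow"],
    ["Export to CSV fails", "CSV export is broken"],
]
ranked = rank_themes(clusters)
for theme in ranked:
    print(theme["size"], theme["terms"], "|", theme["quote"])
```

The output is already a usable one-page view; an LLM summary can later replace the raw term list with a sentence without changing anything else in the pipeline.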