Thematic Analysis: Coding Qualitative Data Into Themes

01 What thematic analysis is

Thematic analysis is a method for identifying, organizing, and reporting patterns of meaning across a body of qualitative data. You start with unstructured text — what people actually wrote or said — and end with a small set of named themes that describe what is going on in that text, each one backed by specific examples.

It is one of the most widely used approaches in qualitative research, in large part because it is flexible: it does not commit you to a particular theory of the world up front, and it works on almost any kind of text. The version most people learn is the six-phase approach described by Virginia Braun and Victoria Clarke, whose 2006 paper made the steps explicit and repeatable. This guide follows that structure and attributes it to them, but the underlying idea — read closely, label what you see, group the labels — is older and broadly shared across qualitative traditions.

The honest framing matters from the start. Thematic analysis describes patterns in the data you collected. It tells you what themes are present and how widely they appear in your sample. It does not, on its own, tell you how common those themes are in a wider population. That is a separate, quantitative question. Keep that boundary in view through everything below.

02 Codes versus themes

The two core units of thematic analysis are codes and themes, and conflating them is the most common beginner mistake.

A code is a short label attached to a specific chunk of text: a sentence, a phrase, a passage. It captures one observation, such as "cancelled because of price", "confused by onboarding", or "mentions a competitor by name". Codes are granular and close to the raw text. A single document might carry several codes; the same code might appear across dozens of documents.

A theme is a higher-level pattern that pulls related codes together into something meaningful. "Confused by onboarding", "could not find the export button", and "gave up before finishing setup" might all roll up into a theme like "friction in the first session". A theme is not just a bucket of codes: it is a claim about a pattern that says something coherent about the data. If you cannot describe in a sentence what a theme is about, it is probably still a code cluster, not a theme.

A useful test: codes answer "what is in this passage?"; themes answer "what is this body of data telling me?"

03 Inductive versus deductive coding

There are two directions you can code from, and most real projects mix them.

Inductive (bottom-up) coding lets the codes emerge from the data. You read with as few preconceptions as you can manage and label what is actually there. This is the right default when you are exploring and do not yet know what you will find — it surfaces surprises.

Deductive (top-down) coding starts from a predefined framework. You decide in advance the categories you care about — say, the stages of a purchase journey, or a fixed list of feature areas — and you sort the data into them. This is faster and more comparable across analysts, but it can only ever find what your framework anticipated; anything outside it gets dropped or forced.

In practice many analysts run a deductive pass for the structure they already need to report against, then an inductive pass to catch what the framework missed. Be explicit with yourself about which mode you are in, because they fail in opposite ways: pure inductive drifts and produces too many one-off codes; pure deductive misses the unexpected.
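
The mixed workflow above can be sketched in a few lines. This is a minimal illustration, not a recommended classifier: the `FRAMEWORK` dictionary, its trigger phrases, and the sample passages are all hypothetical, and substring matching is only a crude stand-in for an analyst's judgment. The point it shows is structural: passages the framework does not anticipate are queued for an inductive read instead of being dropped or forced into a category.

```python
# Hypothetical deductive framework: category -> trigger phrases (assumed for illustration).
FRAMEWORK = {
    "onboarding": ["setup", "onboarding"],
    "pricing": ["price", "discount"],
}

def deductive_pass(passages):
    """Sort passages into framework categories; collect the ones that match nothing."""
    coded = {}
    unmatched = []
    for passage_id, text in passages:
        lowered = text.casefold()
        categories = [
            category
            for category, phrases in FRAMEWORK.items()
            if any(phrase in lowered for phrase in phrases)
        ]
        if categories:
            coded[passage_id] = categories
        else:
            # Outside the framework: keep it for the inductive pass.
            unmatched.append(passage_id)
    return coded, unmatched

# Hypothetical passages, invented for this sketch.
passages = [
    ("p1", "The setup screen was confusing."),
    ("p2", "Is there a discount for annual plans?"),
    ("p3", "Support never answered my email."),
]

coded, needs_inductive_review = deductive_pass(passages)
print(coded)
print(needs_inductive_review)
```

Here the third passage fits neither category, so it lands in `needs_inductive_review`, which is exactly the material a purely deductive pass would lose.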

04 The six phases, applied to a concrete example

1. Familiarization

Read the whole corpus before you label anything. In the running example, the corpus is a set of discussion threads and support tickets: read them end to end, take rough notes, and let first impressions form. The goal is immersion, not coding. You are looking for the texture of the data: recurring complaints, the language people use, the moments they get stuck. Skipping this phase is how analysts end up coding the first ten documents in detail and the rest carelessly.

2. Generating initial codes

Go through the data systematically and attach a code to every passage that says something relevant. On our example set: tag a comment about a confusing setup screen as "onboarding friction", a post comparing prices as "price sensitivity", a ticket asking for a missing integration as "feature gap integrations". Keep a running code list and reuse exact labels, because a consistent code vocabulary is what lets you tally later. Labels that differ only in case, such as "F5Bot" and "f5bot", must be the same code, not two.
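
One way to enforce a consistent vocabulary is to normalize every label before counting. The sketch below is an assumption about how you might do that, not a prescribed method: the `normalize_code` helper and the sample labels are hypothetical, and the rules it applies (case folding, trimming, treating hyphens and underscores as spaces) are one reasonable choice among several.

```python
import re
from collections import Counter

def normalize_code(label: str) -> str:
    """Fold case, trim whitespace, and unify separators in a code label."""
    label = label.strip().casefold()
    # Treat hyphens, underscores, and runs of whitespace as a single space.
    return re.sub(r"[-_\s]+", " ", label)

# Hypothetical labels as they might be typed during a coding session.
raw_labels = [
    "F5Bot",
    "f5bot",
    "Onboarding friction",
    "onboarding-friction ",
    "price sensitivity",
]

tally = Counter(normalize_code(label) for label in raw_labels)
print(tally)
```

Without normalization these five labels would count as five separate codes; with it they collapse to three, which is what makes the later tallies meaningful.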

3. Searching for themes

Now step up a level. Sort your codes into candidate themes by grouping ones that belong together. "Onboarding friction", "could not find export", and "gave up during setup" might cluster into "first-run experience". Several pricing codes might form "willingness to pay and price anchoring". Some codes will not fit anywhere yet, so park them. You are sketching a structure, not finalizing it.
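
As a rough sketch of this grouping step, the snippet below keeps the analyst's code-to-theme assignments in a plain dictionary and parks anything unassigned. The mapping itself is the human judgment; the code only does the bookkeeping. The names `candidate_theme_for`, `codes_in_use`, and `parked` are hypothetical, and the assignments are illustrative.

```python
from collections import defaultdict

# Hypothetical analyst-made assignments: code -> candidate theme.
candidate_theme_for = {
    "onboarding friction": "first-run experience",
    "could not find export": "first-run experience",
    "gave up during setup": "first-run experience",
    "price comparison": "willingness to pay and price anchoring",
}

# Codes that actually appeared in the data (illustrative).
codes_in_use = [
    "onboarding friction",
    "could not find export",
    "gave up during setup",
    "price comparison",
    "mentions a competitor by name",
]

themes = defaultdict(list)
parked = []  # codes that do not fit any candidate theme yet
for code in codes_in_use:
    theme = candidate_theme_for.get(code)
    if theme is None:
        parked.append(code)
    else:
        themes[theme].append(code)

print(dict(themes))
print(parked)
```

Keeping the parked list explicit makes it harder to forget those codes when you return to them in the review phase.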

4. Reviewing themes

Pressure-test the candidate themes against the data. Two checks: internal coherence (do the coded extracts inside a theme actually belong together) and external distinctness (are two themes really different, or should they merge). Re-read the extracts under each theme. Split themes that are doing two jobs; collapse ones that overlap. Crucially, look for disconfirming cases — passages that contradict the theme. A theme that survives contact with the data that argues against it is far stronger than one built only from supporting quotes.

5. Defining and naming themes

For each surviving theme, write a short definition (a sentence or two stating exactly what it covers and what it excludes) and give it a clear, concrete name. "First-run friction: users who stall or quit before completing setup" is better than a vague label like "UX issues". If you cannot write the definition cleanly, the theme is not yet well formed and needs another review pass.

6. Writing up

Produce the report: each theme, its definition, how prevalent it was (in how many distinct sources it appeared), and two or three traceable quotes that illustrate it. Every quote should link back to its source thread or ticket so a reader can verify it in context. The write-up is also analysis — articulating a theme in prose often exposes that it needs splitting or merging, sending you back a phase. That is expected.

05 Practical tips that keep an analysis honest

Four habits separate a defensible thematic analysis from a tidy-looking one that does not hold up.

Use a consistent code vocabulary. Decide on exact code labels and reuse them verbatim. Normalize case and spelling before you count, or the same idea fragments across near-duplicate codes and the totals lie.
Count prevalence by distinct sources, not raw mentions. A theme that appears in forty different threads is stronger evidence than one that appears forty times in a single ranty thread. Track how many separate people or documents raise each theme.
Hunt for disconfirming cases on purpose. Actively search for text that contradicts each theme. If you only collect supporting quotes, you will confirm whatever you went in believing.
Keep every quote traceable. Store a link or ID back to the original source for each extract, so any claim in the report can be checked against the passage it came from. Untraceable quotes are unverifiable claims.
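
The difference between raw mentions and distinct sources is easy to see in a small sketch. Everything here is hypothetical: the `prevalence` function, the source IDs, and the quotes are invented for illustration. Each extract carries its source ID, which is also what keeps every quote traceable.

```python
from collections import defaultdict

# Hypothetical coded extracts: (source_id, theme, quote).
extracts = [
    ("thread-01", "first-run friction", "I gave up halfway through setup."),
    ("thread-01", "first-run friction", "Still cannot find the export button."),
    ("thread-01", "first-run friction", "The setup screen makes no sense."),
    ("ticket-17", "first-run friction", "Where is export?"),
    ("thread-02", "price sensitivity", "The competitor is half the price."),
    ("ticket-20", "price sensitivity", "Any discount for small teams?"),
    ("thread-09", "price sensitivity", "Too expensive for what it does."),
]

def prevalence(extracts):
    """Count raw mentions and distinct sources per theme."""
    mentions = defaultdict(int)
    sources = defaultdict(set)
    for source_id, theme, _quote in extracts:
        mentions[theme] += 1
        sources[theme].add(source_id)  # a set ignores repeats from one source
    return {
        theme: {
            "mentions": mentions[theme],
            "distinct_sources": len(sources[theme]),
        }
        for theme in mentions
    }

report = prevalence(extracts)
for theme, counts in report.items():
    print(theme, counts)
```

In this made-up data, first-run friction has more raw mentions (four against three) but fewer distinct sources (two against three), because one thread accounts for most of them. Ranking by mentions and ranking by sources would give opposite answers.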

06 Codes rolling up into a theme

Code                     Rolls up into theme    Distinct sources
onboarding friction      First-run friction     many
could not find export    First-run friction     several
gave up during setup     First-run friction     a few
price comparison         Price sensitivity      many
asked for discount       Price sensitivity      a few

Counts are illustrative buckets, not measured figures — the point is the structure, not the numbers. Prevalence is always counted per distinct source.

07 Where automated classification helps, and where it does not

Coding by hand is slow, and the slowest part is the first pass: reading every passage and attaching initial codes. This is where automated classification earns its place. A model can read a large corpus and apply a fixed set of labels far faster than a person, giving you a coded draft to work from instead of a blank page. It is genuinely useful for phase two on a large set.

But automation accelerates the mechanical part, not the interpretive part. Defining what a theme means, deciding whether two clusters are really distinct, spotting the disconfirming case, judging whether a pattern is meaningful or just frequent — those are human judgments. A classifier applies the vocabulary you give it; it does not decide which patterns matter. Treat machine coding as a fast first draft that a person reviews, corrects, and shapes into themes, not as the finished analysis.

Two cautions specific to automated coding. First, models are most reliable when constrained to a fixed enum of labels; let them invent free-form labels and you get a unique phrase per passage that never tallies. Second, a confident-looking label is still just a label — spot-check a sample against the raw text before you trust the totals.
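
Both cautions can be sketched together. The example below assumes the model's labels arrive as plain strings; the `Code` enum, `parse_label`, `spot_check_sample`, and the sample output are all hypothetical and do not describe any particular tool's API. Labels outside the fixed vocabulary are rejected rather than counted, and a reproducible random sample of the accepted ones is pulled for a human to check against the raw text.

```python
import random
from enum import Enum

class Code(Enum):
    """Fixed code vocabulary (illustrative values)."""
    ONBOARDING_FRICTION = "onboarding friction"
    PRICE_SENSITIVITY = "price sensitivity"
    FEATURE_GAP_INTEGRATIONS = "feature gap integrations"

def parse_label(raw: str):
    """Return the matching Code, or None if the label is outside the enum."""
    try:
        return Code(raw.strip().casefold())
    except ValueError:
        return None

# Hypothetical model output: (passage_id, label string).
model_output = [
    ("p1", "onboarding friction"),
    ("p2", "Price Sensitivity"),
    ("p3", "users seem annoyed about the setup wizard"),  # free-form label
    ("p4", "feature gap integrations"),
]

accepted = []
rejected = []
for passage_id, raw in model_output:
    code = parse_label(raw)
    if code is None:
        rejected.append(passage_id)  # send back for relabeling or manual coding
    else:
        accepted.append((passage_id, code))

def spot_check_sample(items, k, seed=0):
    """Pick up to k items at random, reproducibly, for human review."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))

to_review = spot_check_sample(accepted, k=2)
print(rejected)
print(to_review)
```

A fixed seed keeps the sample stable between runs, so two reviewers look at the same passages. How large the sample should be is a judgment call that this sketch does not settle.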

Where rawneed fits

rawneed is one way to run an observational pass over public discussion. You give it a plain-English question; it gathers relevant Reddit threads and classifies them along fixed dimensions — pain, willingness to pay, sentiment, and tools mentioned — then returns a ranked report where every signal links back to the source thread. That covers familiarization and a first coding pass on a large corpus quickly. The interpretive phases — reviewing themes, checking disconfirming cases, deciding what is meaningful — are still yours, and the output describes the threads it read, not a representative population.

08 The boundary to keep in mind

Thematic analysis is a strong way to understand what people are saying and why, in their own words. It is not a survey. The themes you find describe the data in front of you and how widely they appear within that sample — they do not establish how common those views are across everyone you did not read. When a theme looks important, the right next move is often to size it with a quantitative method, not to assume the proportions in your sample carry over. Used with that boundary respected, thematic analysis is one of the most useful tools for turning messy qualitative text into something you can act on.

Frequently asked questions

What is the difference between a code and a theme?

A code is a short label on a specific chunk of text capturing one observation, like "confused by onboarding". A theme is a higher-level pattern that groups related codes into a coherent claim about the data, like "friction in the first session". Codes answer what is in a passage; themes answer what the whole body of data is telling you.

What are the six phases of thematic analysis?

The six phases described by Braun and Clarke are: familiarization with the data, generating initial codes, searching for themes, reviewing themes, defining and naming themes, and writing up. The phases are iterative rather than strictly linear: you loop back as later phases expose problems in earlier ones.

What is the difference between inductive and deductive coding?

Inductive coding lets codes emerge from the data with as few preconceptions as possible, which is best for exploring. Deductive coding sorts data into a framework you defined in advance, which is faster and more comparable but can only find what the framework anticipated. Many projects mix both: a deductive pass for required structure, an inductive pass to catch what it missed.

Can automated classification replace manual coding?

Automated classification can accelerate the mechanical first pass (reading a large corpus and applying initial codes), but it does not replace the analyst. Defining what a theme means, judging whether patterns are meaningful, and finding disconfirming cases are human judgments. Treat machine coding as a fast first draft that a person reviews and shapes into themes.

Does thematic analysis tell you how common a theme is in a wider population?

No. Thematic analysis describes patterns in the data you collected and how widely they appear within that sample. It does not establish how common those themes are across a wider population; that is a quantitative question. If a theme looks important, size it with a separate quantitative method rather than assuming your sample proportions generalize.
