Methodology

How the pipeline actually works

No black box. Here is exactly how a plain-English claim becomes a ranked, sourced report — the steps, the scoring, the cost, and how we test that it holds up.


The run

From a claim to a report, in six steps.

State the hypothesis

You write your claim in plain English — e.g. “indie founders struggle to get their first 100 customers.”

Clarify the scope

A short set of AI-generated clarifying questions pins down who you mean and what counts, so the search does not drift.

Suggest & tier subreddits

Candidates are validated against Reddit itself: dead, private, or sub-500-member communities are dropped, and the rest are tiered Bullseye / Decent / Off-topic.
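
The drop rules above can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the `Candidate` fields, the `relevance` score, and the tier cut-offs are all assumptions made for the example, since the page does not say how tiers are decided. Only the 500-member floor comes from the text.

```python
from dataclasses import dataclass

MIN_MEMBERS = 500  # stated on this page: sub-500-member communities are dropped


@dataclass
class Candidate:
    name: str
    exists: bool       # False for a dead (or never-existing) subreddit
    is_private: bool
    members: int
    relevance: float   # hypothetical 0-1 score; how tiers are really decided is not stated


def keep(c: Candidate) -> bool:
    """Apply the three drop rules: dead, private, or too small."""
    return c.exists and not c.is_private and c.members >= MIN_MEMBERS


def tier(c: Candidate) -> str:
    """Map a relevance score to a tier. The cut-offs are illustrative assumptions."""
    if c.relevance >= 0.7:
        return "Bullseye"
    if c.relevance >= 0.4:
        return "Decent"
    return "Off-topic"


def validate_and_tier(candidates):
    """Return {name: tier} for every candidate that survives validation."""
    return {c.name: tier(c) for c in candidates if keep(c)}
```

In a real run the `exists`, `is_private`, and `members` values would come from a lookup against Reddit; here they are plain inputs so the filtering logic can be read on its own.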

Mine real search queries

Queries are built from the actual title phrases people post in your subs, with frequency counts — so you search the way your audience writes.
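
One way to picture phrase mining with frequency counts is an n-gram counter over post titles. The sketch below assumes word bigrams, a simple lowercase tokenizer, and a `min_count` threshold; none of these details are given on this page, and `mine_phrases` is a hypothetical name.

```python
import re
from collections import Counter


def mine_phrases(titles, n=2, min_count=2):
    """Count n-word phrases across titles and keep the ones that recur.

    Each phrase is counted at most once per title, so a single repetitive
    title cannot inflate a phrase's frequency.
    """
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        phrases = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(phrases)
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]


titles = [
    "How do I get my first customers?",
    "Getting first customers is hard",
    "First customers for a SaaS",
]
print(mine_phrases(titles))  # [('first customers', 3)]
```

The counts are what make the output usable as a search query list: a phrase that appears in many titles is one the audience actually types.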

Set the pipeline knobs

Choose how many threads to classify and how many run in parallel; a live cost estimate updates as you adjust, so there are no surprises.

Review & launch

Watch live logs, a running cost meter, and a per-thread status grid. Stop and resume any time — runs are resumable.

The classification schema

Every thread is scored into a fixed schema.

A fixed schema is what lets the fields aggregate across hundreds of posts, instead of each thread producing one-off prose that cannot be compared.

thread_classification (8 fields, per thread)

pain_signal (int, 0–100): Intensity of the frustration in the thread.
wtp_tier (enum): Willingness to pay: high / medium / low / none.
tools_mentioned (string[]): The products and services named in the discussion.
sentiment_toward_tools (enum, per tool): positive / negative / mixed / neutral, for each tool.
primary_use_case (enum): Market research, lead gen, brand monitoring, content, or other.
relevance_score (int, 0–10): Match between the thread and your claim.
key_quotes (string[]): Verbatim, with links back to source, plus the best from the OP and top reply.
summary (string): A one-line, plain-English gist of the thread.
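
The eight fields above map naturally onto a typed record. The following is a sketch under stated assumptions rather than the pipeline's actual definition: the enum value spellings (for example `lead_gen`), the range checks in `__post_init__`, and the `mean_pain` helper with its `min_relevance` default are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class WtpTier(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    NONE = "none"


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    MIXED = "mixed"
    NEUTRAL = "neutral"


class UseCase(str, Enum):  # value spellings are assumed
    MARKET_RESEARCH = "market_research"
    LEAD_GEN = "lead_gen"
    BRAND_MONITORING = "brand_monitoring"
    CONTENT = "content"
    OTHER = "other"


@dataclass
class ThreadClassification:
    pain_signal: int                              # 0-100
    wtp_tier: WtpTier
    tools_mentioned: list[str]
    sentiment_toward_tools: dict[str, Sentiment]  # one entry per tool
    primary_use_case: UseCase
    relevance_score: int                          # 0-10
    key_quotes: list[str]
    summary: str

    def __post_init__(self):
        # Reject out-of-range scores so aggregates stay comparable.
        if not 0 <= self.pain_signal <= 100:
            raise ValueError("pain_signal must be in 0-100")
        if not 0 <= self.relevance_score <= 10:
            raise ValueError("relevance_score must be in 0-10")
        unknown = set(self.sentiment_toward_tools) - set(self.tools_mentioned)
        if unknown:
            raise ValueError(f"sentiment given for unlisted tools: {sorted(unknown)}")


def mean_pain(rows, min_relevance=7):
    """Average pain_signal over threads relevant enough to count (hypothetical helper)."""
    scores = [r.pain_signal for r in rows if r.relevance_score >= min_relevance]
    return sum(scores) / len(scores) if scores else None
```

Because every thread yields the same typed fields, a roll-up such as `mean_pain` is a one-line aggregate rather than a second round of reading prose.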

Cost & time

What a run costs, stage by stage.

Stage                        Time                         Cost
Fetch threads                2–3 min                      Free
Filter (rule-based prune)    < 1 sec                      Free
Classify with AI             10–15 min · ~300 threads     A few dollars
Render the report            < 1 sec                      Free

Cost scales with the number of threads you classify and with how many comments you pull per thread: more comments mean richer quotes and a slightly higher run cost. Classification is the only paid step, at a few dollars for roughly 300 threads, and each thread takes only seconds to process.
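
As a rough picture of how such an estimate might be computed, the sketch below treats classification as the only paid stage and scales it by thread count and comments per thread. The two rates are placeholder numbers chosen only so the example lands in the "few dollars" range; they are not the pipeline's real pricing, and `estimate_cost` is a hypothetical name.

```python
def estimate_cost(threads, comments_per_thread,
                  base_per_thread=0.005, per_comment=0.0002):
    """Estimate run cost in dollars.

    Fetch, filter, and render are free, so only classification is priced.
    Both rates are illustrative placeholders, not published prices.
    """
    return threads * (base_per_thread + comments_per_thread * per_comment)


# Doubling the comments pulled per thread raises the cost, but by less than 2x.
print(round(estimate_cost(300, 20), 2))  # 2.7
print(round(estimate_cost(300, 40), 2))  # 3.9
```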

Built to survive Reddit’s API

The foundation that outlasts API changes.

Public data only

Every thread comes from Reddit’s public data — nothing to authenticate, no quota approval to wait on.

Resumable & rate-limit tolerant

A 429 or dropped connection mid-run is caught and retried rather than losing progress. A top-N cap bounds exactly how many threads get classified — the main cost lever.
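
A common way to get this behavior is exponential backoff around each call plus a checkpoint file written after every thread. The sketch below is an assumption about the general shape, not the pipeline's implementation: `RateLimited`, `with_retry`, `run`, the JSON checkpoint format, and the backoff schedule are all hypothetical.

```python
import json
import time
from pathlib import Path


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response (hypothetical)."""


def with_retry(fn, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on 429s and dropped connections with backoff."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except (RateLimited, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def run(thread_ids, classify, checkpoint: Path, top_n, sleep=time.sleep):
    """Classify at most top_n threads, skipping any already in the checkpoint."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for tid in thread_ids[:top_n]:  # the top-N cap bounds the paid work
        if tid in done:
            continue  # resumed run: this thread was finished earlier
        done[tid] = with_retry(classify, tid, sleep=sleep)
        checkpoint.write_text(json.dumps(done))  # persist progress per thread
    return done
```

Writing the checkpoint after each thread means a stopped or crashed run loses at most the thread that was in flight, and rerunning `run` with the same file picks up where it left off.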

Not exposed to API pricing

That public-data foundation is why the pipeline isn’t exposed to the pricing changes that shut down other Reddit tools.

How we test that it holds up

We measure the parts that could quietly fail.

Not assertions — reproducible scripts anyone can re-run, with caveats published next to the numbers.

90/100

AI-suggested subreddits are real

We probed 100 suggestions across ten domains: 90 were live, public communities and only one was a hallucinated name.


3 in 4

Exact-phrase search wins

A quoted-vs-tokenized A/B found roughly three in four multi-word phrases return enough exact-match results to use directly.
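
The decision that finding supports can be sketched as a small fallback rule: try the quoted phrase first and drop to a tokenized query only when exact matches are too few. The `count_results` callable and the `min_exact` threshold are assumptions for illustration; the page does not state what counts as "enough" results.

```python
def build_query(phrase, count_results, min_exact=10):
    """Prefer an exact-phrase query; fall back to plain tokens if it is too sparse.

    count_results is a hypothetical callable that returns how many results
    a query string yields. min_exact is an illustrative threshold.
    """
    words = phrase.split()
    if len(words) < 2:
        return phrase  # quoting a single word changes nothing
    quoted = f'"{phrase}"'
    if count_results(quoted) >= min_exact:
        return quoted
    return " ".join(words)  # tokenized fallback
```

Given a lookup where `"first 100 customers"` returns plenty of exact matches, `build_query` keeps the quotes; for a phrase with only a handful of exact hits it returns the unquoted words instead.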


The caveats stay in view. Small samples, and a search ranker that shifts over time — we publish those next to the numbers rather than burying them.

Maintained by Bhupendra Singh Chauhan

Founder, Reddit Research Pipeline
