How the pipeline actually works
No black box. Here is exactly how a plain-English claim becomes a ranked, sourced report — the steps, the scoring, the cost, and how we test that it holds up.
From a claim to a report in six steps
1. State the hypothesis
   You write your claim in plain English, e.g. “indie founders struggle to get their first 100 customers.”
2. Clarify the scope
   A short set of AI clarifying questions sharpens who you mean and what counts, so the search does not drift.
3. Suggest & tier subreddits
   Candidate subreddits are validated against r/<sub>/about.json: dead, private, or sub-500-member communities are dropped, and the rest are tiered Bullseye / Decent / Off-topic.
4. Mine real search queries
   Queries are built from the actual title phrases people post in your chosen subs, with frequency counts, so you search the way your audience writes, not the way you guess they do.
5. Set the pipeline knobs
   Choose how many threads to classify and how many run in parallel; a live cost estimate updates as you adjust, so there are no surprises.
6. Review & launch
   Watch live logs, a running cost meter, and a per-thread status grid as the run executes. Stop and resume any time; runs are resumable.
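Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the `subreddit_type` and `subscribers` field names are assumptions about the about.json payload, and the three-word phrase length is an arbitrary choice for the example.

```python
from collections import Counter

MIN_MEMBERS = 500  # the member threshold named in step 3


def is_usable_subreddit(about):
    """Decide whether a parsed about.json payload passes the step-3 filter.

    `about` is the decoded JSON dict, or None when the request failed
    (treated here as a dead subreddit). Field names are assumptions.
    """
    if not about or "data" not in about:
        return False
    data = about["data"]
    if data.get("subreddit_type") != "public":
        return False  # private or otherwise closed community
    return (data.get("subscribers") or 0) >= MIN_MEMBERS


def mine_title_phrases(titles, n=3, top=5):
    """Count n-word phrases across post titles and return the most frequent."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(top)
```

Keeping the filter a pure function over already-fetched JSON makes it easy to test without touching the network; the tiering into Bullseye / Decent / Off-topic is left out because the source does not describe how it is decided.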
What each thread is scored on
Every thread is classified into a fixed schema so the fields actually aggregate across hundreds of posts instead of becoming unique per-thread prose:
- pain_signal — 0–100 intensity of the frustration in the thread
- wtp_tier — willingness to pay, bucketed high / medium / low / none
- tools_mentioned[] — the products and services named in the discussion
- sentiment_toward_tools — positive / negative / mixed / neutral, per tool
- primary_use_case — market research, lead gen, brand monitoring, content ideation, or other
- relevance_score — 0–10 match between the thread and your claim
- key_quotes, best_quote_from_OP, best_quote_from_top_reply — verbatim, with links back to source
- summary — a one-line, plain-English gist of the thread
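One way to picture the fixed schema is as a typed record that rejects out-of-range values. The sketch below is an assumption about shape only: the class name `ThreadScore`, the Python types, and the validation rules are illustrative, while the field names and allowed values come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WTP_TIERS = {"high", "medium", "low", "none"}
SENTIMENTS = {"positive", "negative", "mixed", "neutral"}
USE_CASES = {"market research", "lead gen", "brand monitoring",
             "content ideation", "other"}


@dataclass
class ThreadScore:
    pain_signal: int          # 0-100
    wtp_tier: str             # one of WTP_TIERS
    primary_use_case: str     # one of USE_CASES
    relevance_score: float    # 0-10 (int vs float is an assumption)
    summary: str
    tools_mentioned: List[str] = field(default_factory=list)
    sentiment_toward_tools: Dict[str, str] = field(default_factory=dict)
    key_quotes: List[str] = field(default_factory=list)
    best_quote_from_OP: str = ""
    best_quote_from_top_reply: str = ""

    def __post_init__(self):
        # Reject values outside the fixed vocabulary so fields stay aggregable.
        if not 0 <= self.pain_signal <= 100:
            raise ValueError("pain_signal must be within 0-100")
        if not 0 <= self.relevance_score <= 10:
            raise ValueError("relevance_score must be within 0-10")
        if self.wtp_tier not in WTP_TIERS:
            raise ValueError(f"unknown wtp_tier: {self.wtp_tier}")
        if self.primary_use_case not in USE_CASES:
            raise ValueError(f"unknown primary_use_case: {self.primary_use_case}")
        for tool, sentiment in self.sentiment_toward_tools.items():
            if sentiment not in SENTIMENTS:
                raise ValueError(f"unknown sentiment for {tool}: {sentiment}")


def mean_pain(scores):
    """Average pain_signal across threads; possible only because the field is numeric."""
    return sum(s.pain_signal for s in scores) / len(scores)
```

Because every record uses the same bounded fields, an aggregate such as `mean_pain` is a one-liner rather than a text-parsing problem.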
What a run costs and how long it takes
Cost scales with comments per thread: ~$0.13 at five comments each, ~$0.30–0.40 at fifty to a hundred. Gemini 2.5 Flash is billed at $0.30 per million input tokens and $2.50 per million output tokens, and each thread takes roughly 11 seconds to classify.
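The arithmetic behind an estimate like this is simple enough to sketch. The per-token prices and the 11-second figure are the ones quoted above; the function names are hypothetical, the token counts you pass in are your own guesses, and the runtime formula assumes ideal parallelism with no rate-limit stalls.

```python
import math

INPUT_USD_PER_MILLION = 0.30    # price quoted above for input tokens
OUTPUT_USD_PER_MILLION = 2.50   # price quoted above for output tokens
SECONDS_PER_THREAD = 11         # rough per-thread classification time


def estimate_cost_usd(input_tokens, output_tokens):
    """Dollar cost for a given number of input and output tokens."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


def estimate_runtime_seconds(threads, parallel):
    """Wall-clock estimate, assuming threads run in full batches of `parallel`."""
    batches = math.ceil(threads / parallel)
    return batches * SECONDS_PER_THREAD
```

For example, `estimate_runtime_seconds(100, 10)` treats 100 threads as ten batches of ten. Real runs will differ once retries and uneven thread sizes come into play.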
Built to survive Reddit’s API
Every thread comes from Reddit’s public JSON endpoints — no OAuth, no API key, no quota approval to wait on.
Runs are resumable and tolerate rate-limiting: a 429 or a dropped connection mid-run is caught and retried rather than losing progress. A top-N cap lets you bound exactly how many threads get classified, which is the main cost lever.
That public-JSON foundation is also why the pipeline is not exposed to the API-pricing changes that have shut down other Reddit research tools.
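The retry-and-resume behaviour described in this section can be sketched as follows. Everything here is illustrative: `RateLimited`, `fetch_with_retry`, and `run` are hypothetical names, exponential backoff is one common retry strategy rather than a confirmed detail of the pipeline, and the `fetch` callable stands in for whatever HTTP client is actually used.

```python
import time


class RateLimited(Exception):
    """Hypothetical error a fetcher raises when it receives HTTP 429."""


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on rate limits and dropped connections."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except (RateLimited, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def run(urls, fetch, done, top_n, sleep=time.sleep):
    """Process at most top_n urls, skipping any already recorded in `done`.

    `done` maps url -> result and doubles as the resume checkpoint: passing
    the dict from an interrupted run continues where it stopped.
    """
    for url in urls[:top_n]:
        if url in done:
            continue
        done[url] = fetch_with_retry(fetch, url, sleep=sleep)
    return done
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; persisting `done` to disk between runs is left out for brevity.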
How we test that it holds up
We do not just assert the pipeline works — we measure the parts that could quietly fail, with scripts anyone can re-run.
When we needed to know whether the wizard’s AI-suggested subreddits could be trusted, we probed 100 suggestions across ten domains: 90 were live, public communities and only one was a hallucinated name. When we needed to know whether exact-phrase Reddit search beats loose keyword matching, we ran a quoted-vs-tokenized A/B and found roughly three in four multi-word phrases return enough exact-match results to use directly.
Both tests live as reproducible scripts, and both carry caveats we keep in view — small samples, and a search ranker that shifts over time. We publish those caveats next to the numbers rather than burying them.
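To make the quoted-vs-tokenized comparison concrete, here is a small sketch of how the two query variants differ. It is not the published test script: the function names are hypothetical, the `min_hits` threshold is an invented placeholder for "enough exact-match results", and the URL shape assumes Reddit's public search.json endpoint with its `q`, `restrict_sr`, and `limit` parameters.

```python
from urllib.parse import urlencode


def search_url(subreddit, phrase, quoted, limit=25):
    """Build a subreddit-scoped search URL for an exact or loose query."""
    q = f'"{phrase}"' if quoted else phrase  # quotes request an exact phrase
    params = {"q": q, "restrict_sr": 1, "limit": limit}
    return f"https://www.reddit.com/r/{subreddit}/search.json?{urlencode(params)}"


def pick_variant(quoted_hits, min_hits=5):
    """Prefer the exact-phrase query when it returns enough results."""
    return "quoted" if quoted_hits >= min_hits else "tokenized"
```

The only difference between the two requests is the pair of quotation marks around the phrase, which is what makes the A/B comparison cheap to run.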
Maintained by Bhupendra Singh Chauhan · Founder, Reddit Research Pipeline.
Keep reading
Validate your startup idea with evidence, not optimism
Test a SaaS or startup idea against real demand before you build.
How we score a thread for pain and willingness to pay
The exact schema that turns a messy Reddit thread into a rankable pain and willingness-to-pay score.
Does AI hallucinate subreddit names? We tested 100
We probed 100 AI-suggested subreddits against Reddit’s own API. 90 were real. Here is the 1% that wasn’t.
Quoted vs tokenized Reddit search: an A/B test
Loose keyword search buries you in off-topic posts. We A/B-tested exact-phrase search: 3 in 4 multi-word queries were better off quoted.