About This Project
Endo Central is a community-sourced directory of endometriosis specialists, built by mining patient experiences from Reddit. The goal is to help patients make more informed decisions by providing transparent doctor reviews, each backed by the actual posts it is drawn from.
How It Works
Data Collection
We scanned 332,000 Reddit posts across 10 endometriosis-related subreddits, including r/Endo, r/endometriosis, r/hysterectomy, r/adenomyosis, and r/pelvicfloor.
Doctor Discovery
Doctors are identified using LLM-based extraction that reads posts in context, which produced 16,799 doctor mentions. Each doctor is then checked against the NPI (National Provider Identifier) registry to confirm their identity, location, credentials, and specialties. In total, 275 doctors were matched with verified NPI data.
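The NPI check can be pictured with a minimal sketch like the one below. It assumes the public NPPES NPI Registry API (version 2.1) and its documented query parameters. The helper names, the matching rule (exact last name plus state, and exactly one candidate), and the response field names used here are illustrative assumptions, not the project's actual pipeline.

```python
from urllib.parse import urlencode

# Public NPPES NPI Registry endpoint (assumed here; the project may query it differently).
NPI_API = "https://npiregistry.cms.hhs.gov/api/"


def build_npi_query(first_name, last_name, state=None, limit=10):
    """Build a lookup URL for one extracted doctor name (hypothetical helper)."""
    params = {
        "version": "2.1",
        "first_name": first_name,
        "last_name": last_name,
        "limit": limit,
    }
    if state:
        params["state"] = state
    return f"{NPI_API}?{urlencode(params)}"


def pick_verified_match(response, last_name, state):
    """Return the only result matching last name and state, else None.

    `response` is the decoded JSON body. The keys read below ("results",
    "basic", "addresses", "state") reflect our understanding of the API's
    response shape and should be checked against the official docs.
    """
    matches = []
    for result in response.get("results", []):
        basic = result.get("basic", {})
        states = {addr.get("state") for addr in result.get("addresses", [])}
        if basic.get("last_name", "").upper() == last_name.upper() and state in states:
            matches.append(result)
    # Assumption: an ambiguous lookup (zero or several candidates) stays unverified.
    return matches[0] if len(matches) == 1 else None
```

Treating ambiguous lookups as unverified is a conservative choice: it is one way a doctor with a common last name could end up without a confirmed identity.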
Sentiment Classification
Each patient's experience is classified as positive, negative (minor or major), or "mentioned only" (no evaluative content) using LLM-based sentiment analysis that reads the full post in context. Posts that mention a doctor without expressing any opinion, such as scheduling updates or factual questions, are kept separate from actual reviews.
Patient vs. Other Users
Posts are classified as "patient experiences" if they contain first-person language indicating direct interaction with the doctor. Other mentions (recommendations, questions, secondhand reports) are shown separately.
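To make "first-person language" concrete, here is a deliberately simple keyword sketch. The patterns and the function name are hypothetical and chosen only for illustration. The project's real classifier may be LLM-based or use entirely different rules.

```python
import re

# Hypothetical phrases that suggest the author dealt with the doctor directly.
# A real classifier would need far broader coverage than this short list.
FIRST_PERSON_PATTERNS = [
    r"\bmy (surgeon|doctor|gyn|obgyn|specialist)\b",
    r"\bi (saw|went to|met with|was (treated|seen|operated on) by)\b",
    r"\bi had (my )?(surgery|excision|lap|consult\w*|appointment) with\b",
    r"\b(he|she|they) (operated on|diagnosed|treated|dismissed) me\b",
]
_FIRST_PERSON = re.compile("|".join(FIRST_PERSON_PATTERNS), re.IGNORECASE)


def is_patient_experience(text: str) -> bool:
    """Return True if the post contains any of the first-person patterns."""
    return _FIRST_PERSON.search(text) is not None


print(is_patient_experience("I saw Dr. Smith last year."))            # True
print(is_patient_experience("Has anyone heard of Dr. Smith?"))        # False
print(is_patient_experience("My friend recommended Dr. Smith."))      # False
```

Keyword rules like these miss paraphrases and can misfire on quoted text, which is one reason a classifier that reads the post in context is the more robust option.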
Approval Rates
Approval rates are calculated only from patients who left an actual review ("mentioned only" posts are excluded): the approval rate is the percentage of those patients who had a positive experience. Each Reddit user is counted once per doctor, regardless of how many posts they made.
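The calculation can be sketched as follows. The label strings and the function name are assumptions made for this example. The rule for users whose posts disagree (any negative post makes the user count as negative) is also an assumption, since the description above only says each user is counted once.

```python
from collections import defaultdict

# Assumed label names; the project's internal labels may differ.
POSITIVE = "positive"
MENTION_ONLY = "mentioned_only"


def approval_rate(posts):
    """Compute the approval rate for one doctor.

    `posts` is an iterable of (user, sentiment) pairs taken from patient
    experiences. Returns a fraction between 0 and 1, or None when nobody
    left an actual review.
    """
    labels_by_user = defaultdict(set)
    for user, sentiment in posts:
        if sentiment != MENTION_ONLY:  # drop posts with no evaluative content
            labels_by_user[user].add(sentiment)
    if not labels_by_user:
        return None
    # Assumption: a user counts as positive only if all of their reviews are positive.
    positive_users = sum(1 for labels in labels_by_user.values() if labels == {POSITIVE})
    return positive_users / len(labels_by_user)


posts = [
    ("u1", "positive"),
    ("u1", "positive"),        # same user, counted once
    ("u2", "negative_major"),
    ("u3", "mentioned_only"),  # excluded from the denominator
    ("u4", "positive"),
]
print(approval_rate(posts))  # 2 positive users out of 3 reviewers
```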
Known Limitations
- Reddit users skew younger and more tech-savvy, and are more willing to share negative experiences
- Patients with extreme experiences are more likely to post
- Some doctors with common last names may still share a page if identity could not be verified
- LLM-based classification is highly accurate but not perfect, so always read the actual posts
- Small sample sizes (under ~20 patients) should be interpreted with caution
- Data is a snapshot in time and may not reflect a doctor's current practice
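To illustrate the small-sample caution above, the sketch below computes a Wilson score interval for an approval rate. This is a standard statistical technique brought in here purely for illustration. It is not something the site is stated to display, and the function name is our own.

```python
import math


def wilson_interval(positive, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = positive / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# The same 90% approval rate, at two sample sizes.
print(wilson_interval(9, 10))    # wide: roughly 0.60 to 0.98
print(wilson_interval(90, 100))  # narrower: roughly 0.83 to 0.94
```

With only 10 reviewers, a 90% approval rate is statistically compatible with a true rate anywhere from about 60% to 98%, which is why small samples deserve caution.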
Privacy
Reddit usernames have been anonymized ("Patient 1", "Patient 2", etc.). Each post includes a "View on Reddit" link to the original public post for verification purposes. All source data comes from publicly accessible Reddit posts.
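One simple way to produce such labels is sketched below. It assumes labels are assigned in order of first appearance within a single doctor's page. The function name and the per-page numbering are illustrative assumptions.

```python
def anonymize_users(usernames):
    """Map each distinct username to "Patient N" in order of first appearance."""
    labels = {}
    for name in usernames:
        if name not in labels:
            labels[name] = f"Patient {len(labels) + 1}"
    return labels


# Hypothetical usernames; repeat posters keep the same label.
print(anonymize_users(["user_a", "user_b", "user_a", "user_c"]))
# {'user_a': 'Patient 1', 'user_b': 'Patient 2', 'user_c': 'Patient 3'}
```

Because each post still links to the original public post, these labels reduce casual exposure of usernames on the site rather than making authors untraceable.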
Open Source
This project was built as a personal research tool. The data pipeline, analysis scripts, and web platform are open source.