NLP Text Analytics
NLP-Powered Review Analysis
Steam gives players a thumbs-up/down button. But "I recommend this game" doesn't tell you what players actually like, what they complain about, or whether sentiment looks different across genres. I built an NLP pipeline over 2,382 game reviews: VADER sentiment scores validated against Steam's own recommendation flag, BERTopic topic clusters, and a dashboard that filters everything through SQL.
The 62.8% agreement rate between VADER and the recommendation flag is the number that drove most of the analysis. It looks low until you look at why they disagree.
Source Code
Code for the project is at https://github.com/ShameekConyers/nlp_dashboard.
Dashboard
Why Steam reviews
I needed a text corpus with a built-in ground truth label. Most review datasets have star ratings, but those are ordinal and noisy. Steam's binary recommendation flag (thumbs up or down) gives you a clean positive/negative label that every reviewer explicitly chose. That means you can validate sentiment scores against something real, not just eyeball them.
The Steam Reviews API is public with zero authentication. No API key, no OAuth, no signup. That matters for a portfolio project that needs to work straight from a fresh clone. I pulled 200 reviews each from 12 games across 4 genres (FPS, Indie, RPG, Strategy) and committed the 2.3 MB seed database to Git. The dashboard runs immediately.
Half the data was garbage
This is the finding I'd lead with in an interview. Of 2,382 reviews pulled from Steam, 1,181 had fewer than 5 tokens after preprocessing. They were emoji runs, single-word jokes ("yes"), ASCII art, and Steam's equivalent of shitposting. Nearly 50% of the corpus.
The preprocessing pipeline handles this in two stages. First, structural cleaning: HTML tags, BBCode markup, URLs, emails, and encoding artifacts (mojibake) get stripped. Then spaCy tokenizes and lemmatizes the remaining text, and a domain-specific stopword list removes words like "game," "play," "player," and "steam" that appear in every review and bury real topic signal.
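Stage one can be sketched with stdlib regexes. The patterns below are illustrative, not the project's actual ones; note that BBCode is stripped before HTML, matching the ordering the pipeline uses:

```python
import re

# Illustrative patterns for the structural-cleaning stage.
BBCODE = re.compile(r"\[/?[a-zA-Z]+(?:=[^\]]*)?\]")  # [b], [/b], [url=...]
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")

def structural_clean(text: str) -> str:
    text = BBCODE.sub(" ", text)    # BBCode first, before HTML removal
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string then goes to spaCy for tokenization and lemmatization.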
Reviews that survive cleaning but produce fewer than 5 tokens get flagged as sentinels. They stay in the database with a dropped=True flag and a reason, but they don't reach sentiment scoring or topic modeling. VADER scores short text erratically — a single "great" scores 0.66, "ok" scores 0.0 — and BERTopic can't place a 2-token document in any cluster.
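The sentinel check itself is tiny. A minimal sketch, where the threshold comes from the write-up but the field names are illustrative, not the real schema:

```python
MIN_TOKENS = 5  # reviews below this are flagged as sentinels

def sentinel_flags(tokens: list[str]) -> dict:
    # Short reviews stay in the database but are excluded from
    # sentiment scoring and topic modeling downstream.
    if len(tokens) < MIN_TOKENS:
        return {"dropped": True, "drop_reason": f"short_text ({len(tokens)} tokens)"}
    return {"dropped": False, "drop_reason": None}
```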
Without the sentinel filter, the topic model produced 15 noisy clusters. With it, 11 topics from 1,201 reviews that actually had something to say. Throwing out half the data was the single best decision in the pipeline.
The 62.8% question
VADER sentiment agrees with the Steam recommendation flag 62.8% of the time across the 1,201 kept reviews. The instinct is to call that poor. But look at the disagreements.
A player writes three paragraphs complaining about netcode, matchmaking, and server stability, then clicks thumbs-up because the core gameplay loop is fun. VADER reads the text and scores it negative. The recommendation flag says positive. They're both right. The flag captures overall disposition ("would I tell a friend to buy this?") and VADER captures what the text actually says. Those are different things.
The more useful question is whether VADER's scores track relative differences. Games that players complain about more do get lower compound scores. Positive-heavy genres (Indie, RPG) do score higher than competitive genres (FPS). VADER captures the direction. It just doesn't map cleanly onto a binary flag that was never designed to measure sentiment.
I ran a threshold sweep to see if a better cutoff would close the gap. It didn't. Optimizing the compound score boundary improves accuracy by 1-2 percentage points. The disagreement is a label problem, not a tuning problem.
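A hedged, dependency-free reconstruction of that sweep (the grid of cutoffs here is my assumption, not the script's actual grid):

```python
def threshold_sweep(compound, recommended):
    """Sweep the VADER compound cutoff and return the best
    (threshold, accuracy) pair against the recommendation flag."""
    thresholds = [round(-0.5 + 0.05 * i, 2) for i in range(21)]
    best_t, best_acc = None, 0.0
    for t in thresholds:
        correct = sum((c >= t) == r for c, r in zip(compound, recommended))
        acc = correct / len(compound)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

On the real data no threshold in the sweep closes the gap, which is the point: the disagreement lives in the labels, not the cutoff.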
Topic modeling
BERTopic with sentence-transformer embeddings (all-MiniLM-L6-v2) and UMAP dimensionality reduction extracted 11 topics from the 1,201 kept reviews.
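As a configuration sketch, the setup looks roughly like this; the hyperparameters below (neighbors, components, `min_topic_size`) are assumptions, not the project's actual values:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Embeddings from the model named in the write-up.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# UMAP parameters here are illustrative defaults, not the real config.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    min_topic_size=20,  # assumed; small corpora need a floor here
)
# topics, _ = topic_model.fit_transform(kept_review_texts)
```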
The English-language topics are genre-flavored. FPS reviews cluster around gameplay mechanics (combat, movement, weapons). RPG reviews pull in narrative and world-building vocabulary. Indie reviews anchor on "cozy" terms and accessibility language. Players in different genres care about different things, and the topic model picks that up.
The problem is that several topics are non-English clusters. The pipeline has no language filter, so Polish, Russian, German, and Turkish reviews got grouped by shared vocabulary rather than shared meaning. BERTopic did its job (cluster similar documents), but "similar" in this case means "written in the same language," not "about the same theme."
With 1,201 reviews across 12 games, BERTopic is also working near the lower bound for stable topic extraction. More data per game would sharpen the topic boundaries and reduce the outlier rate.
Preprocessing decisions
Each review gets stored two ways. cleaned_text preserves capitalization and punctuation because VADER uses those as signals — ALL CAPS amplifies sentiment, exclamation marks boost intensity. tokens is the aggressive version: lowercased, lemmatized, stopwords removed, built for BERTopic.
One preprocessing pipeline, two outputs. VADER gets the text it needs and BERTopic gets the tokens it needs. The schema enforces this split so there's no ambiguity about which representation feeds which model.
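For one hypothetical review, the two stored representations look like this (the text and token list are invented for illustration):

```python
raw = "The combat is AMAZING!!! [b]10/10[/b] would play again"

record = {
    # For VADER: markup stripped, caps and punctuation kept as
    # intensity signals.
    "cleaned_text": "The combat is AMAZING!!! 10/10 would play again",
    # For BERTopic: lowercased, lemmatized, stopwords and domain
    # words ("play") removed.
    "tokens": ["combat", "amazing"],
}
```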
BBCode removal runs before HTML tag removal. Steam reviews use BBCode markup ([b], [i], [url=...]) that can contain HTML-like characters. Running HTML removal first would split BBCode tags incorrectly and leave fragments in the text. The order matters.
I removed domain-specific stopwords (game, play, player, steam) because they appeared in nearly every review and dominated topic labels without carrying any information. The word "game" tells you nothing about what makes one cluster different from another. Take it out and the real terms show up.
Architecture
```
Steam API → data_pull.py → data/raw/*.json → db_setup.py → SQLite
                                                             │
                           preprocessing.py → sentiment.py → topics.py
                                                             │
                                                      dashboard/app.py
```
The database is SQLite with five tables:
| Table | What's in it |
|---|---|
| metadata | 12 games with app_id, name, and genre |
| documents | 2,382 raw reviews with recommendation flag, playtime, timestamps |
| processed_documents | Cleaned text, token arrays (JSON), token counts, sentinel flags |
| nlp_results | VADER compound/pos/neg/neu scores and topic assignments per review |
| topics | 11 topic labels with top words and document counts |
Every pipeline step uses INSERT OR IGNORE keyed on document ID. The full pipeline is idempotent. Run it twice, get the same database.
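The pattern is easy to demonstrate with an in-memory database (table and column names here are illustrative, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, body TEXT)")

def upsert(doc_id: str, body: str) -> None:
    # Duplicate doc_ids are silently skipped, so re-running the
    # pipeline re-issues the same inserts as no-ops.
    conn.execute(
        "INSERT OR IGNORE INTO documents (doc_id, body) VALUES (?, ?)",
        (doc_id, body),
    )

upsert("r1", "great game")
upsert("r1", "great game")  # second run: ignored, not duplicated
count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```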
Dashboard queries hit SQLite directly. All sidebar filters (genre, game, sentiment range, topic, date range) push down to SQL WHERE clauses through parameterized queries. No DataFrames are loaded into memory and filtered client-side.
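A sketch of that pushdown pattern, assuming a flattened `reviews` table for brevity (the real schema joins several tables, and these column names are illustrative):

```python
import sqlite3

def query_reviews(conn, genre=None, min_compound=None):
    """Each active filter becomes a WHERE clause; every user-supplied
    value goes through a ? placeholder, never string formatting."""
    clauses, params = ["1 = 1"], []
    if genre is not None:
        clauses.append("genre = ?")
        params.append(genre)
    if min_compound is not None:
        clauses.append("compound >= ?")
        params.append(min_compound)
    sql = "SELECT doc_id FROM reviews WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()
```

SQLite does the filtering, so the dashboard never materializes the full table in memory.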
What I'd do differently
Add language detection. A 20-line fix with langdetect would kill the non-English topic clusters. I documented the limitation instead of fixing it because I was shipping the simple iteration on a deadline. It's the obvious next improvement.
Pull more data. 200 reviews per game is enough to demonstrate the pipeline but not enough for BERTopic to produce stable topics. 1,000+ per game would sharpen the topic boundaries.
Swap VADER for a transformer. VADER was the right choice for the simple iteration — fast, interpretable, no GPU needed. But a fine-tuned DistilBERT trained on the recommendation flag would handle sarcasm, mixed-sentiment paragraphs, and gaming jargon that a lexicon can't touch.
Quick start
```
git clone https://github.com/ShameekConyers/nlp_dashboard.git
cd nlp_dashboard
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python -m spacy download en_core_web_sm
.venv/bin/streamlit run dashboard/app.py
```
No API key needed. The seed database ships with all 2,382 reviews, preprocessing results, sentiment scores, and topic assignments.
Tools: Python, SQLite, spaCy, NLTK (VADER), BERTopic, sentence-transformers, UMAP, Streamlit, Plotly, pytest