NLP Text Analytics
NLP-Powered Review Analysis
Steam gives players a thumbs-up/down button. But "I recommend this game" doesn't tell you what players actually like, what they complain about, or whether sentiment looks different across genres. I built an NLP pipeline over 2,382 game reviews: VADER sentiment scores validated against Steam's own recommendation flag, BERTopic topic clusters, and a dashboard that filters everything through SQL.
The 62.8% agreement rate between VADER and the recommendation flag is the number that drove most of the analysis. It looks low until you look at why they disagree.
Source Code
Code for the project is at https://github.com/ShameekConyers/nlp_dashboard.
Dashboard
Why Steam reviews
I needed a text corpus with a built-in ground truth label. Most review datasets have star ratings, but those are ordinal and noisy. Steam's binary recommendation flag (thumbs up or down) gives you a clean positive/negative label that every reviewer explicitly chose. That means you can validate sentiment scores against something real, not just eyeball them.
The Steam Reviews API is public with zero authentication. No API key, no OAuth, no signup. That matters for a portfolio project that needs to work straight from a fresh clone. I pulled 200 reviews each from 12 games across 4 genres (FPS, Indie, RPG, Strategy) and committed the 2.3 MB seed database to Git. The dashboard runs immediately.
Half the data was garbage
This is the finding I'd lead with in an interview. Of 2,382 reviews pulled from Steam, 1,181 had fewer than 5 tokens after preprocessing. They were emoji runs, single-word jokes ("yes"), ASCII art, and Steam's equivalent of shitposting. Nearly 50% of the corpus.
The preprocessing pipeline handles this in two stages. First, structural cleaning: HTML tags, BBCode markup, URLs, emails, and encoding artifacts (mojibake) get stripped. Then spaCy tokenizes and lemmatizes the remaining text, and a domain-specific stopword list removes words like "game," "play," "player," and "steam" that appear in every review and bury real topic signal.
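Stage one can be sketched with stdlib regexes. The patterns below are illustrative, not the project's actual ones; note that BBCode is stripped before HTML, matching the ordering the pipeline uses:

```python
import re

# Illustrative patterns for the structural-cleaning stage.
BBCODE = re.compile(r"\[/?[a-zA-Z]+(?:=[^\]]*)?\]")  # [b], [/b], [url=...]
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")

def structural_clean(text: str) -> str:
    text = BBCODE.sub(" ", text)    # BBCode first, before HTML removal
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = EMAIL.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string then goes to spaCy for tokenization and lemmatization.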
Reviews that survive cleaning but produce fewer than 5 tokens get flagged as sentinels. They stay in the database with a dropped=True flag and a reason, but they don't reach sentiment scoring or topic modeling. VADER scores short text erratically — a single "great" scores 0.66, "ok" scores 0.0 — and BERTopic can't place a 2-token document in any cluster.
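The sentinel check itself is tiny. A minimal sketch, where the threshold comes from the write-up but the field names are illustrative, not the real schema:

```python
MIN_TOKENS = 5  # reviews below this are flagged as sentinels

def sentinel_flags(tokens: list[str]) -> dict:
    # Short reviews stay in the database but are excluded from
    # sentiment scoring and topic modeling downstream.
    if len(tokens) < MIN_TOKENS:
        return {"dropped": True, "drop_reason": f"short_text ({len(tokens)} tokens)"}
    return {"dropped": False, "drop_reason": None}
```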
Without the sentinel filter, the topic model produced 15 noisy clusters. With it, 11 topics from 1,201 reviews that actually had something to say. Throwing out half the data was the single best decision in the pipeline.
The 62.8% question
VADER sentiment agrees with the Steam recommendation flag 62.8% of the time across the 1,201 kept reviews. The instinct is to call that poor. But look at the disagreements.
A player writes three paragraphs complaining about netcode, matchmaking, and server stability, then clicks thumbs-up because the core gameplay loop is fun. VADER reads the text and scores it negative. The recommendation flag says positive. They're both right. The flag captures overall disposition ("would I tell a friend to buy this?") and VADER captures what the text actually says. Those are different things.
The more useful question is whether VADER's scores track relative differences. Games that players complain about more do get lower compound scores. Positive-heavy genres (Indie, RPG) do score higher than competitive genres (FPS). VADER captures the direction. It just doesn't map cleanly onto a binary flag that was never designed to measure sentiment.
I ran a threshold sweep to see if a better cutoff would close the gap. It didn't. Optimizing the compound score boundary improves accuracy by 1-2 percentage points. The disagreement is a label problem, not a tuning problem.
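A hedged, dependency-free reconstruction of that sweep (the grid of cutoffs here is my assumption, not the script's actual grid):

```python
def threshold_sweep(compound, recommended):
    """Sweep the VADER compound cutoff and return the best
    (threshold, accuracy) pair against the recommendation flag."""
    thresholds = [round(-0.5 + 0.05 * i, 2) for i in range(21)]
    best_t, best_acc = None, 0.0
    for t in thresholds:
        correct = sum((c >= t) == r for c, r in zip(compound, recommended))
        acc = correct / len(compound)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

On the real data no threshold in the sweep closes the gap, which is the point: the disagreement lives in the labels, not the cutoff.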
Topic modeling
BERTopic with sentence-transformer embeddings (all-MiniLM-L6-v2) and UMAP dimensionality reduction extracted 11 topics from the 1,201 kept reviews.
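As a configuration sketch, the setup looks roughly like this; the hyperparameters below (neighbors, components, `min_topic_size`) are assumptions, not the project's actual values:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Embeddings from the model named in the write-up.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# UMAP parameters here are illustrative defaults, not the real config.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)

topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    min_topic_size=20,  # assumed; small corpora need a floor here
)
# topics, _ = topic_model.fit_transform(kept_review_texts)
```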
The English-language topics are genre-flavored. FPS reviews cluster around gameplay mechanics (combat, movement, weapons). RPG reviews pull in narrative and world-building vocabulary. Indie reviews anchor on "cozy" terms and accessibility language. Players in different genres care about different things, and the topic model picks that up.
The problem is that several topics are non-English clusters. The pipeline has no language filter, so Polish, Russian, German, and Turkish reviews got grouped by shared vocabulary rather than shared meaning. BERTopic did its job (cluster similar documents), but "similar" in this case means "written in the same language," not "about the same theme."
With 1,201 reviews across 12 games, BERTopic is also working near the lower bound for stable topic extraction. More data per game would sharpen the topic boundaries and reduce the outlier rate.
Preprocessing decisions
Each review gets stored two ways. cleaned_text preserves capitalization and punctuation because VADER uses those as signals — ALL CAPS amplifies sentiment, exclamation marks boost intensity. tokens is the aggressive version: lowercased, lemmatized, stopwords removed, built for BERTopic.
One preprocessing pipeline, two outputs. VADER gets the text it needs and BERTopic gets the tokens it needs. The schema enforces this split so there's no ambiguity about which representation feeds which model.
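For one hypothetical review, the two stored representations look like this (the text and token list are invented for illustration):

```python
raw = "The combat is AMAZING!!! [b]10/10[/b] would play again"

record = {
    # For VADER: markup stripped, caps and punctuation kept as
    # intensity signals.
    "cleaned_text": "The combat is AMAZING!!! 10/10 would play again",
    # For BERTopic: lowercased, lemmatized, stopwords and domain
    # words ("play") removed.
    "tokens": ["combat", "amazing"],
}
```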
BBCode removal runs before HTML tag removal. Steam reviews use BBCode markup ([b], [i], [url=...]) that can contain HTML-like characters. Running HTML removal first would split BBCode tags incorrectly and leave fragments in the text. The order matters.
I removed domain-specific stopwords (game, play, player, steam) because they appeared in nearly every review and dominated topic labels without carrying any information. The word "game" tells you nothing about what makes one cluster different from another. Take it out and the real terms show up.
Architecture
```
Steam API → data_pull.py → data/raw/*.json → db_setup.py → SQLite
                                                             │
                           preprocessing.py → sentiment.py → topics.py
                                                             │
                                                      dashboard/app.py
```
The database is SQLite with five tables:
| Table | What's in it |
|---|---|
| metadata | 12 games with app_id, name, and genre |
| documents | 2,382 raw reviews with recommendation flag, playtime, timestamps |
| processed_documents | Cleaned text, token arrays (JSON), token counts, sentinel flags |
| nlp_results | VADER compound/pos/neg/neu scores and topic assignments per review |
| topics | 11 topic labels with top words and document counts |
Every pipeline step uses INSERT OR IGNORE keyed on document ID. The full pipeline is idempotent. Run it twice, get the same database.
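The pattern is easy to demonstrate with an in-memory database (table and column names here are illustrative, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, body TEXT)")

def upsert(doc_id: str, body: str) -> None:
    # Duplicate doc_ids are silently skipped, so re-running the
    # pipeline re-issues the same inserts as no-ops.
    conn.execute(
        "INSERT OR IGNORE INTO documents (doc_id, body) VALUES (?, ?)",
        (doc_id, body),
    )

upsert("r1", "great game")
upsert("r1", "great game")  # second run: ignored, not duplicated
count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```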
Dashboard queries hit SQLite directly. All sidebar filters (genre, game, sentiment range, topic, date range) push down to SQL WHERE clauses through parameterized queries. No DataFrames are loaded into memory and filtered client-side.
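A sketch of that pushdown pattern, assuming a flattened `reviews` table for brevity (the real schema joins several tables, and these column names are illustrative):

```python
import sqlite3

def query_reviews(conn, genre=None, min_compound=None):
    """Each active filter becomes a WHERE clause; every user-supplied
    value goes through a ? placeholder, never string formatting."""
    clauses, params = ["1 = 1"], []
    if genre is not None:
        clauses.append("genre = ?")
        params.append(genre)
    if min_compound is not None:
        clauses.append("compound >= ?")
        params.append(min_compound)
    sql = "SELECT doc_id FROM reviews WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()
```

SQLite does the filtering, so the dashboard never materializes the full table in memory.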
What I'd do differently
Add language detection. A 20-line fix with langdetect would kill the non-English topic clusters. I documented the limitation instead of fixing it because I was shipping the simple iteration on a deadline. It's the obvious next improvement.
Pull more data. 200 reviews per game is enough to demonstrate the pipeline but not enough for BERTopic to produce stable topics. 1,000+ per game would sharpen the topic boundaries.
Swap VADER for a transformer. VADER was the right choice for the simple iteration — fast, interpretable, no GPU needed. But a fine-tuned DistilBERT trained on the recommendation flag would handle sarcasm, mixed-sentiment paragraphs, and gaming jargon that a lexicon can't touch.
Quick start
```
git clone https://github.com/ShameekConyers/nlp_dashboard.git
cd nlp_dashboard
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python -m spacy download en_core_web_sm
.venv/bin/streamlit run dashboard/app.py
```
No API key needed. The seed database ships with all 2,382 reviews, preprocessing results, sentiment scores, and topic assignments.
Tools: Python, SQLite, spaCy, NLTK (VADER), BERTopic, sentence-transformers, UMAP, Streamlit, Plotly, pytest