AI-Verified Economic Analytics Pipeline
Macro Economic Dashboard
The information sector has been losing jobs per capita since 2022. Specialty trades keep growing. The yield curve was inverted for over two years. I wanted to know what the data actually says when you normalize it properly, cross-reference it with what tech workers are talking about online, and let AI summarize the findings only after every claim is verified against the source database.
The result is a full-stack analytics pipeline: 10 FRED series, 1,500+ Hacker News stories, a recession probability model, NLP topic modeling, and 21 AI-generated narratives that are each fact-checked before they reach the page.
Source Code
Code for the project can be found at github.com/ShameekConyers/sql_python_dashboard.
Dashboard
The question
The information sector has been losing jobs per capita since 2022. Specialty trades keep growing. Power output is climbing. The yield curve was inverted for over two years. What does any of this actually mean?
I pulled 10 FRED series and 1,500+ Hacker News stories to look at it from two sides: the numbers (recession indicators, employment divergence, inflation) and the text (what are tech workers talking about, and does their sentiment track the employment data).
The project has four dashboard sections, a recession probability model, an NLP topic model over HN stories, and 21 AI-generated narrative insights that are each fact-checked against the source database before they show up on the page.
Why FRED, why Hacker News
FRED was the obvious data source. I have an econometrics degree, it's the standard source for U.S. macro data, and the API is free with generous rate limits. The series map naturally into a star schema, and I could focus on the analysis instead of fighting pagination across multiple endpoints.
Hacker News was a pivot from Reddit. Reddit tightened API access in late 2025 and new script-type apps no longer issue working tokens from fresh accounts. The workarounds (browser-session extraction, reverse-engineered tokens) break the "works on clone" requirement. HN has a public search API through Algolia, no auth needed, stable for 10+ years, and the tech-practitioner audience carries the same labor-sentiment signal I was after. I pulled stories matching layoff, AI jobs, and career themes from January 2022 onward, scored them with a transformer sentiment model, and grouped them into topics with NMF.
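The Algolia HN Search API takes a free-text query, tag filters, and numeric filters on the story's unix timestamp. A minimal sketch of building such a request (the helper name and claim that the project paginates this way are my assumptions; the endpoint and parameters are Algolia's documented public API):

```python
from urllib.parse import urlencode

ALGOLIA_URL = "https://hn.algolia.com/api/v1/search_by_date"

def build_hn_query(query: str, since_unix: int, page: int = 0) -> str:
    """Build a request URL for HN stories matching `query` since `since_unix`."""
    params = {
        "query": query,
        "tags": "story",                                  # stories only, no comments
        "numericFilters": f"created_at_i>{since_unix}",   # unix-timestamp cutoff
        "hitsPerPage": 100,
        "page": page,
    }
    return f"{ALGOLIA_URL}?{urlencode(params)}"

# Example: layoff stories since 2022-01-01 (unix 1640995200)
url = build_hn_query("layoff", 1640995200)
# Fetching would then be e.g.:
#   import requests
#   hits = requests.get(url).json()["hits"]
```

No auth header is needed, which is what makes the "works on clone" requirement easy to satisfy.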
What the numbers show
The per-capita normalization matters more than anything else in this project. Raw info-sector employment looks flat. After dividing by working-age population, it fell 7.2%. Specialty trades grew 13.5% on the same basis. That's a 20+ point gap that's invisible in the raw data.
The yield curve was inverted in 26 of the months tracked. Every U.S. recession since the 1970s was preceded by an inversion.
Inflation hit 37% cumulative over the dataset window. Unemployment sits near 4.4%, but U6 (which counts discouraged and part-time workers) runs 3.3 percentage points higher. That gap has stayed wide since 2020.
Electric power output is up 8.5% since ChatGPT launched in November 2022.
On the NLP side, the dominant Hacker News topic is "Software Engineering Careers" at 585 stories. The most negative topic by sentiment is "Executive Firings & Restructuring." Layoff story volume and the U6-U3 unemployment gap move in the same direction across the 2022-2026 window.
How the verification works
This is what I'd talk about in an interview. The problem is simple: LLMs get numbers wrong. In early iterations where the model computed values itself, about 15% of claims failed verification. The fix was to stop asking it to do math.
Step 1: Python pre-computes the claims. For each of the 21 analytical slices, the script queries the database and builds 2-4 verifiable claims. Things like "USINFO changed -3.2% between 2025-04 and 2026-03" or "the yield curve was inverted in 26 months." These are computed in Python from real database queries, not generated by the model.
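A minimal sketch of that pre-computation step, using the `observations` table from the project's star schema (the claim dict's exact fields are my assumption, not the project's code):

```python
import sqlite3

def build_pct_change_claim(conn, series_id, start, end):
    """Pre-compute a verifiable claim: percent change of a series between two dates."""
    cur = conn.execute(
        "SELECT date, value_covid_adjusted FROM observations "
        "WHERE series_id = ? AND date IN (?, ?) ORDER BY date",
        (series_id, start, end),
    )
    (_, v0), (_, v1) = cur.fetchall()
    pct = round((v1 - v0) / v0 * 100, 1)
    return {
        "claim": f"{series_id} changed {pct:+}% between {start} and {end}",
        "metric": "pct_change",
        "expected": pct,          # the value the verifier will re-check later
        "series_id": series_id,
        "window": (start, end),
    }
```

The number in the claim string comes from the database query, so the model never has a chance to invent it.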
Step 2: The LLM writes prose only. The pre-computed claims and their underlying data context go to llama3.1:8b through Ollama (running locally, no API cost). The model's job is to turn the numbers into readable paragraphs. It doesn't compute anything.
Step 3: Independent check. A separate script re-queries the database for every claim and compares the expected value to the actual value. Tolerances are generous (5% relative or 0.5 absolute for values, sign-match for trends) because the goal is catching hallucinations, not rounding differences.
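The tolerance rule can be written as a small predicate, assuming one check per claim (a sketch of the rule as stated, not the project's exact function):

```python
def claim_passes(kind, expected, actual, rel_tol=0.05, abs_tol=0.5):
    """Values pass within 5% relative or 0.5 absolute; trends only need sign match."""
    if kind == "trend":
        # sign match: both positive, both negative, or both zero
        return (expected > 0) == (actual > 0) and (expected < 0) == (actual < 0)
    if abs(expected - actual) <= abs_tol:
        return True
    denom = max(abs(expected), 1e-12)  # guard against division by zero
    return abs(expected - actual) / denom <= rel_tol
```

A hallucinated figure will typically miss by far more than either tolerance, while rounding in the generated prose sails through.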
The dashboard shows a badge on each insight block: green if all claims passed, orange if some failed (with "X of Y confirmed"), red if none passed. A "Show sources" panel inside each insight shows a table with every claim, the expected value, the actual value, and whether they matched.
21 insights ship in the seed database. All 21 pass verification. The demo works without Ollama installed.
COVID adjustment
COVID broke every rolling-window calculation in the dataset. Unemployment went from 3.5% to 14.8% in a single month. A 12-month YoY window touching April 2020 produces +300% and -58% swings that dominate every chart for two years.
For each series, an ARIMA model fitted on pre-COVID data fills in a counterfactual for the March 2020 to January 2022 window, with a 3-month taper back to actual values. The raw data stays in the value column. The adjusted version goes in value_covid_adjusted. Every query uses the adjusted column except the COVID recovery chart, which shows the real shock on purpose.
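The taper step is the subtle part. A minimal sketch of the blend, assuming a linear weight ramp over the last three adjusted months (the ARIMA fit itself is done with pmdarima and omitted here; the function name and exact weighting are my assumptions):

```python
def taper_blend(forecast, actual, taper_months=3):
    """Blend an ARIMA counterfactual back into actual values.

    Inside the adjustment window the counterfactual is used as-is; over the
    last `taper_months` points the weight shifts linearly from forecast to
    actual so the adjusted series rejoins the real one without a jump.
    """
    n = len(forecast)
    out = list(forecast[: n - taper_months])
    for i in range(taper_months):
        w = (i + 1) / (taper_months + 1)   # 0.25, 0.5, 0.75 for a 3-month taper
        j = n - taper_months + i
        out.append((1 - w) * forecast[j] + w * actual[j])
    return out
```

Without the taper, the adjusted column would show a discontinuity in January 2022 that looks like a second shock.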
This didn't change any conclusions. The same findings hold before and after. It just made the charts readable and the rolling calculations meaningful.
Per-capita normalization
Raw employment numbers grow partly because the U.S. working-age population grows about 0.5% per year. Comparing specialty trades employment of 4,256k in 2016 to 5,244k in 2026 overstates the real sector expansion because some of that growth is just more people.
USINFO and CES2023800001 are divided by CNP16OV (civilian noninstitutional population 16+) to get employees per 1,000 working-age persons, then indexed to 100 at the start date.
Before normalization, the information sector showed index 101. After, it's 93. That's the difference between "the sector barely moved" and "it shrank 7.2% relative to population." The methodology matters as much as the chart.
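The normalization itself is two lines of arithmetic. A sketch with toy numbers (function name is mine; the divide-then-index order matches the writeup):

```python
def per_capita_index(emp, pop):
    """Employees per 1,000 working-age persons, indexed to 100 at the start date."""
    per_1k = [e / p * 1000 for e, p in zip(emp, pop)]
    base = per_1k[0]
    return [round(v / base * 100, 1) for v in per_1k]

# Flat employment against a growing population reads as decline:
idx = per_capita_index([100, 100], [1000, 1100])
# idx -> [100.0, 90.9]
```

This is exactly how a "flat" raw series turns into a falling index: the denominator grew and the numerator didn't.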
Recession model
A logistic regression and a random forest are trained on 11 FRED-derived features (yield spread, unemployment change, GDP growth, CPI momentum, employment ratios) plus 3 Hacker News features (rolling sentiment, story volume, layoff frequency). The model outputs a monthly recession probability score between 0 and 1, stored in recession_predictions.
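The shape of the logistic half can be sketched in a few lines with synthetic data (the feature list is from the writeup; the data and threshold here are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))        # e.g. yield spread, unemployment change, GDP growth
y = (X[:, 0] < -0.5).astype(int)     # pretend: deep inversion -> recession month

clf = LogisticRegression().fit(X, y)
prob = clf.predict_proba(X[:1])[0, 1]  # monthly recession probability in [0, 1]
```

In the real pipeline the fitted probabilities for each month go into the recession_predictions table alongside a snapshot of the feature values that produced them.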
The HN features have near-zero importance in the shipped model. The pre-2022 training period has no HN data, so those months get filled with training-period medians. That constant fill dilutes whatever signal exists in the 24 post-2022 months. It's honest about its limitations.
The dashboard's Recession Risk tab shows a probability timeline, a feature snapshot with red/green signals, and a What If scenario explorer where you can drag sliders and see how the risk score responds.
NLP section
I used sklearn's NMF (Non-negative Matrix Factorization) over the 1,547 HN story titles and excerpts to extract 8 topics. I tested values from 6 to 10 and picked 8 because it produced the most distinct clusters without splitting related themes into fragments.
I had to add stop words for non-AI proper nouns (Musk, Twitter, Tesla, Meta, Facebook) because they were creating personality-driven topics instead of labor-theme topics. AI companies and people (OpenAI, Altman) stayed in the vocabulary since they're part of the thesis.
The dashboard's NLP Analysis section has four charts:
- Topic distribution over time (stacked area, shows how the conversation shifted)
- Sentiment by topic (box plot, shows which themes carry the most negative tone)
- Layoff story volume vs the U6-U3 unemployment gap (dual-axis, tests whether HN chatter tracks macro slack)
- Topic sentiment vs USINFO per-capita employment (dual-axis, tests whether sentiment tracks actual job numbers)
Monthly bigram frequencies are pre-computed and shown as a quarterly heatmap in an expander.
Three insight blocks cover the section: a two-paragraph overview at the top, one for the topic/sentiment charts, and one for the cross-domain charts.
RAG citations
Each AI insight pulls relevant context from a vector store before generation. FRED series metadata (release notes, category hierarchy) and curated federal public-domain publications (BEA, EIA, CEA reports) are chunked, embedded with sentence-transformers, and stored in ChromaDB. During generation, the top-k chunks for each slice get injected into the prompt as reference context.
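The retrieval step is a top-k cosine lookup over the chunk embeddings. A minimal sketch, with toy 2-d vectors standing in for the sentence-transformers embeddings and plain NumPy standing in for ChromaDB's query API:

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    C = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = C @ q                       # cosine similarity per chunk
    return np.argsort(sims)[::-1][:k]  # best-first

chunks = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx = top_k(np.array([1.0, 0.2]), chunks)
# idx -> [2, 0]
```

The returned chunks are what gets injected into the prompt as reference context for each slice.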
The LLM is supposed to cite these with [ref:N] tags. It doesn't. llama3.1:8b ignores the citation instruction consistently. All displayed sources come from an auto-attach mechanism that surfaces the retrieved chunks in the "Show sources" panel regardless of whether the model cited them. It works for the reader, but it's a workaround, not the intended design.
Architecture
```text
FRED API + HN Algolia API
        |
data_pull.py + hackernews_pull.py
        |
sentiment_score.py
        |
db_setup.py -> covid_adjustment.py -> topic_model.py
        |
export_csv.py -> embed_references.py -> recession_model.py
        |
ai_insights.py -> verify_insights.py
        |
dashboard/app.py
```
The database is SQLite with a star schema. Main tables:
| Table | What's in it |
|---|---|
| series_metadata | Display names, categories, units for each FRED series |
| observations | Raw values and ARIMA COVID-adjusted values side by side |
| ai_insights | Narratives, pre-computed claims, verification results, RAG citations |
| recession_predictions | Monthly probability scores, feature snapshots, model metadata |
| hn_stories | 1,547 HN stories with sentiment scores and topic assignments |
| hn_topics | 8 NMF topics with labels and top terms |
| hn_ngram_monthly | 520 monthly bigram frequency rows |
| reference_docs | FRED metadata + scholarly docs + HN social refs for RAG |
Two modes: seed (default, everything pre-computed, no API calls needed) and full (live pull, requires a free FRED API key).
Analysis queries
Eight SQL queries using CTEs, window functions, joins, and per-capita normalization:
| Query | What it answers |
|---|---|
| Q1 | Yield curve inversions vs unemployment (T10Y2Y monthly avg + UNRATE with 12-month lag) |
| Q2 | Info vs trades divergence, per-capita normalized, indexed to 100 |
| Q3 | GDP annualized growth with NBER recession shading |
| Q4 | Rolling 12-month per-capita employment growth by sector |
| Q5 | COVID recovery comparison (raw values, the one exception) |
| Q6 | U6 vs U3 unemployment gap |
| Q7 | Electric power output vs info employment |
| Q8 | CPI inflation MoM and YoY |
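In the spirit of Q6, here is a self-contained sketch of the U6 vs U3 gap query run through sqlite3 (schema matches the observations table above; the two rows of data are made up, and I've used UNRATE/U6RATE as illustrative series IDs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (series_id TEXT, date TEXT, value_covid_adjusted REAL)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("UNRATE", "2026-03", 4.4), ("U6RATE", "2026-03", 7.7)],
)

row = conn.execute("""
    WITH u3 AS (SELECT date, value_covid_adjusted AS v
                FROM observations WHERE series_id = 'UNRATE'),
         u6 AS (SELECT date, value_covid_adjusted AS v
                FROM observations WHERE series_id = 'U6RATE')
    SELECT u3.date, ROUND(u6.v - u3.v, 1) AS gap
    FROM u3 JOIN u6 USING (date)
""").fetchone()
# row -> ("2026-03", 3.3)
```

The real Q6 runs the same CTE-plus-join shape over every month in the window rather than a single date.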
Quick start
```shell
git clone https://github.com/ShameekConyers/sql_python_dashboard.git
cd sql_python_dashboard
python3 -m venv .venv
.venv/bin/pip install -r requirements-dev.txt
.venv/bin/streamlit run dashboard/app.py
```
No API key needed. The seed database ships with all 10 FRED series, 1,547 HN stories, 8 topics, recession predictions, and 21 verified AI insights.
Tools: Python, SQL/SQLite, pandas, scikit-learn, pmdarima, Plotly, Streamlit, sentence-transformers, ChromaDB, Ollama, FRED API, Algolia HN API