Spreadsheet Google Drive Consolidator
Drive-to-Database: Automated Excel Consolidation Pipeline
This project takes a folder of messy spreadsheets, standardizes and validates every row, and loads clean data into a SQLite database. Any row that fails validation doesn't get silently dropped. It gets quarantined in a separate table with a plain-English explanation of exactly what went wrong and where. A Streamlit dashboard shows the full before/after picture: how many rows loaded cleanly, how many got flagged, and why.
The pipeline runs in two modes. Local mode works immediately on clone: no credentials, no API keys, just point it at a folder of files. Google Drive mode is optional: pass a folder ID and your own OAuth credentials, and the tool downloads files directly from Drive before running the same pipeline.
Motivation
I've seen this pattern in a few different places: a small team collaborating on a shared Google Drive folder, each person maintaining their own spreadsheet, no shared schema, no enforcement. By the time someone wants to analyze anything, the data is a mess. Column names differ file to file. Dates are formatted six different ways. Some cells have notes typed in where numbers should be. And whenever someone tries to consolidate it, rows quietly disappear and nobody can tell you why.
The standard fix is usually PowerBI or some ETL tool. But not every client or small business has a PowerBI license, and a lot of teams are more comfortable editing a spreadsheet than learning a new tool. They want something lightweight that plugs into the folder structure they already have.
That's the gap this project targets. It's not a data warehouse. It's a data cleaning pipeline that works on messy Excel files and gives you an honest report of what it found.
The Quarantine Pattern
This is the core feature. Every row that fails validation goes into a quarantine table with the original data
unchanged, the source file name and row number, and a plain-English reason for the flag. Here are examples drawn
from the sample files this project ships with:
"revenue is negative (-450.00) in row 23 of Q3_2024_sales.xlsx"
"date '2099-01-01' is out of range in row 14 of Q4_2024_sales.xlsx"
"date '2024-13-01' is not a valid date in row 3 of Q3_2024_sales.xlsx"
"quantity 'TBD' is not numeric in row 9 of Q3_2024_sales.xlsx"
"required field 'customer' is empty in row 12 of Q1_2024_sales.xlsx"
"date 'Mike please update this by Friday' is not a valid date in row 31 of Q2_2024_sales.xlsx"
That last one is a good illustration of why this matters. Someone typed a note directly into a date cell. The consolidator sees it, can't parse it as a date, and sends it to quarantine with exactly that explanation. No data is lost, and whoever owns the file has a clear action item.
At the end of a run the dashboard shows something like: "347 rows loaded successfully, 12 rows quarantined for review." The quarantine table is filterable by source file, so you can hand back a targeted list of fixes to whichever team member owns each file.
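As an illustration of how such reason strings might be produced, here is a hypothetical sketch of a single field check; the function name and signature are illustrative, not the repo's actual validator API:

```python
def check_revenue(value, row_num, source_file):
    """Return None if the value passes validation, else a plain-English
    quarantine reason that points back to the exact source cell."""
    try:
        revenue = float(value)
    except (TypeError, ValueError):
        # Notes typed into numeric cells ("TBD", "see below") land here.
        return f"revenue '{value}' is not numeric in row {row_num} of {source_file}"
    if revenue < 0:
        return f"revenue is negative ({revenue:.2f}) in row {row_num} of {source_file}"
    return None  # row is clean with respect to this rule
```

The key property is that the check never mutates the value; it only explains the failure, leaving the original data intact for the quarantine table.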
Design Trade-offs
The natural question is why not just use PowerBI or Tableau Prep. Not every team has access to those tools, and they require someone who knows how to use them. More importantly, they don't give you a queryable database at the end. They give you a dashboard. If you want the underlying clean data in a format you can query with SQL, write to, or build on top of, you need a different approach.
SQLite was the right database for this. No server to set up, no credentials to manage, the database is a single file you can inspect with any SQLite browser, and it works immediately on clone. For the scale this tool targets — a few hundred to a few thousand rows from a shared team folder — anything heavier would be overhead without benefit.
Validation rules live in YAML rather than Python because the whole point is reusability without code changes. Someone adapting this for their own spreadsheets should be able to open one config file, change the column names and range limits, and have it work. The YAML config makes the tool usable by anyone comfortable editing a text file, not just whoever wrote the Python.
One deliberate choice: the pipeline never fixes bad data automatically. When a cell has "see notes below" in a revenue column, there is no way to know what the real value is. Silently correcting those rows means making assumptions about data whose intent you don't actually know. The production-grade pattern is to quarantine the row and hand it back to whoever owns the source file.
The tool accepts CSV, Excel, and Google Sheets exports interchangeably. Keeping it format-agnostic means anyone on the team can contribute files however they prefer, and the pipeline handles everything without a conversion step.
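Format dispatch can be as simple as an extension check. This sketch (not the repo's actual loader) shows the idea with pandas:

```python
from pathlib import Path

import pandas as pd

def read_any(path):
    """Load a spreadsheet regardless of format, returning a DataFrame.

    CSV and Excel files go through different pandas readers but come out
    identical downstream, so the rest of the pipeline never needs to know
    which format a contributor used.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path)
    raise ValueError(f"unsupported file type: {suffix}")
```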
Before/After: What the Pipeline Does to Messy Files
The project ships with eight sample files in data/sample_files/, six Excel and two CSV, designed to reflect the
kind of mess you'd actually encounter on a shared drive. Here's what the consolidator handles before the data even
reaches the validator:
| Issue in source files | How it's handled |
|---|---|
| Column called "Revenue", "Rev.", "Total Revenue" across different files | Mapped to canonical name revenue via alias dictionary |
| Dates formatted as 2024-01-15, 01/15/2024, Jan 15 2024, 15-Jan-24 | Normalized to ISO 8601 (YYYY-MM-DD) |
| Numbers stored as text, currency symbols like $1,200.00 | Stripped and cast to numeric; failures preserved for quarantine |
| west_region_2024.xlsx has two title rows above the actual header | Header row auto-detected; title rows skipped |
| Same row appearing in two different quarterly files | Exact cross-file duplicates removed; first occurrence kept |
| File has no region column but filename contains "west" | Region column injected: "West" |
Every one of those transformations gets written to the cleaning_log table: what changed, in which file, and when.
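A sketch of how one such transformation, alias-based column renaming, could feed the cleaning_log. The alias dictionary and log-entry shape here are assumptions for illustration, not the repo's actual code:

```python
import pandas as pd

# Hypothetical alias dictionary; the real one covers many more variants.
COLUMN_ALIASES = {
    "rev.": "revenue",
    "total revenue": "revenue",
    "revenue": "revenue",
}

def canonicalize_columns(df, source_file, log):
    """Rename columns to canonical names, recording each rename in the log."""
    renames = {}
    for col in df.columns:
        canonical = COLUMN_ALIASES.get(col.strip().lower())
        if canonical and canonical != col:
            renames[col] = canonical
            log.append({"file": source_file, "action": "rename",
                        "detail": f"{col} -> {canonical}"})
    return df.rename(columns=renames)
```

Because every transformation appends to the same log structure, the cleaning_log table ends up as a complete audit trail of what the consolidator did to each file.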
Configurable Validation Rules
Rules live in config/validation_rules.yaml. Here's the actual config for this project:
```yaml
columns:
  revenue:
    type: numeric
    min: 0
    required: true
  date:
    type: date
    min: "2015-01-01"
    max: "2026-12-31"
    required: true
  quantity:
    type: numeric
    min: 0
    max: 100000
    required: true
  region:
    type: text
    required: false
  sales_rep:
    type: text
    required: false
  customer:
    type: text
    required: false
  product:
    type: text
    required: false

min_non_null_fields: 3
flag_non_conforming_types: true
negative_revenue_allowed_files:
  - returns_flagged.csv
```
That last section is worth noting: returns_flagged.csv is a refund data file where negative revenue is
intentional. The validator has a file-level exception so those rows don't get quarantined. That kind of nuance
("this rule applies everywhere except this specific file, for a business reason") is exactly the kind of thing that
comes up in real data pipelines.
To adapt this tool to a completely different dataset, you change the YAML. No Python edits required.
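A sketch of how YAML-driven rules keep the Python generic. The config snippet is abbreviated from the one above, and `check_min` is an illustrative helper, not the repo's API:

```python
import yaml

# Abbreviated version of config/validation_rules.yaml for illustration.
RULES_YAML = """
columns:
  revenue:
    type: numeric
    min: 0
    required: true
"""

rules = yaml.safe_load(RULES_YAML)

def check_min(field, value, rules):
    """Apply the 'min' rule from the YAML config to a single value.

    The Python never hardcodes column names or limits; it only interprets
    whatever the config declares, which is what makes the tool reusable.
    """
    spec = rules["columns"][field]
    if "min" in spec and value < spec["min"]:
        return f"{field} is below the minimum of {spec['min']}"
    return None
```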
Architecture
```
data/sample_files/ ──► consolidator.py ──► validator.py
 (or Google Drive)                              │
         ▲                             ┌────────┴────────┐
         │                             ▼                 ▼
 drive_connector.py                clean_df        quarantine_df
                                       │                 │
                                       └────────┬────────┘
                                                ▼
                                          db_loader.py
                                                │
                               ┌────────────────┼────────────────┐
                               ▼                ▼                ▼
                         consolidated      quarantine       cleaning_log
                             table            table            table
                                                │
                               ┌────────────────┼────────────────┐
                               ▼                ▼                ▼
                           export.py        report.py     dashboard/app.py
                          (Excel out)     (HTML report)    (Streamlit UI)
```
The pipeline has clean separation between stages. consolidator.py handles all the messy standardization work
(column names, dates, numeric cleaning, duplicates) and outputs a single merged DataFrame. validator.py doesn't
touch the data. It only decides which rows are clean and which get quarantined. db_loader.py just writes to
SQLite. Each module has one job.
Each module uses functional programming patterns so transformations are easy to swap out. If you need a different
connector, say Dropbox instead of Google Drive, you replace drive_connector.py without touching anything
downstream.
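As a sketch of what that swap-friendly boundary could look like, here is an illustrative connector interface; the Protocol and class names are assumptions, since the repo uses plain modules rather than a formal interface:

```python
from typing import Protocol

class SourceConnector(Protocol):
    """Anything that can produce a local folder of spreadsheet files."""

    def fetch(self) -> str:
        """Return the path of a local folder containing the source files."""
        ...

class LocalFolder:
    """Trivial connector: the files are already on disk."""

    def __init__(self, path: str):
        self.path = path

    def fetch(self) -> str:
        return self.path

def run_pipeline(connector: SourceConnector) -> str:
    # Downstream stages only ever see a local folder path, so a Dropbox
    # or S3 connector would slot in here without touching anything else.
    return connector.fetch()
```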
Database tables:
| Table | Contents |
|---|---|
| consolidated | Clean rows that passed all validation rules |
| quarantine | Failed rows with original data preserved, source file, row number, and plain-English reason |
| cleaning_log | Every transformation applied: column renames, date normalization, numeric cleaning, duplicate removals |
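A minimal sketch of the database write, using pandas `to_sql` against an in-memory SQLite database. The column layouts here are illustrative; the real tool writes a database file on disk:

```python
import sqlite3

import pandas as pd

# Illustrative stand-ins for the pipeline's two output DataFrames.
clean_df = pd.DataFrame({"revenue": [1200.0], "customer": ["Acme"]})
quarantine_df = pd.DataFrame({
    "source_file": ["Q3_2024_sales.xlsx"],
    "source_row": [23],
    "reason": ["revenue is negative (-450.00) in row 23 of Q3_2024_sales.xlsx"],
})

conn = sqlite3.connect(":memory:")  # the real tool connects to a file path
clean_df.to_sql("consolidated", conn, index=False, if_exists="replace")
quarantine_df.to_sql("quarantine", conn, index=False, if_exists="replace")

# The result is an ordinary SQLite database, queryable with plain SQL.
loaded = conn.execute("SELECT COUNT(*) FROM consolidated").fetchone()[0]
```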
Google Drive Integration
The Drive connector is the optional layer. It authenticates via OAuth 2.0, lists all .xlsx/.xls/.csv files in
a specified folder, downloads them, and hands the local folder path to consolidator.py. The same pipeline runs as
if the files were always local. The integration code is fully readable in the repo. The credentials live in .env
and are never committed.
```bash
# Run the pipeline against a Drive folder
.venv/bin/python src/consolidator.py --source gdrive --folder-id YOUR_FOLDER_ID
.venv/bin/python src/db_loader.py --full
.venv/bin/streamlit run dashboard/app.py
```
This pattern, separating the integration code (public, readable) from the credentials (private, gitignored), is a deliberate design choice. It means anyone can review exactly how the Drive connection works without needing access to your Google Cloud project.
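A minimal sketch of that split, assuming something like python-dotenv loads .env into the process environment; the variable name GDRIVE_CREDENTIALS_PATH is an assumption for this sketch, not necessarily the repo's:

```python
import os

def credentials_path() -> str:
    """Read the OAuth credential file location from the environment.

    The integration code stays public and reviewable; only this
    environment variable (populated from the gitignored .env) points
    at the secret. The variable name here is hypothetical.
    """
    path = os.environ.get("GDRIVE_CREDENTIALS_PATH")
    if not path:
        raise RuntimeError(
            "GDRIVE_CREDENTIALS_PATH is not set; add it to your .env file"
        )
    return path
```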
Engineering Notes
The source_file and source_row metadata has to be attached before any cleaning happens, not after. If you tag
rows after duplicate removal and empty-row filtering, the row numbers in your quarantine table point to positions in
the cleaned DataFrame, not the original file. That makes the quarantine reasons useless for tracing back to the
source. Tagging first, cleaning second keeps every reference accurate.
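A sketch of the tag-first ordering; the two-row header offset is an assumption for illustration, not the repo's actual constant:

```python
import pandas as pd

def tag_source(df, source_file, header_offset=2):
    """Attach source metadata BEFORE any cleaning, so quarantine reasons
    can always point back to the original spreadsheet row.

    header_offset accounts for the header line and 1-based spreadsheet
    numbering; the exact value is illustrative.
    """
    df = df.copy()
    df["source_file"] = source_file
    df["source_row"] = df.index + header_offset
    return df

# Tag first, then dedupe: surviving rows keep their ORIGINAL row numbers.
raw = pd.DataFrame({"customer": ["Acme", "Acme", "Beta"]})
tagged = tag_source(raw, "Q1_2024_sales.xlsx")
deduped = tagged.drop_duplicates(subset=["customer"], keep="first")
```

Had the deduplication run first, `source_row` would have been computed from the compacted index and no longer matched the file someone has to go fix.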
The sample files are the test spec. Writing validation logic against an abstract rule list and backfilling test data afterward means you miss edge cases. Define the inputs first, then write the code that handles them.
Quick Start
```bash
git clone <repo>
cd excel_consolidator
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python src/consolidator.py --input data/sample_files/
.venv/bin/python src/db_loader.py
.venv/bin/streamlit run dashboard/app.py
```
No credentials needed. The repo ships with eight sample files that demonstrate every edge case the pipeline handles.
Tools: Python, pandas, openpyxl, SQLite, PyYAML, Streamlit, google-api-python-client