Spreadsheet Google Drive Consolidator
Drive-to-Database: Automated Excel Consolidation Pipeline
This project takes a folder of messy spreadsheets, standardizes and validates every row, and loads clean data into a SQLite database. Any row that fails validation doesn't get silently dropped. It gets quarantined in a separate table with a plain-English explanation of exactly what went wrong and where. A Streamlit dashboard shows the full before/after picture: how many rows loaded cleanly, how many got flagged, and why.
The pipeline runs in two modes. Local mode works immediately on clone: no credentials, no API keys, just point it at a folder of files. Google Drive mode is optional: pass a folder ID and your own OAuth credentials, and the tool downloads files directly from Drive before running the same pipeline.
Motivation
I've seen this pattern in a few different places: a small team collaborating on a shared Google Drive folder, each person maintaining their own spreadsheet, no shared schema, no enforcement. By the time someone wants to analyze anything, the data is a mess. Column names differ file to file. Dates are formatted six different ways. Some cells have notes typed in where numbers should be. And whenever someone tries to consolidate it, rows quietly disappear and nobody can tell you why.
The standard fix is usually PowerBI or some ETL tool. But not every client or small business has a PowerBI license, and a lot of teams are more comfortable editing a spreadsheet than learning a new tool. They want something lightweight that plugs into the folder structure they already have.
That's the gap this project targets. It's not a data warehouse. It's a data cleaning pipeline that works on messy Excel files and gives you an honest report of what it found.
The Quarantine Pattern
This is the core feature. Every row that fails validation goes into a quarantine table with the original data
unchanged, the source file name and row number, and a plain-English reason for the flag. Here are examples drawn
from the sample files this project ships with:
"revenue is negative (-450.00) in row 23 of Q3_2024_sales.xlsx"
"date '2099-01-01' is out of range in row 14 of Q4_2024_sales.xlsx"
"date '2024-13-01' is not a valid date in row 3 of Q3_2024_sales.xlsx"
"quantity 'TBD' is not numeric in row 9 of Q3_2024_sales.xlsx"
"required field 'customer' is empty in row 12 of Q1_2024_sales.xlsx"
"date 'Mike please update this by Friday' is not a valid date in row 31 of Q2_2024_sales.xlsx"
That last one is a good illustration of why this matters. Someone typed a note directly into a date cell. The consolidator sees it, can't parse it as a date, and sends it to quarantine with exactly that explanation. No data is lost, and whoever owns the file has a clear action item.
At the end of a run the dashboard shows something like: "347 rows loaded successfully, 12 rows quarantined for review." The quarantine table is filterable by source file, so you can hand back a targeted list of fixes to whichever team member owns each file.
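As an illustration of how such reason strings might be produced, here is a hypothetical sketch of a single field check; the function name and signature are illustrative, not the repo's actual validator API:

```python
def check_revenue(value, row_num, source_file):
    """Return None if the value passes validation, else a plain-English
    quarantine reason that points back to the exact source cell."""
    try:
        revenue = float(value)
    except (TypeError, ValueError):
        # Notes typed into numeric cells ("TBD", "see below") land here.
        return f"revenue '{value}' is not numeric in row {row_num} of {source_file}"
    if revenue < 0:
        return f"revenue is negative ({revenue:.2f}) in row {row_num} of {source_file}"
    return None  # row is clean with respect to this rule
```

The key property is that the check never mutates the value; it only explains the failure, leaving the original data intact for the quarantine table.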
Design Trade-offs
The natural question is why not just use PowerBI or Tableau Prep. Not every team has access to those tools, and they require someone who knows how to use them. More importantly, they don't give you a queryable database at the end. They give you a dashboard. If you want the underlying clean data in a format you can query with SQL, write to, or build on top of, you need a different approach.
SQLite was the right database for this. No server to set up, no credentials to manage, the database is a single file you can inspect with any SQLite browser, and it works immediately on clone. For the scale this tool targets — a few hundred to a few thousand rows from a shared team folder — anything heavier would be overhead without benefit.
Validation rules live in YAML rather than Python because the whole point is reusability without code changes. Someone adapting this for their own spreadsheets should be able to open one config file, change the column names and range limits, and have it work. The YAML config makes the tool usable by anyone comfortable editing a text file, not just whoever wrote the Python.
One deliberate choice: the pipeline never fixes bad data automatically. When a cell has "see notes below" in a revenue column, there is no way to know what the real value is. Silently correcting those rows means making assumptions about data whose intent you don't actually know. The production-grade pattern is to quarantine the row and hand it back to whoever owns the source file.
The tool accepts CSV, Excel, and Google Sheets exports interchangeably. Keeping it format-agnostic means anyone on the team can contribute files however they prefer, and the pipeline handles everything without a conversion step.
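Format dispatch can be as simple as an extension check. This sketch (not the repo's actual loader) shows the idea with pandas:

```python
from pathlib import Path

import pandas as pd

def read_any(path):
    """Load a spreadsheet regardless of format, returning a DataFrame.

    CSV and Excel files go through different pandas readers but come out
    identical downstream, so the rest of the pipeline never needs to know
    which format a contributor used.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(path)
    raise ValueError(f"unsupported file type: {suffix}")
```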
Before/After: What the Pipeline Does to Messy Files
The project ships with eight sample files in data/sample_files/, six Excel and two CSV, designed to reflect the
kind of mess you'd actually encounter on a shared drive. Here's what the consolidator handles before the data even
reaches the validator:
| Issue in source files | How it's handled |
|---|---|
| Column called "Revenue", "Rev.", "Total Revenue" across different files | Mapped to canonical name revenue via alias dictionary |
| Dates formatted as 2024-01-15, 01/15/2024, Jan 15 2024, 15-Jan-24 | Normalized to ISO 8601 (YYYY-MM-DD) |
| Numbers stored as text, currency symbols like $1,200.00 | Stripped and cast to numeric; failures preserved for quarantine |
| west_region_2024.xlsx has two title rows above the actual header | Header row auto-detected; title rows skipped |
| Same row appearing in two different quarterly files | Exact cross-file duplicates removed; first occurrence kept |
| File has no region column but filename contains "west" | Region column injected: "West" |
Every one of those transformations gets written to the cleaning_log table: what changed, in which file, and when.
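A sketch of how one such transformation, alias-based column renaming, could feed the cleaning_log. The alias dictionary and log-entry shape here are assumptions for illustration, not the repo's actual code:

```python
import pandas as pd

# Hypothetical alias dictionary; the real one covers many more variants.
COLUMN_ALIASES = {
    "rev.": "revenue",
    "total revenue": "revenue",
    "revenue": "revenue",
}

def canonicalize_columns(df, source_file, log):
    """Rename columns to canonical names, recording each rename in the log."""
    renames = {}
    for col in df.columns:
        canonical = COLUMN_ALIASES.get(col.strip().lower())
        if canonical and canonical != col:
            renames[col] = canonical
            log.append({"file": source_file, "action": "rename",
                        "detail": f"{col} -> {canonical}"})
    return df.rename(columns=renames)
```

Because every transformation appends to the same log structure, the cleaning_log table ends up as a complete audit trail of what the consolidator did to each file.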
Configurable Validation Rules
Rules live in config/validation_rules.yaml. Here's the actual config for this project:
```yaml
columns:
  revenue:
    type: numeric
    min: 0
    required: true
  date:
    type: date
    min: "2015-01-01"
    max: "2026-12-31"
    required: true
  quantity:
    type: numeric
    min: 0
    max: 100000
    required: true
  region:
    type: text
    required: false
  sales_rep:
    type: text
    required: false
  customer:
    type: text
    required: false
  product:
    type: text
    required: false

min_non_null_fields: 3
flag_non_conforming_types: true
negative_revenue_allowed_files:
  - returns_flagged.csv
```
That last section is worth noting: returns_flagged.csv is a refund data file where negative revenue is
intentional. The validator has a file-level exception so those rows don't get quarantined. That kind of nuance
("this rule applies everywhere except this specific file, for a business reason") is exactly the kind of thing that
comes up in real data pipelines.
To adapt this tool to a completely different dataset, you change the YAML. No Python edits required.
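A sketch of how YAML-driven rules keep the Python generic. The config snippet is abbreviated from the one above, and `check_min` is an illustrative helper, not the repo's API:

```python
import yaml

# Abbreviated version of config/validation_rules.yaml for illustration.
RULES_YAML = """
columns:
  revenue:
    type: numeric
    min: 0
    required: true
"""

rules = yaml.safe_load(RULES_YAML)

def check_min(field, value, rules):
    """Apply the 'min' rule from the YAML config to a single value.

    The Python never hardcodes column names or limits; it only interprets
    whatever the config declares, which is what makes the tool reusable.
    """
    spec = rules["columns"][field]
    if "min" in spec and value < spec["min"]:
        return f"{field} is below the minimum of {spec['min']}"
    return None
```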
Architecture
```
data/sample_files/ ──► consolidator.py ──► validator.py
 (or Google Drive)                              │
         ▲                             ┌────────┴────────┐
         │                             ▼                 ▼
 drive_connector.py                clean_df        quarantine_df
                                       │                 │
                                       └────────┬────────┘
                                                ▼
                                          db_loader.py
                                                │
                               ┌────────────────┼────────────────┐
                               ▼                ▼                ▼
                         consolidated      quarantine       cleaning_log
                             table            table            table
                                                │
                               ┌────────────────┼────────────────┐
                               ▼                ▼                ▼
                           export.py        report.py     dashboard/app.py
                          (Excel out)     (HTML report)    (Streamlit UI)
```
The pipeline has clean separation between stages. consolidator.py handles all the messy standardization work
(column names, dates, numeric cleaning, duplicates) and outputs a single merged DataFrame. validator.py doesn't
touch the data. It only decides which rows are clean and which get quarantined. db_loader.py just writes to
SQLite. Each module has one job.
Each module uses functional programming patterns so transformations are easy to swap out. If you need a different
connector, say Dropbox instead of Google Drive, you replace drive_connector.py without touching anything
downstream.
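As a sketch of what that swap-friendly boundary could look like, here is an illustrative connector interface; the Protocol and class names are assumptions, since the repo uses plain modules rather than a formal interface:

```python
from typing import Protocol

class SourceConnector(Protocol):
    """Anything that can produce a local folder of spreadsheet files."""

    def fetch(self) -> str:
        """Return the path of a local folder containing the source files."""
        ...

class LocalFolder:
    """Trivial connector: the files are already on disk."""

    def __init__(self, path: str):
        self.path = path

    def fetch(self) -> str:
        return self.path

def run_pipeline(connector: SourceConnector) -> str:
    # Downstream stages only ever see a local folder path, so a Dropbox
    # or S3 connector would slot in here without touching anything else.
    return connector.fetch()
```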
Database tables:
| Table | Contents |
|---|---|
| consolidated | Clean rows that passed all validation rules |
| quarantine | Failed rows with original data preserved, source file, row number, and plain-English reason |
| cleaning_log | Every transformation applied: column renames, date normalization, numeric cleaning, duplicate removals |
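A minimal sketch of the database write, using pandas `to_sql` against an in-memory SQLite database. The column layouts here are illustrative; the real tool writes a database file on disk:

```python
import sqlite3

import pandas as pd

# Illustrative stand-ins for the pipeline's two output DataFrames.
clean_df = pd.DataFrame({"revenue": [1200.0], "customer": ["Acme"]})
quarantine_df = pd.DataFrame({
    "source_file": ["Q3_2024_sales.xlsx"],
    "source_row": [23],
    "reason": ["revenue is negative (-450.00) in row 23 of Q3_2024_sales.xlsx"],
})

conn = sqlite3.connect(":memory:")  # the real tool connects to a file path
clean_df.to_sql("consolidated", conn, index=False, if_exists="replace")
quarantine_df.to_sql("quarantine", conn, index=False, if_exists="replace")

# The result is an ordinary SQLite database, queryable with plain SQL.
loaded = conn.execute("SELECT COUNT(*) FROM consolidated").fetchone()[0]
```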
Google Drive Integration
The Drive connector is the optional layer. It authenticates via OAuth 2.0, lists all .xlsx/.xls/.csv files in
a specified folder, downloads them, and hands the local folder path to consolidator.py. The same pipeline runs as
if the files were always local. The integration code is fully readable in the repo. The credentials live in .env
and are never committed.
```bash
# Run the pipeline against a Drive folder
.venv/bin/python src/consolidator.py --source gdrive --folder-id YOUR_FOLDER_ID
.venv/bin/python src/db_loader.py --full
.venv/bin/streamlit run dashboard/app.py
```
This pattern, separating the integration code (public, readable) from the credentials (private, gitignored), is a deliberate design choice. It means anyone can review exactly how the Drive connection works without needing access to your Google Cloud project.
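A minimal sketch of that split, assuming something like python-dotenv loads .env into the process environment; the variable name GDRIVE_CREDENTIALS_PATH is an assumption for this sketch, not necessarily the repo's:

```python
import os

def credentials_path() -> str:
    """Read the OAuth credential file location from the environment.

    The integration code stays public and reviewable; only this
    environment variable (populated from the gitignored .env) points
    at the secret. The variable name here is hypothetical.
    """
    path = os.environ.get("GDRIVE_CREDENTIALS_PATH")
    if not path:
        raise RuntimeError(
            "GDRIVE_CREDENTIALS_PATH is not set; add it to your .env file"
        )
    return path
```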
Engineering Notes
The source_file and source_row metadata has to be attached before any cleaning happens, not after. If you tag
rows after duplicate removal and empty-row filtering, the row numbers in your quarantine table point to positions in
the cleaned DataFrame, not the original file. That makes the quarantine reasons useless for tracing back to the
source. Tagging first, cleaning second keeps every reference accurate.
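A sketch of the tag-first ordering; the two-row header offset is an assumption for illustration, not the repo's actual constant:

```python
import pandas as pd

def tag_source(df, source_file, header_offset=2):
    """Attach source metadata BEFORE any cleaning, so quarantine reasons
    can always point back to the original spreadsheet row.

    header_offset accounts for the header line and 1-based spreadsheet
    numbering; the exact value is illustrative.
    """
    df = df.copy()
    df["source_file"] = source_file
    df["source_row"] = df.index + header_offset
    return df

# Tag first, then dedupe: surviving rows keep their ORIGINAL row numbers.
raw = pd.DataFrame({"customer": ["Acme", "Acme", "Beta"]})
tagged = tag_source(raw, "Q1_2024_sales.xlsx")
deduped = tagged.drop_duplicates(subset=["customer"], keep="first")
```

Had the deduplication run first, `source_row` would have been computed from the compacted index and no longer matched the file someone has to go fix.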
The sample files are the test spec. Writing validation logic against an abstract rule list and backfilling test data afterward means you miss edge cases. Define the inputs first, then write the code that handles them.
Quick Start
```bash
git clone <repo>
cd excel_consolidator
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
.venv/bin/python src/consolidator.py --input data/sample_files/
.venv/bin/python src/db_loader.py
.venv/bin/streamlit run dashboard/app.py
```
No credentials needed. The repo ships with eight sample files that demonstrate every edge case the pipeline handles.
Tools: Python, pandas, openpyxl, SQLite, PyYAML, Streamlit, google-api-python-client