Ingesting Data from Google Drive for ETL Work
Google Drive is where a lot of real-world data lives before it ever reaches a database. Sales teams maintain spreadsheets there. Finance uploads monthly reports. A shared folder accumulates Excel files that nobody wants to move. If you're building a pipeline that needs to touch that data, you need to know how to pull it out programmatically and hand it to your ETL code.
This walkthrough covers two paths: downloading files from Drive (Excel, CSV) and reading Google Sheets directly via the Sheets API. Both paths end at a pandas DataFrame you can pass to whatever comes next in your pipeline.
Authentication: Service Account vs OAuth
The first decision is how your code proves it has permission to access Drive.
Service account is the right choice for automated, server-side pipelines. You create a service
account in Google Cloud, download a JSON key file, and share your Drive folder or Sheet with the
service account's email address (it looks like your-account@your-project.iam.gserviceaccount.com).
No user interaction, no browser popup, runs headlessly. This is what you use for a scheduled job or
a pipeline that runs without anyone watching.
OAuth 2.0 is for tools a human runs. It opens a browser the first time, the user signs in, and the token gets cached locally. Good for a CLI tool someone runs on their own machine against their own Drive.
This walkthrough uses a service account. If you're building something user-facing, the auth setup differs but the API calls are identical once you have credentials.
Setting Up Credentials
1. Create a Google Cloud project
Go to console.cloud.google.com, create a new project, then enable two APIs: Google Drive API and Google Sheets API. Both are under APIs & Services > Library.
2. Create a service account
Go to APIs & Services > Credentials > Create Credentials > Service Account. Give it a name, skip
the optional steps, and hit Done. Then click into the service account, go to the Keys tab, and
create a new JSON key. Download the file. This is your credentials.json.
Never commit this file. Add it to .gitignore immediately.
# .gitignore
credentials.json
token.json
.env
3. Share your Drive folder with the service account
In Google Drive, right-click the folder you want to access > Share. Paste the service account's
email address (from the credentials JSON, under the client_email field). Give it Viewer access.
4. Store the credentials path in an environment variable
# .env
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
DRIVE_FOLDER_ID=your_folder_id_here
The folder ID is in the URL when you open a folder in Drive:
https://drive.google.com/drive/folders/THIS_PART_IS_THE_FOLDER_ID
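If you're accepting folder URLs from config or from users, a small helper saves copy-paste mistakes (folder_id_from_url is a hypothetical convenience; it accepts either a bare ID or a full folder URL):

```python
import re

def folder_id_from_url(url_or_id: str) -> str:
    """Extract the folder ID from a Drive folder URL, or pass a bare ID through."""
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url_or_id)
    return match.group(1) if match else url_or_id
```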
Installing Dependencies
.venv/bin/pip install google-api-python-client google-auth pandas openpyxl
Connecting to the APIs
Both the Drive and Sheets clients are built the same way. Build them once and pass them where needed.
import os

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]

def get_credentials():
    creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
    return service_account.Credentials.from_service_account_file(
        creds_path, scopes=SCOPES
    )

def get_drive_service():
    return build("drive", "v3", credentials=get_credentials())

def get_sheets_service():
    return build("sheets", "v4", credentials=get_credentials())
Listing Files in a Folder
The Drive API's files().list() call takes a query string. You filter by parent folder ID and
file type. Pagination is handled via nextPageToken.
def list_files_in_folder(drive_service, folder_id: str) -> list[dict]:
    """List all Excel and CSV files in a Drive folder.

    Args:
        drive_service: Authenticated Drive API client.
        folder_id: The Google Drive folder ID to search.

    Returns:
        List of dicts with 'id', 'name', and 'mimeType' for each file.
    """
    SUPPORTED_MIMES = {
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
        "application/vnd.ms-excel",  # .xls
        "text/csv",
        "application/vnd.google-apps.spreadsheet",  # Google Sheet
    }
    query = f"'{folder_id}' in parents and trashed = false"
    files = []
    page_token = None
    while True:
        response = drive_service.files().list(
            q=query,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        for f in response.get("files", []):
            if f["mimeType"] in SUPPORTED_MIMES:
                files.append(f)
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return files
This gives you back a list like:
[ {"id": "1aBcD...", "name": "Q1_2024_sales.xlsx", "mimeType": "application/vnd.openxmlformats-..."}, {"id": "2xYzW...", "name": "leads.csv", "mimeType": "text/csv"}, {"id": "3kLmN...", "name": "Revenue Tracker", "mimeType": "application/vnd.google-apps.spreadsheet"}, ]
Downloading Excel and CSV Files
For Excel and CSV files, download the raw bytes and write to a temp directory. Then read normally with pandas.
import io
import tempfile

import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

def download_file(drive_service, file_id: str, file_name: str, dest_dir: str) -> str:
    """Download a file from Drive to a local directory.

    Args:
        drive_service: Authenticated Drive API client.
        file_id: The Drive file ID.
        file_name: The file's name (used to name the local copy).
        dest_dir: Local directory path to save the file.

    Returns:
        Absolute path to the downloaded file.
    """
    dest_path = os.path.join(dest_dir, file_name)
    request = drive_service.files().get_media(fileId=file_id)
    with open(dest_path, "wb") as f:
        downloader = MediaIoBaseDownload(f, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()
    return dest_path

def download_folder_to_local(drive_service, folder_id: str) -> tuple[str, list[str]]:
    """Download all Excel and CSV files from a Drive folder to a temp directory.

    Args:
        drive_service: Authenticated Drive API client.
        folder_id: The Google Drive folder ID.

    Returns:
        Tuple of (temp_dir_path, list_of_downloaded_file_paths).
    """
    files = list_files_in_folder(drive_service, folder_id)
    temp_dir = tempfile.mkdtemp()
    downloaded = []
    NATIVE_SHEET_MIME = "application/vnd.google-apps.spreadsheet"
    for f in files:
        if f["mimeType"] == NATIVE_SHEET_MIME:
            # Skip native Sheets here — handled separately via Sheets API
            continue
        path = download_file(drive_service, f["id"], f["name"], temp_dir)
        downloaded.append(path)
    return temp_dir, downloaded
Once you have the downloaded list of local file paths, your existing ETL code can read them
like any other local files:
import os

import pandas as pd

def read_files_to_dataframes(file_paths: list[str]) -> list[tuple[str, pd.DataFrame]]:
    """Read a list of local Excel/CSV paths into DataFrames.

    Args:
        file_paths: List of local file paths to read.

    Returns:
        List of (filename, DataFrame) tuples.
    """
    results = []
    for path in file_paths:
        name = os.path.basename(path)
        if path.endswith(".csv"):
            df = pd.read_csv(path)
        else:
            df = pd.read_excel(path)
        results.append((name, df))
    return results
Reading Google Sheets Directly
Native Google Sheets don't have a file to download — they're stored in Google's own format. You read them through the Sheets API, which returns rows as lists of values. The pattern is straightforward.
def read_sheet_to_dataframe(
    sheets_service,
    spreadsheet_id: str,
    sheet_range: str = "Sheet1",
) -> pd.DataFrame:
    """Read a Google Sheet range into a pandas DataFrame.

    The first row of the range is treated as the header.

    Args:
        sheets_service: Authenticated Sheets API client.
        spreadsheet_id: The spreadsheet ID from the Google Sheets URL.
        sheet_range: Sheet name or A1 range notation. Defaults to "Sheet1".

    Returns:
        DataFrame with the sheet contents.
    """
    result = (
        sheets_service.spreadsheets()
        .values()
        .get(spreadsheetId=spreadsheet_id, range=sheet_range)
        .execute()
    )
    rows = result.get("values", [])
    if not rows:
        return pd.DataFrame()
    headers = rows[0]
    data = rows[1:]
    # Pad short rows so every row has the same number of columns
    data = [row + [""] * (len(headers) - len(row)) for row in data]
    return pd.DataFrame(data, columns=headers)
The spreadsheet ID is in the URL:
https://docs.google.com/spreadsheets/d/THIS_IS_THE_ID/edit
If your sheet has multiple tabs you want to read, list the tab names and call this once per tab:
TAB_NAMES = ["Q1 Sales", "Q2 Sales", "Q3 Sales"]

frames = []
for tab in TAB_NAMES:
    df = read_sheet_to_dataframe(sheets_service, spreadsheet_id, sheet_range=tab)
    df["source_tab"] = tab
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
Exporting a Sheet as CSV (Alternative)
If you don't need real-time data and just want a snapshot, you can export a Sheet as CSV through the Drive API instead of using the Sheets API. This is simpler but only gives you one tab.
def export_sheet_as_csv(drive_service, file_id: str) -> pd.DataFrame:
    """Export a Google Sheet to CSV and read it as a DataFrame.

    Args:
        drive_service: Authenticated Drive API client.
        file_id: The Google Sheet's Drive file ID.

    Returns:
        DataFrame with the sheet's first tab.
    """
    response = drive_service.files().export_media(
        fileId=file_id,
        mimeType="text/csv",
    ).execute()
    return pd.read_csv(io.BytesIO(response))
Use the Sheets API if you need multiple tabs, cell types, or formulas. Use export_media if you
just need the visible values from the first tab.
Putting It Together: Full Ingest Function
This ties the pieces together into a single function your ETL code calls. It returns a list of
(name, DataFrame) tuples regardless of whether the source was an Excel file or a Google Sheet.
def ingest_from_drive(folder_id: str) -> list[tuple[str, pd.DataFrame]]:
    """Ingest all supported files from a Drive folder into DataFrames.

    Handles .xlsx, .xls, .csv, and native Google Sheets. Each item in the
    returned list has a source name and a DataFrame ready for ETL processing.

    Args:
        folder_id: Google Drive folder ID to ingest from.

    Returns:
        List of (source_name, DataFrame) tuples.
    """
    drive_svc = get_drive_service()
    sheets_svc = get_sheets_service()
    files = list_files_in_folder(drive_svc, folder_id)
    results = []
    NATIVE_SHEET_MIME = "application/vnd.google-apps.spreadsheet"
    with tempfile.TemporaryDirectory() as tmp_dir:
        for f in files:
            try:
                if f["mimeType"] == NATIVE_SHEET_MIME:
                    df = read_sheet_to_dataframe(sheets_svc, f["id"])
                    results.append((f["name"], df))
                else:
                    path = download_file(drive_svc, f["id"], f["name"], tmp_dir)
                    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
                    results.append((f["name"], df))
            except Exception as e:
                print(f"Skipped {f['name']}: {e}")
    return results
From here, each (name, df) pair goes to your consolidator, validator, loader — whatever the
next stage of your ETL pipeline expects. The ingest function doesn't know or care about schema.
It just gets data out of Drive and into memory.
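As an illustration of what that handoff can look like, here's a sketch of a first downstream step that normalizes column names on each (name, DataFrame) pair (standardize_columns is a hypothetical example of a consumer, not part of the ingest layer):

```python
import pandas as pd

def standardize_columns(pairs: list[tuple[str, pd.DataFrame]]) -> list[tuple[str, pd.DataFrame]]:
    """Lowercase and snake_case column names on each (name, DataFrame) pair."""
    cleaned = []
    for name, df in pairs:
        df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
        cleaned.append((name, df))
    return cleaned
```

The ingest layer stays oblivious to any of this; it hands over raw frames and the downstream stage decides what a clean column name looks like.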
Patterns Worth Knowing
Keep credentials out of the repo. The service account JSON file holds a private key. It
belongs in .gitignore, in a secrets manager, or as an environment variable. Never check it in,
even in a private repo.
Use TemporaryDirectory for downloads. It cleans itself up automatically when the with
block exits. You don't have to manage temp file paths manually.
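The cleanup behavior is easy to verify in isolation: everything written inside the with block is gone once it exits.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "download.csv")
    with open(path, "w") as f:
        f.write("a,b\n1,2\n")
    assert os.path.exists(path)   # file exists while the block is open

# both the file and the directory are removed on exit
assert not os.path.exists(tmp_dir)
```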
Pad short rows from the Sheets API. The Sheets API omits trailing empty cells. A row with
five columns where the last two are blank comes back as a list of three values. The padding step
in read_sheet_to_dataframe prevents ragged DataFrames.
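A quick demonstration of the ragged-row problem and the fix, with plain Python lists standing in for a Sheets API response:

```python
import pandas as pd

headers = ["name", "email", "phone"]
rows = [
    ["Ada", "ada@example.com", "555-0100"],
    ["Grace", "grace@example.com"],   # trailing empty cell omitted by the API
]

# Pad each row to the header width so the DataFrame isn't ragged
padded = [row + [""] * (len(headers) - len(row)) for row in rows]
df = pd.DataFrame(padded, columns=headers)
```

Without the padding step, pd.DataFrame would still build the frame but the short row's missing cell would come back as NaN rather than an empty string, which complicates downstream string handling.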
One service per API, not one per call. Build the Drive client once and reuse it across all
file operations. The build() call has overhead. Creating a new client per file is wasteful.
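One simple way to enforce the build-once rule is to cache the factory function. A sketch using functools.cache, with a stand-in for get_drive_service (the counter just demonstrates that the constructor runs only once per process):

```python
from functools import cache

calls = {"n": 0}

@cache
def get_client():
    """Stand-in for get_drive_service(); the real build() call would go here."""
    calls["n"] += 1
    return object()   # placeholder for the real API client

a = get_client()
b = get_client()   # returns the cached instance; constructor not re-run
```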
Handle auth errors explicitly. If the credentials file is missing, loading it raises a
FileNotFoundError; if the service account hasn't been shared with the folder, the Drive API
returns a 403. Wrap the get_credentials() call with a clear error message so users know what
went wrong instead of seeing a raw stack trace.
def get_credentials():
    creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds_path or not os.path.exists(creds_path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set or the file doesn't exist. "
            "See README for Drive setup instructions."
        )
    return service_account.Credentials.from_service_account_file(
        creds_path, scopes=SCOPES
    )
What Comes Next
Once ingest_from_drive returns your list of DataFrames, you're past the auth and API layer.
Everything downstream — column standardization, type coercion, validation, deduplication, loading
to a database — works exactly the same whether the data came from Drive or a local folder. That's
the point of keeping the ingest layer thin: it does one job (get data into memory) and hands off
cleanly.
A practical follow-on from here is building a consolidation pipeline on top of this ingest layer. That's covered in the Google Drive Consolidator project, which adds column standardization, row-level validation, a quarantine table for bad data, and a Streamlit dashboard showing the full before/after picture.