Ingesting Data from Google Drive for ETL Work
Google Drive is where a lot of real-world data lives before it ever reaches a database. Sales teams maintain spreadsheets there. Finance uploads monthly reports. A shared folder accumulates Excel files that nobody wants to move. If you're building a pipeline that needs to touch that data, you need to know how to pull it out programmatically and hand it to your ETL code.
This walkthrough covers two paths: downloading files from Drive (Excel, CSV) and reading Google Sheets directly via the Sheets API. Both paths end at a pandas DataFrame you can pass to whatever comes next in your pipeline.
Authentication: Service Account vs OAuth
The first decision is how your code proves it has permission to access Drive.
Service account is the right choice for automated, server-side pipelines. You create a service
account in Google Cloud, download a JSON key file, and share your Drive folder or Sheet with the
service account's email address (it looks like your-account@your-project.iam.gserviceaccount.com).
No user interaction, no browser popup, runs headlessly. This is what you use for a scheduled job or
a pipeline that runs without anyone watching.
OAuth 2.0 is for tools a human runs. It opens a browser the first time, the user signs in, and the token gets cached locally. Good for a CLI tool someone runs on their own machine against their own Drive.
This walkthrough uses a service account. If you're building something user-facing, the auth setup differs but the API calls are identical once you have credentials.
Setting Up Credentials
1. Create a Google Cloud project
Go to console.cloud.google.com, create a new project, then enable two APIs: Google Drive API and Google Sheets API. Both are under APIs & Services > Library.
2. Create a service account
Go to APIs & Services > Credentials > Create Credentials > Service Account. Give it a name, skip
the optional steps, and hit Done. Then click into the service account, go to the Keys tab, and
create a new JSON key. Download the file. This is your credentials.json.
Never commit this file. Add it to .gitignore immediately.
# .gitignore
credentials.json
token.json
.env
3. Share your Drive folder with the service account
In Google Drive, right-click the folder you want to access > Share. Paste the service account's
email address (from the credentials JSON, under the client_email field). Give it Viewer access.
4. Store the credentials path in an environment variable
# .env
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
DRIVE_FOLDER_ID=your_folder_id_here
The folder ID is in the URL when you open a folder in Drive:
https://drive.google.com/drive/folders/THIS_PART_IS_THE_FOLDER_ID
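If you're accepting folder URLs from config or from users, a small helper saves copy-paste mistakes (folder_id_from_url is a hypothetical convenience; it accepts either a bare ID or a full folder URL):

```python
import re

def folder_id_from_url(url_or_id: str) -> str:
    """Extract the folder ID from a Drive folder URL, or pass a bare ID through."""
    match = re.search(r"/folders/([A-Za-z0-9_-]+)", url_or_id)
    return match.group(1) if match else url_or_id
```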
Installing Dependencies
.venv/bin/pip install google-api-python-client google-auth pandas openpyxl
Connecting to the APIs
Both the Drive and Sheets clients are built the same way. Build them once and pass them where needed.
import os

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]

def get_credentials():
    creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
    return service_account.Credentials.from_service_account_file(
        creds_path, scopes=SCOPES
    )

def get_drive_service():
    return build("drive", "v3", credentials=get_credentials())

def get_sheets_service():
    return build("sheets", "v4", credentials=get_credentials())
Listing Files in a Folder
The Drive API's files().list() call takes a query string. You filter by parent folder ID and
file type. Pagination is handled via nextPageToken.
def list_files_in_folder(drive_service, folder_id: str) -> list[dict]:
    """List all Excel and CSV files in a Drive folder.

    Args:
        drive_service: Authenticated Drive API client.
        folder_id: The Google Drive folder ID to search.

    Returns:
        List of dicts with 'id', 'name', and 'mimeType' for each file.
    """
    SUPPORTED_MIMES = {
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # .xlsx
        "application/vnd.ms-excel",  # .xls
        "text/csv",
        "application/vnd.google-apps.spreadsheet",  # Google Sheet
    }
    query = f"'{folder_id}' in parents and trashed = false"
    files = []
    page_token = None
    while True:
        response = drive_service.files().list(
            q=query,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        for f in response.get("files", []):
            if f["mimeType"] in SUPPORTED_MIMES:
                files.append(f)
        page_token = response.get("nextPageToken")
        if not page_token:
            break
    return files
This gives you back a list like:
[ {"id": "1aBcD...", "name": "Q1_2024_sales.xlsx", "mimeType": "application/vnd.openxmlformats-..."}, {"id": "2xYzW...", "name": "leads.csv", "mimeType": "text/csv"}, {"id": "3kLmN...", "name": "Revenue Tracker", "mimeType": "application/vnd.google-apps.spreadsheet"}, ]
Downloading Excel and CSV Files
For Excel and CSV files, download the raw bytes and write to a temp directory. Then read normally with pandas.
import io
import tempfile

import pandas as pd
from googleapiclient.http import MediaIoBaseDownload

def download_file(drive_service, file_id: str, file_name: str, dest_dir: str) -> str:
    """Download a file from Drive to a local directory.

    Args:
        drive_service: Authenticated Drive API client.
        file_id: The Drive file ID.
        file_name: The file's name (used to name the local copy).
        dest_dir: Local directory path to save the file.

    Returns:
        Absolute path to the downloaded file.
    """
    dest_path = os.path.join(dest_dir, file_name)
    request = drive_service.files().get_media(fileId=file_id)
    with open(dest_path, "wb") as f:
        downloader = MediaIoBaseDownload(f, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()
    return dest_path

def download_folder_to_local(drive_service, folder_id: str) -> tuple[str, list[str]]:
    """Download all Excel and CSV files from a Drive folder to a temp directory.

    Args:
        drive_service: Authenticated Drive API client.
        folder_id: The Google Drive folder ID.

    Returns:
        Tuple of (temp_dir_path, list_of_downloaded_file_paths).
    """
    files = list_files_in_folder(drive_service, folder_id)
    temp_dir = tempfile.mkdtemp()
    downloaded = []
    NATIVE_SHEET_MIME = "application/vnd.google-apps.spreadsheet"
    for f in files:
        if f["mimeType"] == NATIVE_SHEET_MIME:
            # Skip native Sheets here — handled separately via Sheets API
            continue
        path = download_file(drive_service, f["id"], f["name"], temp_dir)
        downloaded.append(path)
    return temp_dir, downloaded
Once you have the downloaded list of local file paths, your existing ETL code can read them
like any other local files:
import os

import pandas as pd

def read_files_to_dataframes(file_paths: list[str]) -> list[tuple[str, pd.DataFrame]]:
    """Read a list of local Excel/CSV paths into DataFrames.

    Args:
        file_paths: List of local file paths to read.

    Returns:
        List of (filename, DataFrame) tuples.
    """
    results = []
    for path in file_paths:
        name = os.path.basename(path)
        if path.endswith(".csv"):
            df = pd.read_csv(path)
        else:
            df = pd.read_excel(path)
        results.append((name, df))
    return results
Reading Google Sheets Directly
Native Google Sheets don't have a file to download — they're stored in Google's own format. You read them through the Sheets API, which returns rows as lists of values. The pattern is straightforward.
def read_sheet_to_dataframe(
    sheets_service,
    spreadsheet_id: str,
    sheet_range: str = "Sheet1",
) -> pd.DataFrame:
    """Read a Google Sheet range into a pandas DataFrame.

    The first row of the range is treated as the header.

    Args:
        sheets_service: Authenticated Sheets API client.
        spreadsheet_id: The spreadsheet ID from the Google Sheets URL.
        sheet_range: Sheet name or A1 range notation. Defaults to "Sheet1".

    Returns:
        DataFrame with the sheet contents.
    """
    result = (
        sheets_service.spreadsheets()
        .values()
        .get(spreadsheetId=spreadsheet_id, range=sheet_range)
        .execute()
    )
    rows = result.get("values", [])
    if not rows:
        return pd.DataFrame()
    headers = rows[0]
    data = rows[1:]
    # Pad short rows so every row has the same number of columns
    data = [row + [""] * (len(headers) - len(row)) for row in data]
    return pd.DataFrame(data, columns=headers)
The spreadsheet ID is in the URL:
https://docs.google.com/spreadsheets/d/THIS_IS_THE_ID/edit
If your sheet has multiple tabs you want to read, list the tab names and call this once per tab:
TAB_NAMES = ["Q1 Sales", "Q2 Sales", "Q3 Sales"]

frames = []
for tab in TAB_NAMES:
    df = read_sheet_to_dataframe(sheets_service, spreadsheet_id, sheet_range=tab)
    df["source_tab"] = tab
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
Exporting a Sheet as CSV (Alternative)
If you don't need real-time data and just want a snapshot, you can export a Sheet as CSV through the Drive API instead of using the Sheets API. This is simpler but only gives you one tab.
def export_sheet_as_csv(drive_service, file_id: str) -> pd.DataFrame:
    """Export a Google Sheet to CSV and read it as a DataFrame.

    Args:
        drive_service: Authenticated Drive API client.
        file_id: The Google Sheet's Drive file ID.

    Returns:
        DataFrame with the sheet's first tab.
    """
    response = drive_service.files().export_media(
        fileId=file_id,
        mimeType="text/csv",
    ).execute()
    return pd.read_csv(io.BytesIO(response))
Use the Sheets API if you need multiple tabs, cell types, or formulas. Use export_media if you
just need the visible values from the first tab.
Putting It Together: Full Ingest Function
This ties the pieces together into a single function your ETL code calls. It returns a list of
(name, DataFrame) tuples regardless of whether the source was an Excel file or a Google Sheet.
def ingest_from_drive(folder_id: str) -> list[tuple[str, pd.DataFrame]]:
    """Ingest all supported files from a Drive folder into DataFrames.

    Handles .xlsx, .xls, .csv, and native Google Sheets. Each item in the
    returned list has a source name and a DataFrame ready for ETL processing.

    Args:
        folder_id: Google Drive folder ID to ingest from.

    Returns:
        List of (source_name, DataFrame) tuples.
    """
    drive_svc = get_drive_service()
    sheets_svc = get_sheets_service()
    files = list_files_in_folder(drive_svc, folder_id)
    results = []
    NATIVE_SHEET_MIME = "application/vnd.google-apps.spreadsheet"
    with tempfile.TemporaryDirectory() as tmp_dir:
        for f in files:
            try:
                if f["mimeType"] == NATIVE_SHEET_MIME:
                    df = read_sheet_to_dataframe(sheets_svc, f["id"])
                    results.append((f["name"], df))
                else:
                    path = download_file(drive_svc, f["id"], f["name"], tmp_dir)
                    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
                    results.append((f["name"], df))
            except Exception as e:
                print(f"Skipped {f['name']}: {e}")
    return results
From here, each (name, df) pair goes to your consolidator, validator, loader — whatever the
next stage of your ETL pipeline expects. The ingest function doesn't know or care about schema.
It just gets data out of Drive and into memory.
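As an illustration of what that handoff can look like, here's a sketch of a first downstream step that normalizes column names on each (name, DataFrame) pair (standardize_columns is a hypothetical example of a consumer, not part of the ingest layer):

```python
import pandas as pd

def standardize_columns(pairs: list[tuple[str, pd.DataFrame]]) -> list[tuple[str, pd.DataFrame]]:
    """Lowercase and snake_case column names on each (name, DataFrame) pair."""
    cleaned = []
    for name, df in pairs:
        df = df.rename(columns=lambda c: str(c).strip().lower().replace(" ", "_"))
        cleaned.append((name, df))
    return cleaned
```

The ingest layer stays oblivious to any of this; it hands over raw frames and the downstream stage decides what a clean column name looks like.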
Patterns Worth Knowing
Keep credentials out of the repo. The service account JSON file holds a private key. It
belongs in .gitignore, in a secrets manager, or as an environment variable. Never check it in,
even in a private repo.
Use TemporaryDirectory for downloads. It cleans itself up automatically when the with
block exits. You don't have to manage temp file paths manually.
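The cleanup behavior is easy to verify in isolation: everything written inside the with block is gone once it exits.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "download.csv")
    with open(path, "w") as f:
        f.write("a,b\n1,2\n")
    assert os.path.exists(path)   # file exists while the block is open

# both the file and the directory are removed on exit
assert not os.path.exists(tmp_dir)
```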
Pad short rows from the Sheets API. The Sheets API omits trailing empty cells. A row with
five columns where the last two are blank comes back as a list of three values. The padding step
in read_sheet_to_dataframe prevents ragged DataFrames.
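A quick demonstration of the ragged-row problem and the fix, with plain Python lists standing in for a Sheets API response:

```python
import pandas as pd

headers = ["name", "email", "phone"]
rows = [
    ["Ada", "ada@example.com", "555-0100"],
    ["Grace", "grace@example.com"],   # trailing empty cell omitted by the API
]

# Pad each row to the header width so the DataFrame isn't ragged
padded = [row + [""] * (len(headers) - len(row)) for row in rows]
df = pd.DataFrame(padded, columns=headers)
```

Without the padding step, pd.DataFrame would still build the frame but the short row's missing cell would come back as NaN rather than an empty string, which complicates downstream string handling.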
One service per API, not one per call. Build the Drive client once and reuse it across all
file operations. The build() call has overhead. Creating a new client per file is wasteful.
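One simple way to enforce the build-once rule is to cache the factory function. A sketch using functools.cache, with a stand-in for get_drive_service (the counter just demonstrates that the constructor runs only once per process):

```python
from functools import cache

calls = {"n": 0}

@cache
def get_client():
    """Stand-in for get_drive_service(); the real build() call would go here."""
    calls["n"] += 1
    return object()   # placeholder for the real API client

a = get_client()
b = get_client()   # returns the cached instance; constructor not re-run
```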
Handle auth errors explicitly. If the credentials file is missing, loading it raises a
FileNotFoundError; if the service account hasn't been shared with the folder, the Drive API
returns a 403. Wrap the get_credentials() call with a clear error message so users know what
went wrong instead of seeing a raw stack trace.
def get_credentials():
    creds_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds_path or not os.path.exists(creds_path):
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set or the file doesn't exist. "
            "See README for Drive setup instructions."
        )
    return service_account.Credentials.from_service_account_file(
        creds_path, scopes=SCOPES
    )
What Comes Next
Once ingest_from_drive returns your list of DataFrames, you're past the auth and API layer.
Everything downstream — column standardization, type coercion, validation, deduplication, loading
to a database — works exactly the same whether the data came from Drive or a local folder. That's
the point of keeping the ingest layer thin: it does one job (get data into memory) and hands off
cleanly.
A practical follow-on from here is building a consolidation pipeline on top of this ingest layer. That's covered in the Google Drive Consolidator project, which adds column standardization, row-level validation, a quarantine table for bad data, and a Streamlit dashboard showing the full before/after picture.