A comprehensive Python-based system for scanning, analyzing, and reporting storage usage across supported storage environments. This automated workflow scans user directories, processes metadata, generates statistics, and sends personalized usage reports with visualizations to users and supervisors.
- Overview
- Features
- Directory Structure
- Prerequisites
- Workflow
- Scripts Documentation
- Usage Examples
- Output Files
This repository contains an automated pipeline for managing storage-usage data:
- Scan - Collect file metadata from configured directories (users, labs, groups)
- Summarize - Process raw data into statistical summaries
- Database - Query MySQL for billing information
- Visualize - Generate PDF reports with usage charts
- Notify - Email personalized reports to users and supervisors
The system is designed to run periodically (e.g., monthly) on a VM or server configured for the target environment.
- Multi-target scanning: Scans user directories, lab directories, and group directories
- Configurable workflow: JSON-based configuration for directories, scan types, and email settings
- Config folder overrides: All Python scripts accept `--config-dir`; the default is `config`
- Intelligent statistics: Tracks total storage, inactive files (>6 months), small files (<1KB), and yearly trends
- Database integration: Fetches billing data from MySQL and merges with usage statistics
- Automated reporting: Generates professional PDF visualizations with usage tables and charts
- Smart emailing: Sends consolidated reports to supervisors including their supervised users' data
- Flexible execution: Run all steps or individual scripts with command-line options
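The two command-line options named in this list could be parsed with `argparse` roughly as follows. This is a minimal sketch, not the scripts' actual code: `build_parser`, `resolve_config_dir`, and the example project root are hypothetical names chosen for illustration. Only the flag names, choices, and defaults come from this README.

```python
import argparse
from pathlib import Path

SCAN_TYPES = ["all", "users", "labs", "groups"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two options described in this README."""
    parser = argparse.ArgumentParser(description="Storage scan (illustrative sketch)")
    parser.add_argument(
        "--scan-type",
        nargs="+",
        choices=SCAN_TYPES,
        default=["all"],
        help="Which directory categories to scan (default: all)",
    )
    parser.add_argument(
        "--config-dir",
        default="config",
        help="Config folder, relative to the project root or absolute",
    )
    return parser


def resolve_config_dir(config_dir: str, project_root: Path) -> Path:
    """Return the config path; relative paths are joined to the project root."""
    path = Path(config_dir)
    return path if path.is_absolute() else project_root / path


# Example: select two scan types and keep the default config folder.
# The project root below is a made-up path.
args = build_parser().parse_args(["--scan-type", "labs", "groups"])
print(args.scan_type, resolve_config_dir(args.config_dir, Path("/opt/data-usage-reporting")))
```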
data-usage-reporting/
├── bin/ # Main workflow scripts (numbered execution order)
│ ├── 1_run_scans.py # Step 1: Scan directories and collect metadata
│ ├── 2_summarize_scan.py # Step 2: Process data into statistics
│ ├── 3_mysql_run.py # Step 3: Query billing database
│ ├── 4_plot_user.py # Step 4: Generate PDF reports
│ ├── 5_email_user.py # Step 5: Send email notifications
│ ├── run_analysis.sh # Default wrapper for the standard config folder
│ ├── core_server_scan.sh # Wrapper for core-server scan config files
│ └── core_server_analysis.sh # Wrapper for core-server analysis config files
├── lib/ # Reusable utility modules
│ ├── scan_utils.py # Scanning and file processing functions
│ ├── stat_utils.py # Statistical analysis functions
│ └── email_utils.py # Email sending functions
├── config/ # Configuration files (JSON)
│ ├── scan_config.json # Scan directories and processing settings
│ ├── mysql_config.json # Database connection and queries
│ └── email_config.json # SMTP and email template settings
├── config_core-server/ # Alternate configuration for core-server runs
├── data/ # Raw scan outputs (YYYYMMDD/)
├── results/ # Processed statistics and reports (YYYYMMDD/)
└── README.md # This file
- Python 3.8+ (tested with 3.11)
- Required packages:
pip install pandas matplotlib mysql-connector-python
- Access requirements:
  - Read access to the target storage directories
  - MySQL database credentials
  - SMTP server access for sending emails
The complete workflow runs 5 numbered scripts in sequence:
1_run_scans.py → 2_summarize_scan.py → 3_mysql_run.py → 4_plot_user.py → 5_email_user.py
- Scan (`1_run_scans.py`):
  - Recursively scans configured storage directories
  - Collects file metadata (size, timestamps, paths)
  - Outputs: `data/YYYYMMDD/{labs,groups,users}/{users.csv, files.csv}`
  - Supports selective scanning: `--scan-type labs groups`
- Summarize (`2_summarize_scan.py`):
  - Processes raw CSV files into aggregated statistics
  - Calculates: total storage, file counts, inactive files, yearly breakdowns
  - Outputs: `results/YYYYMMDD/combined/{user_statistics.csv, yearly_statistics.csv}`
- Database Query (`3_mysql_run.py`):
  - Queries MySQL for current billing information
  - Outputs: `results/YYYYMMDD/combined/current_user_project_bill.csv`
- Generate Reports (`4_plot_user.py`):
  - Merges statistics with billing data
  - Creates PDF reports with usage tables and bar charts (storage and file counts by year)
  - Outputs: `results/YYYYMMDD/combined/plots/{netid}_YYYY-MM-DD.pdf`
  - Also creates: `merged_user_bill.csv`, `not_in_bill.csv`, `not_in_users.csv`
- Send Emails (`5_email_user.py`):
  - Sends personalized PDFs to users
  - Sends consolidated reports to supervisors (includes supervised users' PDFs)
  - Uses SMTP with attachment support
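The five steps above could be chained by a small driver that stops at the first failing script. The sketch below is illustrative only: `build_commands` and `run_pipeline` are hypothetical helpers that are not part of the repository. The script names and the `--config-dir` and `--dry-run` flags are the ones documented in this README.

```python
import subprocess
import sys
from typing import Callable, List

STEPS = [
    "1_run_scans.py",
    "2_summarize_scan.py",
    "3_mysql_run.py",
    "4_plot_user.py",
    "5_email_user.py",
]


def build_commands(config_dir: str = "config", dry_run: bool = True) -> List[List[str]]:
    """Return one argv list per step, in execution order."""
    commands = []
    for script in STEPS:
        argv = [sys.executable, f"bin/{script}", "--config-dir", config_dir]
        if script == "5_email_user.py" and dry_run:
            argv.append("--dry-run")  # preview emails instead of sending them
        commands.append(argv)
    return commands


def run_pipeline(commands: List[List[str]], runner: Callable = subprocess.run) -> int:
    """Run the commands in order and return how many steps completed."""
    completed = 0
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            # Later steps depend on earlier outputs, so stop here.
            print(f"Step failed: {' '.join(argv)}", file=sys.stderr)
            break
        completed += 1
    return completed
```

Calling `run_pipeline(build_commands())` from the project root would run every step with the default config folder and leave the email step in dry-run mode.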
The default wrapper is bin/run_analysis.sh. It uses the standard config folder, loads environment variables from ~/.myenv, and runs the summarize, MySQL, and plotting steps.
The other wrapper scripts are for the core-server configuration files:
- bin/core_server_scan.sh runs the scan step with `config_core-server`.
- bin/core_server_analysis.sh runs the summarize step with `config_core-server`.
Scans configured directories and collects file metadata.
Usage:
python3 bin/1_run_scans.py [--scan-type {all,users,labs,groups} ...] [--config-dir CONFIG_DIR]
Options:
- `--scan-type`: Select which directories to scan (default: all)
- `--config-dir`: Config folder path, relative to the project root or absolute (default: `config`)
Outputs:
- `data/YYYYMMDD/labs/users.csv` - Lab directory summaries
- `data/YYYYMMDD/labs/files.csv` - Individual lab file metadata
- `data/YYYYMMDD/groups/users.csv` - Group directory summaries
- `data/YYYYMMDD/groups/files.csv` - Individual group file metadata
- `data/YYYYMMDD/users/users.csv` - User directory summaries
- `data/YYYYMMDD/users/files.csv` - Individual user file metadata
- `data/YYYYMMDD/scan.log` - Scan timing and statistics log
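A recursive scan that produces rows shaped like `files.csv` could look like the sketch below. It is not taken from `lib/scan_utils.py`: the function name `scan_directory`, the ISO 8601 timestamp format, and the interpretation of `skip_hidden` as "ignore names starting with a dot" are all assumptions.

```python
import os
from datetime import datetime
from typing import Dict, Iterator


def scan_directory(root: str, dir_id: int, netid: str,
                   skip_hidden: bool = True) -> Iterator[Dict[str, object]]:
    """Yield one metadata row per file found under root."""
    for current, dirnames, filenames in os.walk(root):
        if skip_hidden:
            # Prune in place so os.walk does not descend into hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
            filenames = [f for f in filenames if not f.startswith(".")]
        for name in filenames:
            path = os.path.join(current, name)
            try:
                info = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # unreadable or vanished file: skip it, keep scanning
            yield {
                "ID": dir_id,
                "NetID": netid,
                "Path": path,
                "Name": name,
                "SizeBytes": info.st_size,
                # Assumed format; the real scripts may store timestamps differently.
                "LastModified": datetime.fromtimestamp(info.st_mtime).isoformat(timespec="seconds"),
            }
```

Yielding rows one at a time keeps memory use flat on large trees; the rows can be passed straight to `csv.DictWriter.writerows`.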
Processes raw scan data into statistical summaries.
Usage:
python3 bin/2_summarize_scan.py [--config-dir CONFIG_DIR]
Options:
- `--config-dir`: Config folder path, relative to the project root or absolute (default: `config`)
Outputs:
- `results/YYYYMMDD/combined/user_statistics.csv` - Per-user summary stats
- `results/YYYYMMDD/combined/yearly_statistics.csv` - Yearly breakdown by user
- Includes: TotalFiles, TotalSizeBytes, InactiveFileCount, SmallFileCount, etc.
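The per-user columns named above could be derived from `files.csv` rows as in the following sketch. The thresholds are assumptions: "6 months" is approximated as 182 days and 1 KB is taken as 1024 bytes. `summarize_files` is a hypothetical name, and `LastModified` is assumed to be an ISO 8601 string.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable

INACTIVE_AFTER = timedelta(days=182)  # assumption: "6 months" as 182 days
SMALL_FILE_BYTES = 1024               # assumption: 1 KB as 1024 bytes


def summarize_files(rows: Iterable[Dict[str, str]], now: datetime) -> Dict[str, Dict[str, float]]:
    """Aggregate file rows into one statistics record per NetID."""
    stats: Dict[str, Dict[str, float]] = {}
    for row in rows:
        record = stats.setdefault(row["NetID"], {
            "TotalFiles": 0, "TotalSizeBytes": 0,
            "InactiveFileCount": 0, "SmallFileCount": 0, "LargestFileSize": 0,
        })
        size = int(row["SizeBytes"])
        modified = datetime.fromisoformat(row["LastModified"])
        record["TotalFiles"] += 1
        record["TotalSizeBytes"] += size
        if now - modified > INACTIVE_AFTER:
            record["InactiveFileCount"] += 1
        if size < SMALL_FILE_BYTES:
            record["SmallFileCount"] += 1
        record["LargestFileSize"] = max(record["LargestFileSize"], size)
    for record in stats.values():
        # Every record has at least one file, so the division is safe.
        record["AvgFileSize"] = record["TotalSizeBytes"] / record["TotalFiles"]
    return stats
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, makes the inactivity cutoff reproducible when a past scan folder is reprocessed.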
Queries MySQL database for billing information.
Usage:
export MYSQL_USER='your_username'
export MYSQL_PASSWORD='your-password'
python3 bin/3_mysql_run.py [--config-dir CONFIG_DIR]
Options:
- `--config-dir`: Config folder path, relative to the project root or absolute (default: `config`)
Outputs:
- `results/YYYYMMDD/combined/current_user_project_bill.csv` - Current billing data
- `results/YYYYMMDD/combined/recent_12_bill_summary.csv` - 12-month billing summary
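Because the credentials come from environment variables, it can help to fail early with a clear message when one is missing. The helper below is a sketch under that assumption. `load_mysql_credentials` is a hypothetical name, and how the real script combines these values with `mysql_config.json` is not documented here.

```python
import os
from typing import Dict, Mapping


def load_mysql_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read MYSQL_USER and MYSQL_PASSWORD, raising if either is unset or empty."""
    missing = [name for name in ("MYSQL_USER", "MYSQL_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return {"user": env["MYSQL_USER"], "password": env["MYSQL_PASSWORD"]}


# Possible use with mysql-connector-python (host and database are placeholders
# that would come from mysql_config.json):
#   import mysql.connector
#   conn = mysql.connector.connect(host="db-host", database="db-name",
#                                  **load_mysql_credentials())
```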
Generates PDF visualization reports for each user.
Usage:
python3 bin/4_plot_user.py [--config-dir CONFIG_DIR]
Options:
- `--config-dir`: Config folder path, relative to the project root or absolute (default: `config`)
Outputs:
- `results/YYYYMMDD/combined/plots/{netid}_YYYY-MM-DD.pdf` - Individual PDF reports
- `results/YYYYMMDD/combined/plot_map_all.csv` - Mapping of plots to users
- `results/YYYYMMDD/combined/merged_user_bill.csv` - Statistics merged with billing
- `results/YYYYMMDD/combined/not_in_bill.csv` - NetIDs without billing records
- `results/YYYYMMDD/combined/not_in_users.csv` - Billed users not in scan
Each PDF includes:
- User information table (project, path, dates, file stats, billing)
- Storage usage bar chart by year (GB)
- File count bar chart by year
- University of Illinois branding colors
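The merge that yields `merged_user_bill.csv`, `not_in_bill.csv`, and `not_in_users.csv` can be pictured as a join plus two set differences. The sketch below assumes the join key is `NetID` in the statistics and `user_name` in the billing data, which this README suggests but does not state. `merge_usage_with_billing` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def merge_usage_with_billing(user_stats: List[Row],
                             billing: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Return (merged, not_in_bill, not_in_users) for the two inputs."""
    # If a user has several billing rows, this sketch keeps only the last one.
    bill_by_user = {row["user_name"]: row for row in billing}
    scanned = {row["NetID"] for row in user_stats}

    merged = [{**row, **bill_by_user[row["NetID"]]}
              for row in user_stats if row["NetID"] in bill_by_user]
    not_in_bill = [row for row in user_stats if row["NetID"] not in bill_by_user]
    not_in_users = [row for row in billing if row["user_name"] not in scanned]
    return merged, not_in_bill, not_in_users
```

The two leftover lists are what make mismatches visible: a scanned directory with no billing record ends up in the second list, and a billed user with no scanned directory ends up in the third.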
Sends email reports to users and supervisors.
Usage:
python3 bin/5_email_user.py [--dry-run] [--config-dir CONFIG_DIR]
Options:
- `--dry-run`: Show what emails would be sent without actually sending them (recommended before first run)
- `--config-dir`: Config folder path, relative to the project root or absolute (default: `config`)
Behavior:
- Regular users: Receive their own PDF report
- Supervisors: Receive their own report + all supervised users' reports
- Email addresses: Defaults to `{netid}@igb.illinois.edu` if not in database
- Requires: `config/email_config.json` and supervisor mapping CSV
- Dry run mode displays a summary of recipients and attachment counts without sending
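The recipient rules above can be expressed as a small planning step that runs before any message is built. This is a sketch, not the logic of `lib/email_utils.py`: `plan_emails`, `dry_run_summary`, and the shapes of the three input dictionaries are assumptions. Only the default address pattern comes from this README.

```python
from typing import Dict, List

DEFAULT_DOMAIN = "igb.illinois.edu"


def plan_emails(pdfs: Dict[str, str],
                supervisors: Dict[str, List[str]],
                known_emails: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each recipient address to the PDF paths that recipient should receive.

    pdfs:         NetID -> path of that user's PDF
    supervisors:  supervisor NetID -> list of supervised NetIDs
    known_emails: NetID -> address found in the database
    """
    plan: Dict[str, List[str]] = {}
    for netid in sorted(set(pdfs) | set(supervisors)):
        attachments = [pdfs[netid]] if netid in pdfs else []
        for supervised in supervisors.get(netid, []):
            pdf = pdfs.get(supervised)
            if pdf and pdf not in attachments:
                attachments.append(pdf)
        if attachments:
            address = known_emails.get(netid, f"{netid}@{DEFAULT_DOMAIN}")
            plan[address] = attachments
    return plan


def dry_run_summary(plan: Dict[str, List[str]]) -> List[str]:
    """Describe the plan without sending anything."""
    return [f"{address}: {len(files)} attachment(s)" for address, files in sorted(plan.items())]
```

Separating planning from sending means a dry run and a real run can share the same plan, so what was previewed is what gets sent.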
# Set MySQL password
export MYSQL_PASSWORD='your-password'
# Run all 5 steps in sequence
python3 bin/1_run_scans.py
python3 bin/2_summarize_scan.py
python3 bin/3_mysql_run.py
python3 bin/4_plot_user.py
# Preview emails before sending (recommended)
python3 bin/5_email_user.py --dry-run
# Send emails after verifying dry run output
python3 bin/5_email_user.py
# Scan only labs and groups
python3 bin/1_run_scans.py --scan-type labs groups
# Process specific folder (edit scan_config.json first)
# Set "target_folder": "20251201" in scan_config.json
python3 bin/2_summarize_scan.py
# Generate plots without emailing
python3 bin/4_plot_user.py
{labs,groups,users}/users.csv:
| Column | Description |
|---|---|
| ID | Unique identifier |
| NetID | User/lab/group name |
| Group | Category (labs/groups/users) |
| TotalFiles | Total file count |
| TotalSizeBytes | Total size in bytes |
| LastModified | Last modification timestamp |
| ScanTime | Scan duration (seconds) |
{labs,groups,users}/files.csv:
| Column | Description |
|---|---|
| ID | Directory ID (links to users.csv) |
| NetID | Owner NetID |
| Path | Full file path |
| Name | File name |
| SizeBytes | File size |
| LastModified | Modification timestamp |
user_statistics.csv:
| Column | Description |
|---|---|
| ID | User ID |
| NetID | User NetID |
| Group | User category |
| TotalFiles | Total file count |
| TotalSizeBytes | Total storage (bytes) |
| InactiveFileCount | Files not modified in 6+ months |
| SmallFileCount | Files < 1KB |
| LargestFileSize | Largest file size (bytes) |
| AvgFileSize | Average file size (bytes) |
yearly_statistics.csv:
| Column | Description |
|---|---|
| NetID | User NetID |
| Year | File last modified year |
| FileCount | Number of files |
| FileSizeBytes | Total size for that year |
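The yearly breakdown amounts to grouping file rows by owner and by the year of `LastModified`. The sketch below assumes ISO 8601 timestamps in `files.csv`; `yearly_statistics` is a hypothetical function name.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def yearly_statistics(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group file rows by (NetID, last-modified year) and total count and size."""
    totals = defaultdict(lambda: {"FileCount": 0, "FileSizeBytes": 0})
    for row in rows:
        year = datetime.fromisoformat(row["LastModified"]).year
        bucket = totals[(row["NetID"], year)]
        bucket["FileCount"] += 1
        bucket["FileSizeBytes"] += int(row["SizeBytes"])
    # Sorted output keeps the CSV stable between runs.
    return [{"NetID": netid, "Year": year, **values}
            for (netid, year), values in sorted(totals.items())]
```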
current_user_project_bill.csv (from MySQL):
| Column | Description |
|---|---|
| data_dir_path | Storage directory path |
| data_bill_date | Billing period date |
| data_bill_avg_bytes | Average storage billed (bytes) |
| data_bill_total_cost | Total storage cost ($) |
| data_bill_billed_cost | Amount billed ($) |
| user_id | Database user ID |
| user_name | User NetID |
| user_firstname | First name |
| user_lastname | Last name |
merged_user_bill.csv: Combines user_statistics.csv with current_user_project_bill.csv
plot_map_all.csv: Maps generated PDFs to user information for email sending
PDF files named {netid}_YYYY-MM-DD.pdf containing:
- User/project information table
- Total storage usage bar chart (GB by year)
- File count bar chart (by year)
- Performance: Full scans can take several hours depending on storage size and directory depth
- Disk Space: Raw scan data can be large (several GB); consider cleanup policies
- Security: Keep `mysql_config.json` and email configs secure (add to `.gitignore`)
- Scheduling: Consider running via cron on the 1st of each month
- Error Handling: Check `data/YYYYMMDD/scan.log` for scan errors and timing
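For the disk-space note, a cleanup policy could start by listing dated scan folders older than a retention window. The sketch below only reports candidates and deletes nothing. `stale_scan_folders` and the 180-day default are illustrative assumptions, not project settings.

```python
from datetime import date, datetime, timedelta
from pathlib import Path
from typing import List


def stale_scan_folders(data_dir: str, today: date, keep_days: int = 180) -> List[Path]:
    """Return YYYYMMDD-named subfolders of data_dir dated before the cutoff."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for child in sorted(Path(data_dir).iterdir()):
        name = child.name
        if not (child.is_dir() and len(name) == 8 and name.isdigit()):
            continue  # ignore anything that is not a dated scan folder
        try:
            stamp = datetime.strptime(name, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits but not a valid calendar date
        if stamp < cutoff:
            stale.append(child)
    return stale
```

Reviewing the returned list before removing anything keeps the policy safe to run from cron.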
Scan fails with permission errors:
- Ensure the VM has read access to all directories
- Check `skip_hidden: true` in `scan_config.json`
MySQL connection fails:
- Verify the `MYSQL_PASSWORD` environment variable is set
- Test connection: `mysql -h host -u user -p database`
- Check `mysql_config.json` credentials
No PDFs generated:
- Verify `matplotlib` is installed
- Check that `merged_user_bill.csv` exists and has data
- Review `user_statistics.csv` and `current_user_project_bill.csv` for matching NetIDs
- Check `not_in_bill.csv` and `not_in_users.csv` for unmatched NetIDs
Emails not sending:
- Test SMTP connection manually
- Verify `sender_email` and SMTP credentials in `email_config.json`
- Check `plot_map_all.csv` exists
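A manual SMTP test can be as small as opening a connection and sending `NOOP`. The sketch below uses the standard-library `smtplib`. `check_smtp` is a hypothetical helper, and the host and port would have to be taken from your own `email_config.json`, whose key names are not documented here.

```python
import smtplib
from typing import Callable


def check_smtp(host: str, port: int = 25, timeout: float = 10.0,
               smtp_factory: Callable = smtplib.SMTP) -> bool:
    """Return True if the server accepts a connection and answers NOOP with 250."""
    try:
        with smtp_factory(host, port, timeout=timeout) as server:
            code, _message = server.noop()
            return code == 250
    except (OSError, smtplib.SMTPException):
        # Covers refused connections, timeouts, DNS failures, and protocol errors.
        return False
```

For example, `check_smtp("smtp.example.org", 25)` returns `False` when the host is unreachable from the VM, which separates network problems from credential problems.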
This project is licensed under the GPLv3 License – see the LICENSE file for details.