{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.0"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 — Data Acquisition & Exploratory Analysis\n\nIngest aggregated natural gas consumption by postal code, regional HDD data, and MPAC property tax roll summaries. Filter to residential consumers and perform exploratory analysis on gas-temperature relationships."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 — Environment Setup"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom pathlib import Path\n\nsns.set_theme(style='whitegrid', palette='colorblind')\nplt.rcParams['figure.figsize'] = (12, 6)\nplt.rcParams['figure.dpi'] = 100\n\nprint('Environment ready.')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 — Generate Synthetic Gas Consumption Data\n\nSimulate 150 postal codes × 24 months of residential natural gas consumption.\nEach postal code has a true thermal slope, baseload, and customer count."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "np.random.seed(42)\n\nN_POSTAL = 150\nN_MONTHS = 24\nMONTHS = pd.date_range('2024-01-01', periods=N_MONTHS, freq='MS')\n\n# Monthly HDD for southern Ontario (approximate)\nhdd_template = np.array([650, 580, 480, 250, 80, 10, 0, 0, 30, 200, 400, 600])\nhdd_monthly = np.tile(hdd_template, 2) + np.random.normal(0, 25, N_MONTHS)\nhdd_monthly = np.maximum(hdd_monthly, 0)\n\n# Postal code characteristics\npostal_codes = [f'N2{chr(65+i//26)}{chr(65+i%26):1s}{np.random.randint(1,9)}{chr(65+np.random.randint(0,26))}{np.random.randint(0,9)}'\n                for i in range(N_POSTAL)]\n\n# True thermal slopes (GJ per HDD) — range from well-insulated to leaky\ntrue_slopes = np.random.uniform(0.02, 0.12, N_POSTAL)\n# Customer counts per postal code\ncustomer_counts = np.random.randint(10, 40, N_POSTAL)\n# Baseload (GJ/month — DHW, cooking)\nbaseloads = np.random.uniform(1.5, 4.0, N_POSTAL) * customer_counts\n\nprint(f'Generated {N_POSTAL} postal codes with {N_MONTHS} months of data.')\nprint(f'HDD range: {hdd_monthly.min():.0f} to {hdd_monthly.max():.0f}')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3 — Assemble Gas Consumption DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "rows = []\nfor j, pc in enumerate(postal_codes):\n    for i, month in enumerate(MONTHS):\n        gas = (true_slopes[j] * customer_counts[j] * hdd_monthly[i]\n               + baseloads[j]\n               + np.random.normal(0, baseloads[j] * 0.15))\n        gas = max(gas, baseloads[j] * 0.3)  # floor at minimal baseload\n        rows.append({\n            'postal_code': pc,\n            'month': month,\n            'gas_gj': round(gas, 2),\n            'hdd': round(hdd_monthly[i], 1),\n            'customer_count': customer_counts[j],\n            'consumer_type': 'residential'\n        })\n\ngas_df = pd.DataFrame(rows)\nprint(f'Gas consumption DataFrame: {gas_df.shape}')\ngas_df.head(10)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.4 — Generate MPAC Property Tax Roll Summary\n\nSynthetic building stock characteristics per postal code: footprint, storeys, structure type, basement indicator."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "structure_types = ['detached', 'semi-detached', 'row', 'low-rise-apt']\ntype_weights = [0.50, 0.20, 0.25, 0.05]\n\nmpac_rows = []\nfor j, pc in enumerate(postal_codes):\n    stype = np.random.choice(structure_types, p=type_weights)\n    storeys = np.random.choice([1, 1.5, 2, 2.5, 3],\n                                p=[0.25, 0.15, 0.35, 0.15, 0.10])\n    footprint = {'detached': np.random.uniform(80, 160),\n                 'semi-detached': np.random.uniform(60, 120),\n                 'row': np.random.uniform(45, 90),\n                 'low-rise-apt': np.random.uniform(200, 500)}[stype]\n    has_basement = np.random.choice([1.0, 0.5, 0.0], p=[0.55, 0.30, 0.15])\n\n    mpac_rows.append({\n        'postal_code': pc,\n        'structure_type': stype,\n        'avg_storeys': storeys,\n        'avg_footprint_m2': round(footprint, 1),\n        'basement_fraction': has_basement,\n        'property_count': customer_counts[j],\n        'true_slope': true_slopes[j],  # ground truth for validation\n    })\n\nmpac_df = pd.DataFrame(mpac_rows)\nprint(f'MPAC summary: {mpac_df.shape}')\nmpac_df.head(10)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.5 — Exploratory Analysis: Gas vs. HDD"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Pick 6 representative postal codes spanning the slope range\nsorted_idx = np.argsort(true_slopes)\nsample_idx = sorted_idx[np.linspace(0, N_POSTAL-1, 6, dtype=int)]\nsample_pcs = [postal_codes[i] for i in sample_idx]\n\nfig, axes = plt.subplots(2, 3, figsize=(15, 8), sharex=True, sharey=False)\nfor ax, pc in zip(axes.flat, sample_pcs):\n    subset = gas_df[gas_df.postal_code == pc]\n    ax.scatter(subset.hdd, subset.gas_gj, alpha=0.7, s=30)\n    ax.set_title(pc, fontsize=10)\n    ax.set_xlabel('Monthly HDD')\n    ax.set_ylabel('Gas (GJ)')\n\nplt.suptitle('Gas Consumption vs. HDD — Sample Postal Codes', fontsize=14, y=1.02)\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.6 — Distribution of Customer Counts and Building Types"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))\n\nax1.hist(customer_counts, bins=20, edgecolor='black', alpha=0.7)\nax1.set_xlabel('Residential Customers per Postal Code')\nax1.set_ylabel('Count')\nax1.set_title('Customer Count Distribution')\n\nmpac_df.structure_type.value_counts().plot.barh(ax=ax2, color='steelblue')\nax2.set_xlabel('Number of Postal Codes')\nax2.set_title('Dominant Structure Type')\n\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.7 — Monthly HDD Pattern"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "fig, ax = plt.subplots(figsize=(12, 4))\nax.bar(range(N_MONTHS), hdd_monthly, color='coral', alpha=0.8)\nax.set_xticks(range(N_MONTHS))\nax.set_xticklabels([m.strftime('%b %y') for m in MONTHS], rotation=45, ha='right')\nax.set_ylabel('HDD18')\nax.set_title('Monthly Heating Degree Days - Regional Weather Station')\nplt.tight_layout()\nplt.show()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.8 — Save Intermediate Data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "gas_df.to_csv('data/ces_gas_consumption.csv', index=False)\nmpac_df.to_csv('data/ces_mpac_summary.csv', index=False)\nprint('Saved: ces_gas_consumption.csv, ces_mpac_summary.csv')"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n**Next:** Notebook 02 runs the HDD regression per postal code, normalizes by building stock, and produces the thermal intensity metric."
   ]
  }
 ]
}