{
"cells": [
{
"cell_type": "markdown",
"id": "3d9c3502",
"metadata": {},
"source": [
"# 🚀 ML Phase 2: Text Features with TF-IDF + SVD\n",
"\n",
"**Goal:** Add text features from the two task columns to improve R² from ~0.4 to 0.5-0.7\n",
"\n",
"**Dataset:** FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx  \n",
"⚠️ **IMPORTANT:** Use the SAME DATASET as Phase 1 for a fair comparison!\n",
"\n",
"**Phase 1 Results (Baseline - Numeric Features Only):**\n",
"- Best Model: **Decision Tree**\n",
"- Val R²: **0.4136**\n",
"- Val MAE: **3.3841**\n",
"- Test R²: TBD\n",
"\n",
"**Phase 2 Strategy:**\n",
"- TF-IDF vectorization of the two text columns (all_task_normal + all_task_dinhky)\n",
"- SVD (Singular Value Decomposition) for dimensionality reduction (200 → 50 dimensions)\n",
"- Combine with the numeric features (dropping the categorical columns: loai_ca, bat_dau, ket_thuc, muc_do_luu_luong)\n",
"- Re-train and compare with Phase 1\n",
"\n",
"---\n",
"\n",
"## 📋 Contents:\n",
"\n",
"1. ✅ Load Data & Setup\n",
"2. ✅ Text Preprocessing\n",
"3. ✅ TF-IDF Vectorization\n",
"4. ✅ SVD Dimensionality Reduction\n",
"5. ✅ Feature Combination\n",
"6. ✅ Train/Val/Test Split\n",
"7. ✅ Model Training\n",
"8. ✅ Results Comparison (Phase 1 vs Phase 2)\n",
"9. ✅ Feature Importance Analysis\n",
"10. ✅ Final Recommendations"
]
},
{
"cell_type": "markdown",
"id": "5e6f79fc",
"metadata": {},
"source": [
"# 1️⃣ Import Libraries & Setup"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "94c942bd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"✅ Libraries imported successfully!\n",
"📅 Date: 2026-01-06 01:12:18\n"
]
}
],
"source": [
"# Import essential libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import warnings\n",
"import os\n",
"import re\n",
"from datetime import datetime\n",
"\n",
"# Text processing\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import TruncatedSVD\n",
"\n",
"# Machine Learning\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.tree import DecisionTreeRegressor\n",
"from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n",
"from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n",
"import xgboost as xgb\n",
"import lightgbm as lgb\n",
"\n",
"# Configuration\n",
"warnings.filterwarnings('ignore')\n",
"plt.style.use('seaborn-v0_8-darkgrid')\n",
"sns.set_palette(\"husl\")\n",
"pd.set_option('display.max_columns', None)\n",
"pd.set_option('display.max_rows', 100)\n",
"\n",
"# Random seed\n",
"RANDOM_STATE = 42\n",
"np.random.seed(RANDOM_STATE)\n",
"\n",
"# Create folders\n",
"os.makedirs('phase2_output/models', exist_ok=True)\n",
"os.makedirs('phase2_output/plots', exist_ok=True)\n",
"os.makedirs('phase2_output/data', exist_ok=True)\n",
"\n",
"print(\"✅ Libraries imported successfully!\")\n",
"print(f\"📅 Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\")"
]
},
{
"cell_type": "markdown",
"id": "7becc15d",
"metadata": {},
"source": [
"# 2️⃣ Load Data & Verify"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "bc631294",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📂 Loading dataset...\n",
"✅ Dataset loaded!\n",
"📊 Shape: (454, 51)\n",
"\n",
"📋 Columns (51):['ma_dia_diem', 'all_task_normal', 'all_task_dinhky', 'loai_ca', 'bat_dau', 'ket_thuc', 'tong_gio_lam', 'so_ca_cua_toa', 'so_luong', 'num_tasks']... (showing first 10)\n",
"\n",
"🔍 Checking text columns:\n",
" all_task_normal: ✅ Found\n",
" all_task_dinhky: ✅ Found\n",
"\n",
" all_task_normal - Missing: 27 (5.9%)\n",
" all_task_dinhky - Missing: 168 (37.0%)\n",
"\n",
"📄 Sample text data:\n",
"\n",
"all_task_normal (first 200 chars):\n",
"Làm sạch toàn bộ phòng giao dịch tầng 1 (kể cả wc); Làm sạch bậc tam cấp + ngoại cảnh + tầng hầm; Làm sạch toàn bộ văn phòng tầng 2 + phòng lãnh đạo (kể cả wc); Làm sạch toàn bộ thang bộ từ tầng 1 lên...\n",
"\n",
"all_task_dinhky (first 200 chars):\n",
"nan...\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ma_dia_diem</th>\n",
" <th>all_task_normal</th>\n",
" <th>all_task_dinhky</th>\n",
" <th>loai_ca</th>\n",
" <th>bat_dau</th>\n",
" <th>ket_thuc</th>\n",
" <th>tong_gio_lam</th>\n",
" <th>so_ca_cua_toa</th>\n",
" <th>so_luong</th>\n",
" <th>num_tasks</th>\n",
" <th>num_cleaning_tasks</th>\n",
" <th>num_trash_collection_tasks</th>\n",
" <th>num_monitoring_tasks</th>\n",
" <th>num_room_cleaning_tasks</th>\n",
" <th>num_deep_cleaning_tasks</th>\n",
" <th>num_maintenance_tasks</th>\n",
" <th>num_support_tasks</th>\n",
" <th>num_other_tasks</th>\n",
" <th>num_wc_tasks</th>\n",
" <th>num_hallway_tasks</th>\n",
" <th>num_lobby_tasks</th>\n",
" <th>num_patient_room_tasks</th>\n",
" <th>num_clinic_room_tasks</th>\n",
" <th>num_surgery_room_tasks</th>\n",
" <th>num_outdoor_tasks</th>\n",
" <th>num_elevator_tasks</th>\n",
" <th>num_office_tasks</th>\n",
" <th>num_technical_room_tasks</th>\n",
" <th>cleaning_ratio</th>\n",
" <th>trash_collection_ratio</th>\n",
" <th>monitoring_ratio</th>\n",
" <th>room_cleaning_ratio</th>\n",
" <th>area_diversity</th>\n",
" <th>task_complexity_score</th>\n",
" <th>loai_hinh</th>\n",
" <th>ten_toa_thap</th>\n",
" <th>muc_do_luu_luong</th>\n",
" <th>so_tang</th>\n",
" <th>so_cua_thang_may</th>\n",
" <th>dien_tich_ngoai_canh</th>\n",
" <th>dien_tich_sanh</th>\n",
" <th>dien_tich_hanh_lang</th>\n",
" <th>dien_tich_wc</th>\n",
" <th>dien_tich_phong</th>\n",
" <th>dien_tich_tham</th>\n",
" <th>doc_ham</th>\n",
" <th>vien_phan_quang</th>\n",
" <th>op_tuong</th>\n",
" <th>op_chan_tuong</th>\n",
" <th>ranh_thoat_nuoc</th>\n",
" <th>dien_tich_kinh</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>115-2</td>\n",
" <td>Làm sạch toàn bộ phòng giao dịch tầng 1 (kể cả...</td>\n",
" <td>NaN</td>\n",
" <td>Part time</td>\n",
" <td>06:30:00</td>\n",
" <td>10:30:00</td>\n",
" <td>4.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>7.0</td>\n",
" <td>7.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.000</td>\n",
" <td>0.1429</td>\n",
" <td>0.2857</td>\n",
" <td>0.0000</td>\n",
" <td>4.0</td>\n",
" <td>1.38</td>\n",
" <td>0</td>\n",
" <td>AGRIBANK CHI NHÁNH MỸ ĐÌNH</td>\n",
" <td>Trung bình (11–20 người)</td>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>15.0</td>\n",
" <td>290.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0</td>\n",
" <td>20.0</td>\n",
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>115-5</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>Hành chính</td>\n",
|
||
" <td>06:30:00</td>\n",
|
||
" <td>16:00:00</td>\n",
|
||
" <td>8.0</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>AGRIBANK PGD SỐ 05</td>\n",
|
||
" <td>Trung bình (11–20 người)</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>15.0</td>\n",
|
||
" <td>300.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>30.0</td>\n",
|
||
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>101-1</td>\n",
" <td>Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,...</td>\n",
" <td>Lau bảng biển, bình cứu hỏa , cây nước hành la...</td>\n",
" <td>Hành chính</td>\n",
" <td>06:30:00</td>\n",
" <td>16:00:00</td>\n",
" <td>7.5</td>\n",
" <td>6</td>\n",
" <td>24</td>\n",
" <td>441.0</td>\n",
" <td>258.0</td>\n",
" <td>145.0</td>\n",
" <td>134.0</td>\n",
" <td>65.0</td>\n",
" <td>75.0</td>\n",
" <td>62.0</td>\n",
" <td>57.0</td>\n",
" <td>45.0</td>\n",
" <td>89.0</td>\n",
" <td>90.0</td>\n",
" <td>5.0</td>\n",
" <td>41.0</td>\n",
" <td>25.0</td>\n",
" <td>30.0</td>\n",
" <td>4.0</td>\n",
" <td>12.0</td>\n",
" <td>39.0</td>\n",
" <td>16.0</td>\n",
" <td>0.585</td>\n",
" <td>0.3288</td>\n",
" <td>0.3039</td>\n",
" <td>0.1474</td>\n",
" <td>10.0</td>\n",
" <td>10.00</td>\n",
" <td>0</td>\n",
" <td>Tòa 5 tầng</td>\n",
" <td>Rất cao (Trên 40 người)</td>\n",
" <td>10</td>\n",
" <td>18</td>\n",
" <td>1700.0</td>\n",
" <td>0.0</td>\n",
" <td>2600.0</td>\n",
" <td>348.0</td>\n",
" <td>6825.0</td>\n",
" <td>0.0</td>\n",
" <td>70</td>\n",
" <td>0</td>\n",
" <td>9176.0</td>\n",
" <td>89.0</td>\n",
" <td>25</td>\n",
" <td>894.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>101-1</td>\n",
" <td>Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,...</td>\n",
" <td>Lau bảng biển, bình cứu hỏa , cây nước hành la...</td>\n",
" <td>Ca sáng</td>\n",
" <td>06:00:00</td>\n",
" <td>14:00:00</td>\n",
" <td>8.0</td>\n",
" <td>6</td>\n",
" <td>3</td>\n",
" <td>441.0</td>\n",
" <td>258.0</td>\n",
" <td>145.0</td>\n",
" <td>134.0</td>\n",
" <td>65.0</td>\n",
" <td>75.0</td>\n",
" <td>62.0</td>\n",
" <td>57.0</td>\n",
" <td>45.0</td>\n",
" <td>89.0</td>\n",
" <td>90.0</td>\n",
" <td>5.0</td>\n",
" <td>41.0</td>\n",
" <td>25.0</td>\n",
" <td>30.0</td>\n",
" <td>4.0</td>\n",
" <td>12.0</td>\n",
" <td>39.0</td>\n",
" <td>16.0</td>\n",
" <td>0.585</td>\n",
" <td>0.3288</td>\n",
" <td>0.3039</td>\n",
" <td>0.1474</td>\n",
" <td>10.0</td>\n",
" <td>10.00</td>\n",
" <td>0</td>\n",
" <td>Tòa 5 tầng</td>\n",
" <td>Rất cao (Trên 40 người)</td>\n",
" <td>10</td>\n",
" <td>18</td>\n",
" <td>1700.0</td>\n",
" <td>0.0</td>\n",
" <td>2600.0</td>\n",
" <td>348.0</td>\n",
" <td>6825.0</td>\n",
" <td>0.0</td>\n",
" <td>70</td>\n",
" <td>0</td>\n",
" <td>9176.0</td>\n",
" <td>89.0</td>\n",
" <td>25</td>\n",
" <td>894.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>101-1</td>\n",
" <td>Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,...</td>\n",
" <td>Lau bảng biển, bình cứu hỏa , cây nước hành la...</td>\n",
" <td>Ca chiều</td>\n",
" <td>14:00:00</td>\n",
" <td>22:00:00</td>\n",
" <td>8.0</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>441.0</td>\n",
" <td>258.0</td>\n",
" <td>145.0</td>\n",
" <td>134.0</td>\n",
" <td>65.0</td>\n",
" <td>75.0</td>\n",
" <td>62.0</td>\n",
" <td>57.0</td>\n",
" <td>45.0</td>\n",
" <td>89.0</td>\n",
" <td>90.0</td>\n",
" <td>5.0</td>\n",
" <td>41.0</td>\n",
" <td>25.0</td>\n",
" <td>30.0</td>\n",
" <td>4.0</td>\n",
" <td>12.0</td>\n",
" <td>39.0</td>\n",
" <td>16.0</td>\n",
" <td>0.585</td>\n",
" <td>0.3288</td>\n",
" <td>0.3039</td>\n",
" <td>0.1474</td>\n",
" <td>10.0</td>\n",
" <td>10.00</td>\n",
" <td>0</td>\n",
" <td>Tòa 5 tầng</td>\n",
" <td>Rất cao (Trên 40 người)</td>\n",
" <td>10</td>\n",
" <td>18</td>\n",
" <td>1700.0</td>\n",
" <td>0.0</td>\n",
" <td>2600.0</td>\n",
" <td>348.0</td>\n",
" <td>6825.0</td>\n",
" <td>0.0</td>\n",
" <td>70</td>\n",
" <td>0</td>\n",
" <td>9176.0</td>\n",
" <td>89.0</td>\n",
" <td>25</td>\n",
" <td>894.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ma_dia_diem all_task_normal \\\n",
"0 115-2 Làm sạch toàn bộ phòng giao dịch tầng 1 (kể cả... \n",
"1 115-5 NaN \n",
"2 101-1 Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... \n",
"3 101-1 Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... \n",
"4 101-1 Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... \n",
"\n",
" all_task_dinhky loai_ca bat_dau \\\n",
"0 NaN Part time 06:30:00 \n",
"1 NaN Hành chính 06:30:00 \n",
"2 Lau bảng biển, bình cứu hỏa , cây nước hành la... Hành chính 06:30:00 \n",
"3 Lau bảng biển, bình cứu hỏa , cây nước hành la... Ca sáng 06:00:00 \n",
"4 Lau bảng biển, bình cứu hỏa , cây nước hành la... Ca chiều 14:00:00 \n",
"\n",
" ket_thuc tong_gio_lam so_ca_cua_toa so_luong num_tasks \\\n",
"0 10:30:00 4.0 1 1 7.0 \n",
"1 16:00:00 8.0 1 1 NaN \n",
"2 16:00:00 7.5 6 24 441.0 \n",
"3 14:00:00 8.0 6 3 441.0 \n",
"4 22:00:00 8.0 6 5 441.0 \n",
"\n",
" num_cleaning_tasks num_trash_collection_tasks num_monitoring_tasks \\\n",
"0 7.0 1.0 2.0 \n",
"1 NaN NaN NaN \n",
"2 258.0 145.0 134.0 \n",
"3 258.0 145.0 134.0 \n",
"4 258.0 145.0 134.0 \n",
"\n",
" num_room_cleaning_tasks num_deep_cleaning_tasks num_maintenance_tasks \\\n",
"0 0.0 1.0 1.0 \n",
"1 NaN NaN NaN \n",
"2 65.0 75.0 62.0 \n",
"3 65.0 75.0 62.0 \n",
"4 65.0 75.0 62.0 \n",
"\n",
" num_support_tasks num_other_tasks num_wc_tasks num_hallway_tasks \\\n",
"0 0.0 0.0 4.0 0.0 \n",
"1 NaN NaN NaN NaN \n",
"2 57.0 45.0 89.0 90.0 \n",
"3 57.0 45.0 89.0 90.0 \n",
"4 57.0 45.0 89.0 90.0 \n",
"\n",
" num_lobby_tasks num_patient_room_tasks num_clinic_room_tasks \\\n",
"0 0.0 0.0 0.0 \n",
"1 NaN NaN NaN \n",
"2 5.0 41.0 25.0 \n",
"3 5.0 41.0 25.0 \n",
"4 5.0 41.0 25.0 \n",
"\n",
" num_surgery_room_tasks num_outdoor_tasks num_elevator_tasks \\\n",
"0 0.0 2.0 3.0 \n",
"1 NaN NaN NaN \n",
"2 30.0 4.0 12.0 \n",
"3 30.0 4.0 12.0 \n",
"4 30.0 4.0 12.0 \n",
"\n",
" num_office_tasks num_technical_room_tasks cleaning_ratio \\\n",
"0 1.0 0.0 1.000 \n",
"1 NaN NaN NaN \n",
"2 39.0 16.0 0.585 \n",
"3 39.0 16.0 0.585 \n",
"4 39.0 16.0 0.585 \n",
"\n",
" trash_collection_ratio monitoring_ratio room_cleaning_ratio \\\n",
"0 0.1429 0.2857 0.0000 \n",
"1 NaN NaN NaN \n",
"2 0.3288 0.3039 0.1474 \n",
"3 0.3288 0.3039 0.1474 \n",
"4 0.3288 0.3039 0.1474 \n",
"\n",
" area_diversity task_complexity_score loai_hinh \\\n",
"0 4.0 1.38 0 \n",
"1 NaN NaN 0 \n",
"2 10.0 10.00 0 \n",
"3 10.0 10.00 0 \n",
"4 10.0 10.00 0 \n",
"\n",
" ten_toa_thap muc_do_luu_luong so_tang \\\n",
"0 AGRIBANK CHI NHÁNH MỸ ĐÌNH Trung bình (11–20 người) 4 \n",
"1 AGRIBANK PGD SỐ 05 Trung bình (11–20 người) 1 \n",
"2 Tòa 5 tầng Rất cao (Trên 40 người) 10 \n",
"3 Tòa 5 tầng Rất cao (Trên 40 người) 10 \n",
"4 Tòa 5 tầng Rất cao (Trên 40 người) 10 \n",
"\n",
" so_cua_thang_may dien_tich_ngoai_canh dien_tich_sanh \\\n",
"0 0 0.0 0.0 \n",
"1 0 0.0 0.0 \n",
"2 18 1700.0 0.0 \n",
"3 18 1700.0 0.0 \n",
"4 18 1700.0 0.0 \n",
"\n",
" dien_tich_hanh_lang dien_tich_wc dien_tich_phong dien_tich_tham \\\n",
"0 0.0 15.0 290.0 0.0 \n",
"1 0.0 15.0 300.0 0.0 \n",
"2 2600.0 348.0 6825.0 0.0 \n",
"3 2600.0 348.0 6825.0 0.0 \n",
"4 2600.0 348.0 6825.0 0.0 \n",
"\n",
" doc_ham vien_phan_quang op_tuong op_chan_tuong ranh_thoat_nuoc \\\n",
"0 0 0 0.0 0.0 0 \n",
"1 0 0 0.0 0.0 0 \n",
"2 70 0 9176.0 89.0 25 \n",
"3 70 0 9176.0 89.0 25 \n",
"4 70 0 9176.0 89.0 25 \n",
"\n",
" dien_tich_kinh \n",
"0 20.0 \n",
"1 30.0 \n",
"2 894.0 \n",
"3 894.0 \n",
"4 894.0 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(\"📂 Loading dataset...\")\n",
"# ⚠️ IMPORTANT: Using the SAME dataset as Phase 1 for fair comparison!\n",
"df = pd.read_excel('FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx')\n",
"\n",
"print(f\"✅ Dataset loaded!\")\n",
"print(f\"📊 Shape: {df.shape}\")\n",
"print(f\"\\n📋 Columns ({len(df.columns)}):{df.columns.tolist()[:10]}... (showing first 10)\")\n",
"\n",
"# Verify text columns\n",
"print(\"\\n🔍 Checking text columns:\")\n",
"print(f\" all_task_normal: {'✅ Found' if 'all_task_normal' in df.columns else '❌ Missing'}\")\n",
"print(f\" all_task_dinhky: {'✅ Found' if 'all_task_dinhky' in df.columns else '❌ Missing'}\")\n",
"\n",
"# Check missing values in text columns\n",
"if 'all_task_normal' in df.columns:\n",
"    print(f\"\\n all_task_normal - Missing: {df['all_task_normal'].isnull().sum()} ({df['all_task_normal'].isnull().sum()/len(df)*100:.1f}%)\")\n",
"if 'all_task_dinhky' in df.columns:\n",
"    print(f\" all_task_dinhky - Missing: {df['all_task_dinhky'].isnull().sum()} ({df['all_task_dinhky'].isnull().sum()/len(df)*100:.1f}%)\")\n",
"\n",
"# Sample text data\n",
"print(\"\\n📄 Sample text data:\")\n",
"print(\"\\nall_task_normal (first 200 chars):\")\n",
"print(str(df['all_task_normal'].iloc[0])[:200] + \"...\")\n",
"print(\"\\nall_task_dinhky (first 200 chars):\")\n",
"print(str(df['all_task_dinhky'].iloc[0])[:200] + \"...\")\n",
"\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "e54a1e43",
"metadata": {},
"source": [
"# 3️⃣ Text Preprocessing\n",
"\n",
"Prepare the text data for TF-IDF:\n",
"- Lowercase\n",
"- Remove special characters\n",
"- Handle missing values\n",
"- Combine both task columns"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "8cab7fd2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"================================================================================\n",
"🧹 TEXT PREPROCESSING\n",
"================================================================================\n",
"\n",
"1. Preprocessing all_task_normal...\n",
" ✓ Done. Non-empty: 427 / 454\n",
"\n",
"2. Preprocessing all_task_dinhky...\n",
" ✓ Done. Non-empty: 286 / 454\n",
"\n",
"3. Combining both task columns...\n",
"\n",
"4. Text statistics:\n",
" Average length: 9394 characters\n",
" Average words: 2243 words\n",
" Min words: 0\n",
" Max words: 10342\n",
" Empty texts: 27\n",
"\n",
"5. Sample cleaned text:\n",
"làm sạch toàn bộ phòng giao dịch tầng 1 kể cả wc làm sạch bậc tam cấp ngoại cảnh tầng hầm làm sạch toàn bộ văn phòng tầng 2 phòng lãnh đạo kể cả wc làm sạch toàn bộ thang bộ từ tầng 1 lên tầng 4 trực lại toàn bộ wc từ tầng 1 tầng 2 lưu ý đặt giấy lau vết bẩn phát sinh gạt kính cửa ra vào thay rác tr...\n",
"\n",
"✅ Text preprocessing complete!\n"
]
}
],
"source": [
"print(\"=\" * 80)\n",
"print(\"🧹 TEXT PREPROCESSING\")\n",
"print(\"=\" * 80)\n",
"\n",
"def preprocess_text(text):\n",
"    \"\"\"\n",
"    Preprocess Vietnamese text\n",
"    \"\"\"\n",
"    if pd.isna(text) or str(text).strip() == '':\n",
"        return ''\n",
"\n",
"    text = str(text).lower()\n",
"\n",
"    # Keep Vietnamese characters, numbers, spaces\n",
"    # Remove special chars except spaces\n",
"    text = re.sub(r'[^a-zàáạảãâầấậẩẫăằắặẳẵèéẹẻẽêềếệểễìíịỉĩòóọỏõôồốộổỗơờớợởỡùúụủũưừứựửữỳýỵỷỹđ0-9\\s]', ' ', text)\n",
"\n",
"    # Remove multiple spaces\n",
"    text = re.sub(r'\\s+', ' ', text).strip()\n",
"\n",
"    return text\n",
"\n",
"# Preprocess both columns\n",
"print(\"\\n1. Preprocessing all_task_normal...\")\n",
"df['task_normal_clean'] = df['all_task_normal'].apply(preprocess_text)\n",
"print(f\" ✓ Done. Non-empty: {(df['task_normal_clean'] != '').sum()} / {len(df)}\")\n",
"\n",
"print(\"\\n2. Preprocessing all_task_dinhky...\")\n",
"df['task_dinhky_clean'] = df['all_task_dinhky'].apply(preprocess_text)\n",
"print(f\" ✓ Done. Non-empty: {(df['task_dinhky_clean'] != '').sum()} / {len(df)}\")\n",
"\n",
"# Combine both task columns\n",
"print(\"\\n3. Combining both task columns...\")\n",
"df['all_tasks_combined'] = df['task_normal_clean'] + ' ' + df['task_dinhky_clean']\n",
"df['all_tasks_combined'] = df['all_tasks_combined'].str.strip()\n",
"\n",
"# Statistics\n",
"print(\"\\n4. Text statistics:\")\n",
"df['text_length'] = df['all_tasks_combined'].str.len()\n",
"df['text_word_count'] = df['all_tasks_combined'].str.split().str.len()\n",
"\n",
"print(f\" Average length: {df['text_length'].mean():.0f} characters\")\n",
"print(f\" Average words: {df['text_word_count'].mean():.0f} words\")\n",
"print(f\" Min words: {df['text_word_count'].min():.0f}\")\n",
"print(f\" Max words: {df['text_word_count'].max():.0f}\")\n",
"print(f\" Empty texts: {(df['all_tasks_combined'] == '').sum()}\")\n",
"\n",
"# Sample cleaned text\n",
"print(\"\\n5. Sample cleaned text:\")\n",
"print(df['all_tasks_combined'].iloc[0][:300] + \"...\")\n",
"\n",
"print(\"\\n✅ Text preprocessing complete!\")"
]
},
{
"cell_type": "markdown",
"id": "11cb1a6d",
"metadata": {},
"source": [
"# 4️⃣ TF-IDF Vectorization\n",
"\n",
"Convert the text into numeric features with TF-IDF"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "59aa536f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"================================================================================\n",
"📊 TF-IDF VECTORIZATION\n",
"================================================================================\n",
"\n",
"TF-IDF Configuration:\n",
" max_features: 200\n",
" ngram_range: (1, 2)\n",
" min_df: 2\n",
" max_df: 0.95\n",
"\n",
"Creating TF-IDF vectorizer...\n",
"\n",
"Fitting TF-IDF on all documents...\n",
"\n",
"✅ TF-IDF complete!\n",
" Output shape: (454, 200)\n",
" Actual features extracted: 200\n",
" Sparsity: 47.20%\n",
"\n",
"📋 Top 20 TF-IDF features (by document frequency):\n",
" feature doc_frequency\n",
" rác 412\n",
" lau 408\n",
" sinh 408\n",
" kính 404\n",
" làm 404\n",
" vệ 399\n",
" định 397\n",
" phòng 395\n",
" thu 393\n",
" vệ sinh 392\n",
" sạch 386\n",
" quét 385\n",
" nghỉ 381\n",
" trực 380\n",
" cửa 380\n",
"làm sạch 378\n",
" định kỳ 370\n",
" kỳ 370\n",
" cụ 368\n",
" dụng 368\n",
"\n",
"✅ TF-IDF DataFrame created: (454, 200)\n"
]
}
],
"source": [
"print(\"=\" * 80)\n",
"print(\"📊 TF-IDF VECTORIZATION\")\n",
"print(\"=\" * 80)\n",
"\n",
"# TF-IDF parameters\n",
"max_features = 200 # Cap on the number of TF-IDF features\n",
"ngram_range = (1, 2) # Unigrams and bigrams\n",
"min_df = 2 # Term must appear in at least 2 documents\n",
"max_df = 0.95 # Term must appear in at most 95% of documents\n",
"\n",
"print(f\"\\nTF-IDF Configuration:\")\n",
"print(f\" max_features: {max_features}\")\n",
"print(f\" ngram_range: {ngram_range}\")\n",
"print(f\" min_df: {min_df}\")\n",
"print(f\" max_df: {max_df}\")\n",
"\n",
"# Create TF-IDF vectorizer\n",
"print(\"\\nCreating TF-IDF vectorizer...\")\n",
"tfidf = TfidfVectorizer(\n",
"    max_features=max_features,\n",
"    ngram_range=ngram_range,\n",
"    min_df=min_df,\n",
"    max_df=max_df,\n",
"    sublinear_tf=True, # Use log scaling\n",
"    strip_accents=None # Keep Vietnamese accents\n",
")\n",
"\n",
"# Fit and transform\n",
"# Note: the vectorizer is fitted on every row, before any train/val/test split\n",
"print(\"\\nFitting TF-IDF on all documents...\")\n",
"tfidf_matrix = tfidf.fit_transform(df['all_tasks_combined'])\n",
"\n",
"print(f\"\\n✅ TF-IDF complete!\")\n",
"print(f\" Output shape: {tfidf_matrix.shape}\")\n",
"print(f\" Actual features extracted: {len(tfidf.get_feature_names_out())}\")\n",
"print(f\" Sparsity: {(1.0 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%\")\n",
"\n",
"# Top features by document frequency\n",
"print(\"\\n📋 Top 20 TF-IDF features (by document frequency):\")\n",
"feature_names = tfidf.get_feature_names_out()\n",
"doc_freq = (tfidf_matrix > 0).sum(axis=0).A1\n",
"top_features = pd.DataFrame({\n",
"    'feature': feature_names,\n",
"    'doc_frequency': doc_freq\n",
"}).sort_values('doc_frequency', ascending=False)\n",
"\n",
"print(top_features.head(20).to_string(index=False))\n",
"\n",
"# Store the TF-IDF matrix as a DataFrame (optional, for inspection)\n",
"tfidf_df = pd.DataFrame(\n",
"    tfidf_matrix.toarray(),\n",
"    columns=[f'tfidf_{i}' for i in range(tfidf_matrix.shape[1])]\n",
")\n",
"\n",
"print(f\"\\n✅ TF-IDF DataFrame created: {tfidf_df.shape}\")"
]
},
{
"cell_type": "markdown",
"id": "aa687eb6",
"metadata": {},
"source": [
"# 5️⃣ SVD Dimensionality Reduction\n",
"\n",
"Reduce the dimensionality of the TF-IDF features with SVD to limit overfitting"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "77ae5dbb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"================================================================================\n",
"🔬 SVD DIMENSIONALITY REDUCTION\n",
"================================================================================\n",
"\n",
"SVD Configuration:\n",
" Input dimensions: 200\n",
" Output dimensions: 50\n",
" Reduction: 75.0%\n",
"\n",
"Applying SVD...\n",
"\n",
"✅ SVD complete!\n",
" Output shape: (454, 50)\n",
" Explained variance: 89.66%\n",
"\n",
"📊 Variance explained by top 10 components:\n",
" Component 1: 11.08%\n",
" Component 2: 8.74%\n",
" Component 3: 7.92%\n",
" Component 4: 5.04%\n",
" Component 5: 3.60%\n",
" Component 6: 3.36%\n",
" Component 7: 3.29%\n",
" Component 8: 3.13%\n",
" Component 9: 3.06%\n",
" Component 10: 2.75%\n",
"\n",
" Cumulative variance (first 10): 51.95%\n",
" Cumulative variance (all 50): 89.66%\n",
"\n",
"✅ SVD features DataFrame created: (454, 50)\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1200x500 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"✅ Saved: phase2_output/plots/01_svd_variance.png\n"
]
}
],
"source": [
"print(\"=\" * 80)\n",
"print(\"🔬 SVD DIMENSIONALITY REDUCTION\")\n",
"print(\"=\" * 80)\n",
"\n",
"# SVD parameters\n",
"n_components = 50 # Reduce to 50 components\n",
"\n",
"print(f\"\\nSVD Configuration:\")\n",
"print(f\" Input dimensions: {tfidf_matrix.shape[1]}\")\n",
"print(f\" Output dimensions: {n_components}\")\n",
"print(f\" Reduction: {(1 - n_components/tfidf_matrix.shape[1]) * 100:.1f}%\")\n",
"\n",
"# Apply SVD\n",
"print(\"\\nApplying SVD...\")\n",
"svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_STATE)\n",
"text_features_svd = svd.fit_transform(tfidf_matrix)\n",
"\n",
"print(f\"\\n✅ SVD complete!\")\n",
"print(f\" Output shape: {text_features_svd.shape}\")\n",
"print(f\" Explained variance: {svd.explained_variance_ratio_.sum()*100:.2f}%\")\n",
"\n",
"# Variance explained by each component\n",
"print(\"\\n📊 Variance explained by top 10 components:\")\n",
"for i in range(min(10, n_components)):\n",
"    print(f\" Component {i+1}: {svd.explained_variance_ratio_[i]*100:.2f}%\")\n",
"\n",
"# Cumulative variance (index guarded in case n_components is set below 10)\n",
"cumsum_var = np.cumsum(svd.explained_variance_ratio_)\n",
"print(f\"\\n Cumulative variance (first 10): {cumsum_var[min(10, n_components) - 1]*100:.2f}%\")\n",
"print(f\" Cumulative variance (all {n_components}): {cumsum_var[-1]*100:.2f}%\")\n",
"\n",
"# Create DataFrame with SVD features\n",
"text_features_df = pd.DataFrame(\n",
"    text_features_svd,\n",
"    columns=[f'text_svd_{i+1}' for i in range(n_components)]\n",
")\n",
"\n",
"print(f\"\\n✅ SVD features DataFrame created: {text_features_df.shape}\")\n",
"\n",
"# Visualize variance explained\n",
"plt.figure(figsize=(12, 5))\n",
"\n",
"plt.subplot(1, 2, 1)\n",
"plt.plot(range(1, n_components+1), svd.explained_variance_ratio_, 'b-', linewidth=2)\n",
"plt.xlabel('Component Number', fontsize=12)\n",
"plt.ylabel('Explained Variance Ratio', fontsize=12)\n",
"plt.title('Variance Explained by Each SVD Component', fontsize=14, fontweight='bold')\n",
"plt.grid(alpha=0.3)\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.plot(range(1, n_components+1), cumsum_var, 'r-', linewidth=2)\n",
"plt.axhline(y=0.9, color='green', linestyle='--', label='90% variance')\n",
"plt.xlabel('Number of Components', fontsize=12)\n",
"plt.ylabel('Cumulative Variance Explained', fontsize=12)\n",
"plt.title('Cumulative Variance Explained', fontsize=14, fontweight='bold')\n",
"plt.legend()\n",
"plt.grid(alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.savefig('phase2_output/plots/01_svd_variance.png', dpi=300, bbox_inches='tight')\n",
"plt.show()\n",
"\n",
"print(\"\\n✅ Saved: phase2_output/plots/01_svd_variance.png\")"
]
},
{
|
||
"cell_type": "markdown",
|
||
"id": "c0c967c7",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 6️⃣ Combine All Features\n",
|
||
"\n",
|
||
"Combine the numeric features with the 50 text SVD features: 46 + 50 = 96 columns, of which the 4 non-numeric columns are dropped afterwards, leaving 92 features"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"id": "dc702312",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"🔗 COMBINING ALL FEATURES\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"1. Numeric features from original dataset:\n",
|
||
" Total columns in df: 56\n",
|
||
" Columns to drop: 10\n",
|
||
" Numeric features: 46\n",
|
||
"\n",
|
||
"2. Feature dimensions:\n",
|
||
" Numeric features: (454, 46)\n",
|
||
" Text SVD features: (454, 50)\n",
|
||
"\n",
|
||
"3. Combining features...\n",
|
||
"\n",
|
||
"✅ Features combined!\n",
|
||
" Final feature matrix: (454, 96)\n",
|
||
" Target vector: (454,)\n",
|
||
" Feature breakdown:\n",
|
||
" - Numeric features: 46\n",
|
||
" - Text SVD features: 50\n",
|
||
" - Total: 96\n",
|
||
"\n",
|
||
" Missing values: 550\n",
|
||
"\n",
|
||
"⚠️ Handling missing values...\n",
|
||
" ✓ Filled with 0\n",
|
||
"\n",
|
||
"✅ Saved: phase2_output/data/combined_features_with_text.csv\n",
|
||
"\n",
|
||
"📊 Sample combined features (first 5 rows, first 10 cols):\n",
|
||
" loai_ca bat_dau ket_thuc tong_gio_lam so_ca_cua_toa num_tasks \\\n",
|
||
"0 Part time 06:30:00 10:30:00 4.0 1 7.0 \n",
|
||
"1 Hành chính 06:30:00 16:00:00 8.0 1 0.0 \n",
|
||
"2 Hành chính 06:30:00 16:00:00 7.5 6 441.0 \n",
|
||
"3 Ca sáng 06:00:00 14:00:00 8.0 6 441.0 \n",
|
||
"4 Ca chiều 14:00:00 22:00:00 8.0 6 441.0 \n",
|
||
"\n",
|
||
" num_cleaning_tasks num_trash_collection_tasks num_monitoring_tasks \\\n",
|
||
"0 7.0 1.0 2.0 \n",
|
||
"1 0.0 0.0 0.0 \n",
|
||
"2 258.0 145.0 134.0 \n",
|
||
"3 258.0 145.0 134.0 \n",
|
||
"4 258.0 145.0 134.0 \n",
|
||
"\n",
|
||
" num_room_cleaning_tasks \n",
|
||
"0 0.0 \n",
|
||
"1 0.0 \n",
|
||
"2 65.0 \n",
|
||
"3 65.0 \n",
|
||
"4 65.0 \n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"🔗 COMBINING ALL FEATURES\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Columns to drop (non-feature columns)\n",
|
||
"cols_to_drop = [\n",
|
||
" 'ma_dia_diem',\n",
|
||
" 'so_luong', # Target\n",
|
||
" 'all_task_normal', # Original text\n",
|
||
" 'all_task_dinhky', # Original text\n",
|
||
" 'task_normal_clean', # Cleaned text\n",
|
||
" 'task_dinhky_clean', # Cleaned text\n",
|
||
" 'all_tasks_combined', # Combined text\n",
|
||
" 'text_length', # Temporary column\n",
|
||
" 'text_word_count', # Temporary column\n",
|
||
" 'ten_toa_thap' # Non-predictive\n",
|
||
"]\n",
|
||
"\n",
|
||
"# Get candidate feature columns (any remaining non-numeric columns are removed in a later cell)\n",
|
||
"numeric_cols = [col for col in df.columns if col not in cols_to_drop]\n",
|
||
"\n",
|
||
"print(f\"\\n1. Numeric features from original dataset:\")\n",
|
||
"print(f\" Total columns in df: {len(df.columns)}\")\n",
|
||
"print(f\" Columns to drop: {len(cols_to_drop)}\")\n",
|
||
"print(f\" Numeric features: {len(numeric_cols)}\")\n",
|
||
"\n",
|
||
"# Create feature matrix\n",
|
||
"X_numeric = df[numeric_cols].copy()\n",
|
||
"\n",
|
||
"print(f\"\\n2. Feature dimensions:\")\n",
|
||
"print(f\" Numeric features: {X_numeric.shape}\")\n",
|
||
"print(f\" Text SVD features: {text_features_df.shape}\")\n",
|
||
"\n",
|
||
"# Combine\n",
|
||
"print(\"\\n3. Combining features...\")\n",
|
||
"X_combined = pd.concat([X_numeric.reset_index(drop=True), text_features_df.reset_index(drop=True)], axis=1)\n",
|
||
"y = df['so_luong'].copy()\n",
|
||
"\n",
|
||
"print(f\"\\n✅ Features combined!\")\n",
|
||
"print(f\" Final feature matrix: {X_combined.shape}\")\n",
|
||
"print(f\" Target vector: {y.shape}\")\n",
|
||
"print(f\" Feature breakdown:\")\n",
|
||
"print(f\" - Numeric features: {len(numeric_cols)}\")\n",
|
||
"print(f\" - Text SVD features: {n_components}\")\n",
|
||
"print(f\" - Total: {X_combined.shape[1]}\")\n",
|
||
"\n",
|
||
"# Check for missing values\n",
|
||
"missing_count = X_combined.isnull().sum().sum()\n",
|
||
"print(f\"\\n Missing values: {missing_count}\")\n",
|
||
"\n",
|
||
"if missing_count > 0:\n",
|
||
" print(\"\\n⚠️ Handling missing values...\")\n",
|
||
" X_combined = X_combined.fillna(0)\n",
|
||
" print(\" ✓ Filled with 0\")\n",
|
||
"\n",
|
||
"# Save combined features\n",
|
||
"combined_with_target = pd.concat([X_combined, y.rename('so_luong')], axis=1)\n",
|
||
"combined_with_target.to_csv('phase2_output/data/combined_features_with_text.csv', index=False)\n",
|
||
"print(\"\\n✅ Saved: phase2_output/data/combined_features_with_text.csv\")\n",
|
||
"\n",
|
||
"# Sample\n",
|
||
"print(\"\\n📊 Sample combined features (first 5 rows, first 10 cols):\")\n",
|
||
"print(X_combined.iloc[:5, :10])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"id": "57f4e5d0",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"🔍 DEBUGGING - CHECKING DATA TYPES\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"📊 X_combined shape: (454, 96)\n",
|
||
"\n",
|
||
"🔢 Data types in X_combined:\n",
|
||
"float64 85\n",
|
||
"int64 7\n",
|
||
"object 4\n",
|
||
"Name: count, dtype: int64\n",
|
||
"\n",
|
||
"⚠️ Found 4 non-numeric columns:\n",
|
||
" - loai_ca: object (sample values: ['Part time' 'Hành chính' 'Ca sáng' 'Ca chiều' 'Ca đêm'])\n",
|
||
" - bat_dau: object (sample values: ['06:30:00' '06:00:00' '14:00:00' '2025-01-01 22:00:00' '12:00:00'])\n",
|
||
" - ket_thuc: object (sample values: ['10:30:00' '16:00:00' '14:00:00' '22:00:00' '2025-01-02 06:00:00'])\n",
|
||
" - muc_do_luu_luong: object (sample values: ['Trung bình (11–20 người)' 'Rất cao (Trên 40 người)' 'Thấp (6–10 người)'\n",
|
||
" 'Cao (21–40 người)' 0])\n",
|
||
"\n",
|
||
"🔧 These columns need to be removed or encoded before scaling!\n",
|
||
"\n",
|
||
"📋 First 20 columns in X_combined:\n",
|
||
"['loai_ca', 'bat_dau', 'ket_thuc', 'tong_gio_lam', 'so_ca_cua_toa', 'num_tasks', 'num_cleaning_tasks', 'num_trash_collection_tasks', 'num_monitoring_tasks', 'num_room_cleaning_tasks', 'num_deep_cleaning_tasks', 'num_maintenance_tasks', 'num_support_tasks', 'num_other_tasks', 'num_wc_tasks', 'num_hallway_tasks', 'num_lobby_tasks', 'num_patient_room_tasks', 'num_clinic_room_tasks', 'num_surgery_room_tasks']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 🔍 DEBUG: Check for non-numeric columns\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"🔍 DEBUGGING - CHECKING DATA TYPES\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"print(f\"\\n📊 X_combined shape: {X_combined.shape}\")\n",
|
||
"print(f\"\\n🔢 Data types in X_combined:\")\n",
|
||
"print(X_combined.dtypes.value_counts())\n",
|
||
"\n",
|
||
"# Find non-numeric columns\n",
|
||
"non_numeric_cols = X_combined.select_dtypes(include=['object', 'category']).columns.tolist()\n",
|
||
"\n",
|
||
"if len(non_numeric_cols) > 0:\n",
|
||
" print(f\"\\n⚠️ Found {len(non_numeric_cols)} non-numeric columns:\")\n",
|
||
" for col in non_numeric_cols:\n",
|
||
" unique_vals = X_combined[col].unique()[:5]\n",
|
||
" print(f\" - {col}: {X_combined[col].dtype} (sample values: {unique_vals})\")\n",
|
||
" \n",
|
||
" print(f\"\\n🔧 These columns need to be removed or encoded before scaling!\")\n",
|
||
"else:\n",
|
||
" print(f\"\\n✅ All columns are numeric!\")\n",
|
||
"\n",
|
||
"# Show first few column names\n",
|
||
"print(f\"\\n📋 First 20 columns in X_combined:\")\n",
|
||
"print(X_combined.columns[:20].tolist())\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"id": "e9ab7e56",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"🔧 REMOVING NON-NUMERIC COLUMNS\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"⚠️ Removing 4 non-numeric columns:\n",
|
||
" - loai_ca\n",
|
||
" - bat_dau\n",
|
||
" - ket_thuc\n",
|
||
" - muc_do_luu_luong\n",
|
||
"\n",
|
||
"✅ Removed non-numeric columns\n",
|
||
"📊 New X_combined shape: (454, 92)\n",
|
||
"\n",
|
||
"✅ All columns are now numeric!\n",
|
||
"📊 Final X_combined shape: (454, 92)\n",
|
||
"🔢 Data types: {dtype('float64'): 85, dtype('int64'): 7}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# 🔧 FIX: Remove non-numeric columns\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"🔧 REMOVING NON-NUMERIC COLUMNS\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Get non-numeric columns\n",
|
||
"non_numeric_cols = X_combined.select_dtypes(include=['object', 'category']).columns.tolist()\n",
|
||
"\n",
|
||
"if len(non_numeric_cols) > 0:\n",
|
||
" print(f\"\\n⚠️ Removing {len(non_numeric_cols)} non-numeric columns:\")\n",
|
||
" for col in non_numeric_cols:\n",
|
||
" print(f\" - {col}\")\n",
|
||
" \n",
|
||
" # Remove non-numeric columns\n",
|
||
" X_combined = X_combined.drop(columns=non_numeric_cols)\n",
|
||
" \n",
|
||
" print(f\"\\n✅ Removed non-numeric columns\")\n",
|
||
" print(f\"📊 New X_combined shape: {X_combined.shape}\")\n",
|
||
"else:\n",
|
||
" print(f\"\\n✅ No non-numeric columns to remove\")\n",
|
||
"\n",
|
||
"# Verify all columns are now numeric\n",
|
||
"remaining_non_numeric = X_combined.select_dtypes(include=['object', 'category']).columns.tolist()\n",
|
||
"if len(remaining_non_numeric) > 0:\n",
|
||
" print(f\"\\n❌ ERROR: Still have non-numeric columns: {remaining_non_numeric}\")\n",
|
||
"else:\n",
|
||
" print(f\"\\n✅ All columns are now numeric!\")\n",
|
||
" print(f\"📊 Final X_combined shape: {X_combined.shape}\")\n",
|
||
" print(f\"🔢 Data types: {X_combined.dtypes.value_counts().to_dict()}\")\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "96d75ad4",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 7️⃣ Train/Validation/Test Split\n",
|
||
"\n",
|
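||
"Split the data 70/15/15, stratifying on quantile bins of the target so that each split has a similar target distribution. Duplicate rows are removed first to limit data leakage"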
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"id": "bcb7d3c5",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"✂️ TRAIN/VALIDATION/TEST SPLIT\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"1. Checking duplicates...\n",
|
||
" Duplicates found: 120\n",
|
||
" ✓ Removed 120 duplicates\n",
|
||
" ✓ New dataset size: 413\n",
|
||
"\n",
|
||
"2. Creating stratified bins...\n",
|
||
" ✓ Created 4 bins\n",
|
||
" ✓ Samples per bin: {0: 171, 1: 95, 2: 80, 3: 67}\n",
|
||
"\n",
|
||
"3. Splitting data...\n",
|
||
"\n",
|
||
"✅ Split complete!\n",
|
||
" Train: 289 samples (70.0%)\n",
|
||
" Val: 62 samples (15.0%)\n",
|
||
" Test: 62 samples (15.0%)\n",
|
||
"\n",
|
||
"📊 Target distribution:\n",
|
||
" Train - Mean: 4.87, Std: 7.21, Var: 52.05\n",
|
||
" Val - Mean: 4.10, Std: 5.26, Var: 27.66\n",
|
||
" Test - Mean: 4.65, Std: 6.80, Var: 46.30\n",
|
||
"\n",
|
||
"4. Checking for data leakage...\n",
|
||
" Train & Test overlap: 23 rows ✓\n",
|
||
" ⚠️ WARNING: Data leakage detected!\n",
|
||
"\n",
|
||
"5. Scaling features...\n",
|
||
" ✓ Scaled all features using StandardScaler\n",
|
||
"\n",
|
||
"================================================================================\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"✂️ TRAIN/VALIDATION/TEST SPLIT\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Remove duplicates first\n",
|
||
"print(\"\\n1. Checking duplicates...\")\n",
|
||
"dups_before = pd.concat([X_combined, y], axis=1).duplicated().sum()  # count on X and y together, matching drop_duplicates() below\n",
|
||
"print(f\" Duplicates found: {dups_before}\")\n",
|
||
"\n",
|
||
"if dups_before > 0:\n",
|
||
" # Combine X and y before removing duplicates\n",
|
||
" combined_df = pd.concat([X_combined, y], axis=1)\n",
|
||
" combined_df = combined_df.drop_duplicates()\n",
|
||
" X_combined = combined_df.iloc[:, :-1]\n",
|
||
" y = combined_df.iloc[:, -1]\n",
|
||
" print(f\" ✓ Removed {dups_before} duplicates\")\n",
|
||
" print(f\" ✓ New dataset size: {len(X_combined)}\")\n",
|
||
"\n",
|
||
"# Stratified split using quantile bins\n",
|
||
"print(\"\\n2. Creating stratified bins...\")\n",
|
||
"n_bins = 5\n",
|
||
"try:\n",
|
||
" y_binned = pd.qcut(y, q=n_bins, labels=False, duplicates='drop')\n",
|
||
" bin_counts = y_binned.value_counts().sort_index()\n",
|
||
" print(f\" ✓ Created {len(y_binned.unique())} bins\")\n",
|
||
" print(f\" ✓ Samples per bin: {bin_counts.to_dict()}\")\n",
|
||
" use_stratify = bin_counts.min() >= 10\n",
|
||
"except Exception:\n",
|
||
" print(\" ⚠️ Cannot create bins, using random split\")\n",
|
||
" use_stratify = False\n",
|
||
"\n",
|
||
"# Split: 70% train, 15% val, 15% test\n",
|
||
"print(\"\\n3. Splitting data...\")\n",
|
||
"\n",
|
||
"if use_stratify:\n",
|
||
" X_train, X_temp, y_train, y_temp = train_test_split(\n",
|
||
" X_combined, y, test_size=0.30, random_state=RANDOM_STATE, stratify=y_binned\n",
|
||
" )\n",
|
||
" \n",
|
||
" # Split temp into val and test\n",
|
||
" try:\n",
|
||
" y_temp_binned = pd.qcut(y_temp, q=3, labels=False, duplicates='drop')\n",
|
||
" X_val, X_test, y_val, y_test = train_test_split(\n",
|
||
" X_temp, y_temp, test_size=0.50, random_state=RANDOM_STATE, stratify=y_temp_binned\n",
|
||
" )\n",
|
||
"    except Exception:\n",
|
||
" X_val, X_test, y_val, y_test = train_test_split(\n",
|
||
" X_temp, y_temp, test_size=0.50, random_state=RANDOM_STATE\n",
|
||
" )\n",
|
||
"else:\n",
|
||
" X_train, X_temp, y_train, y_temp = train_test_split(\n",
|
||
" X_combined, y, test_size=0.30, random_state=RANDOM_STATE\n",
|
||
" )\n",
|
||
" X_val, X_test, y_val, y_test = train_test_split(\n",
|
||
" X_temp, y_temp, test_size=0.50, random_state=RANDOM_STATE\n",
|
||
" )\n",
|
||
"\n",
|
||
"print(f\"\\n✅ Split complete!\")\n",
|
||
"print(f\" Train: {len(X_train)} samples ({len(X_train)/len(X_combined)*100:.1f}%)\")\n",
|
||
"print(f\" Val: {len(X_val)} samples ({len(X_val)/len(X_combined)*100:.1f}%)\")\n",
|
||
"print(f\" Test: {len(X_test)} samples ({len(X_test)/len(X_combined)*100:.1f}%)\")\n",
|
||
"\n",
|
||
"print(f\"\\n📊 Target distribution:\")\n",
|
||
"print(f\" Train - Mean: {y_train.mean():.2f}, Std: {y_train.std():.2f}, Var: {y_train.var():.2f}\")\n",
|
||
"print(f\" Val - Mean: {y_val.mean():.2f}, Std: {y_val.std():.2f}, Var: {y_val.var():.2f}\")\n",
|
||
"print(f\" Test - Mean: {y_test.mean():.2f}, Std: {y_test.std():.2f}, Var: {y_test.var():.2f}\")\n",
|
||
"\n",
|
||
"# Check for data leakage\n",
|
||
"print(\"\\n4. Checking for data leakage...\")\n",
|
||
"train_test_overlap = pd.merge(X_train.reset_index(drop=True), X_test.reset_index(drop=True), how='inner')\n",
|
||
"print(f\"   Train & Test overlap: {len(train_test_overlap)} rows\")\n",
|
||
"\n",
|
||
"if len(train_test_overlap) > 0:\n",
|
||
" print(\" ⚠️ WARNING: Data leakage detected!\")\n",
|
||
"else:\n",
|
||
" print(\" ✅ No data leakage!\")\n",
|
||
"\n",
|
||
"# Feature Scaling\n",
|
||
"print(\"\\n5. Scaling features...\")\n",
|
||
"scaler = StandardScaler()\n",
|
||
"\n",
|
||
"X_train_scaled = X_train.copy()\n",
|
||
"X_val_scaled = X_val.copy()\n",
|
||
"X_test_scaled = X_test.copy()\n",
|
||
"\n",
|
||
"X_train_scaled[X_train.columns] = scaler.fit_transform(X_train)\n",
|
||
"X_val_scaled[X_val.columns] = scaler.transform(X_val)\n",
|
||
"X_test_scaled[X_test.columns] = scaler.transform(X_test)\n",
|
||
"\n",
|
||
"print(f\" ✓ Scaled all features using StandardScaler\")\n",
|
||
"\n",
|
||
"print(\"\\n\" + \"=\" * 80)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e0084db5",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 8️⃣ Model Training with Text Features\n",
|
||
"\n",
|
||
"Train all models on the 92 features (42 numeric + 50 text SVD)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"id": "2df5c047",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"🤖 MODEL TRAINING - PHASE 2 (WITH TEXT FEATURES)\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"1️⃣ Baseline (Mean Prediction)...\n",
|
||
" Val MAE: 3.7132, R²: 0.0000\n",
|
||
"\n",
|
||
"2️⃣ Linear Regression...\n",
|
||
" Val MAE: 4.3488, R²: -0.4161\n",
|
||
"\n",
|
||
"3️⃣ Decision Tree...\n",
|
||
" Val MAE: 2.5525, R²: 0.3339\n",
|
||
"\n",
|
||
"4️⃣ Random Forest...\n",
|
||
" Val MAE: 2.8388, R²: 0.2521\n",
|
||
"\n",
|
||
"5️⃣ Gradient Boosting...\n",
|
||
" Val MAE: 2.7644, R²: 0.1917\n",
|
||
"\n",
|
||
"6️⃣ XGBoost...\n",
|
||
" Val MAE: 2.7488, R²: 0.1542\n",
|
||
"\n",
|
||
"7️⃣ LightGBM...\n",
|
||
" Val MAE: 3.2101, R²: 0.0009\n",
|
||
"\n",
|
||
"✅ All models trained!\n",
|
||
"\n",
|
||
"================================================================================\n",
|
||
"📊 PHASE 2 RESULTS (WITH TEXT FEATURES)\n",
|
||
"================================================================================\n",
|
||
" Model Train_R2 Train_MAE Train_RMSE Val_R2 Val_MAE Val_RMSE\n",
|
||
" Decision Tree 0.740968 2.012207 3.665671 0.333863 2.552472 4.257920\n",
|
||
" Random Forest 0.733891 2.083618 3.715410 0.252113 2.838840 4.511633\n",
|
||
"Gradient Boosting 0.873662 1.078451 2.560020 0.191742 2.764354 4.690192\n",
|
||
" XGBoost 0.865798 1.272851 2.638496 0.154215 2.748771 4.797840\n",
|
||
" LightGBM 0.696963 2.180261 3.964830 0.000899 3.210078 5.214593\n",
|
||
" Baseline 0.000000 4.483567 7.202397 0.000000 3.713249 5.273205\n",
|
||
"Linear Regression 0.610783 2.897918 4.493377 -0.416064 4.348784 6.208078\n",
|
||
"\n",
|
||
"🏆 Best Model: Decision Tree\n",
|
||
" Val R²: 0.3339\n",
|
||
" Val MAE: 2.5525\n",
|
||
" Val RMSE: 4.2579\n",
|
||
"\n",
|
||
"✅ Saved: phase2_output/data/phase2_model_results.csv\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"🤖 MODEL TRAINING - PHASE 2 (WITH TEXT FEATURES)\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"def evaluate_model_phase2(model, X_train, y_train, X_val, y_val, model_name):\n",
|
||
" \"\"\"Train and evaluate model\"\"\"\n",
|
||
" model.fit(X_train, y_train)\n",
|
||
" \n",
|
||
" y_train_pred = model.predict(X_train)\n",
|
||
" y_val_pred = model.predict(X_val)\n",
|
||
" \n",
|
||
" return {\n",
|
||
" 'Model': model_name,\n",
|
||
" 'Train_R2': r2_score(y_train, y_train_pred),\n",
|
||
" 'Train_MAE': mean_absolute_error(y_train, y_train_pred),\n",
|
||
" 'Train_RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),\n",
|
||
" 'Val_R2': r2_score(y_val, y_val_pred),\n",
|
||
" 'Val_MAE': mean_absolute_error(y_val, y_val_pred),\n",
|
||
" 'Val_RMSE': np.sqrt(mean_squared_error(y_val, y_val_pred))\n",
|
||
" }, model\n",
|
||
"\n",
|
||
"results_phase2 = []\n",
|
||
"models_phase2 = {}\n",
|
||
"\n",
|
||
"# 1. Baseline\n",
|
||
"print(\"\\n1️⃣ Baseline (Mean Prediction)...\")\n",
|
||
"mean_pred = np.full(len(y_val), y_train.mean())\n",
|
||
"baseline_metrics = {\n",
|
||
" 'Model': 'Baseline',\n",
|
||
" 'Train_R2': 0.0,\n",
|
||
" 'Train_MAE': mean_absolute_error(y_train, [y_train.mean()]*len(y_train)),\n",
|
||
" 'Train_RMSE': np.sqrt(mean_squared_error(y_train, [y_train.mean()]*len(y_train))),\n",
|
||
"    'Val_R2': r2_score(y_val, mean_pred),  # the train-mean prediction does not give exactly 0 on validation data\n",
|
||
" 'Val_MAE': mean_absolute_error(y_val, mean_pred),\n",
|
||
" 'Val_RMSE': np.sqrt(mean_squared_error(y_val, mean_pred))\n",
|
||
"}\n",
|
||
"results_phase2.append(baseline_metrics)\n",
|
||
"print(f\" Val MAE: {baseline_metrics['Val_MAE']:.4f}, R²: {baseline_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 2. Linear Regression\n",
|
||
"print(\"\\n2️⃣ Linear Regression...\")\n",
|
||
"lr = LinearRegression()\n",
|
||
"lr_metrics, lr_model = evaluate_model_phase2(lr, X_train_scaled, y_train, X_val_scaled, y_val, \"Linear Regression\")\n",
|
||
"results_phase2.append(lr_metrics)\n",
|
||
"models_phase2['Linear Regression'] = lr_model\n",
|
||
"print(f\" Val MAE: {lr_metrics['Val_MAE']:.4f}, R²: {lr_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 3. Decision Tree\n",
|
||
"print(\"\\n3️⃣ Decision Tree...\")\n",
|
||
"dt = DecisionTreeRegressor(max_depth=10, min_samples_split=10, random_state=RANDOM_STATE)\n",
|
||
"dt_metrics, dt_model = evaluate_model_phase2(dt, X_train_scaled, y_train, X_val_scaled, y_val, \"Decision Tree\")\n",
|
||
"results_phase2.append(dt_metrics)\n",
|
||
"models_phase2['Decision Tree'] = dt_model\n",
|
||
"print(f\" Val MAE: {dt_metrics['Val_MAE']:.4f}, R²: {dt_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 4. Random Forest\n",
|
||
"print(\"\\n4️⃣ Random Forest...\")\n",
|
||
"rf = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_split=5,\n",
|
||
" random_state=RANDOM_STATE, n_jobs=-1)\n",
|
||
"rf_metrics, rf_model = evaluate_model_phase2(rf, X_train_scaled, y_train, X_val_scaled, y_val, \"Random Forest\")\n",
|
||
"results_phase2.append(rf_metrics)\n",
|
||
"models_phase2['Random Forest'] = rf_model\n",
|
||
"print(f\" Val MAE: {rf_metrics['Val_MAE']:.4f}, R²: {rf_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 5. Gradient Boosting\n",
|
||
"print(\"\\n5️⃣ Gradient Boosting...\")\n",
|
||
"gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5,\n",
|
||
" random_state=RANDOM_STATE)\n",
|
||
"gb_metrics, gb_model = evaluate_model_phase2(gb, X_train_scaled, y_train, X_val_scaled, y_val, \"Gradient Boosting\")\n",
|
||
"results_phase2.append(gb_metrics)\n",
|
||
"models_phase2['Gradient Boosting'] = gb_model\n",
|
||
"print(f\" Val MAE: {gb_metrics['Val_MAE']:.4f}, R²: {gb_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 6. XGBoost\n",
|
||
"print(\"\\n6️⃣ XGBoost...\")\n",
|
||
"xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=5,\n",
|
||
" random_state=RANDOM_STATE, n_jobs=-1)\n",
|
||
"xgb_metrics, xgb_trained = evaluate_model_phase2(xgb_model, X_train_scaled, y_train, X_val_scaled, y_val, \"XGBoost\")\n",
|
||
"results_phase2.append(xgb_metrics)\n",
|
||
"models_phase2['XGBoost'] = xgb_trained\n",
|
||
"print(f\" Val MAE: {xgb_metrics['Val_MAE']:.4f}, R²: {xgb_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"# 7. LightGBM\n",
|
||
"print(\"\\n7️⃣ LightGBM...\")\n",
|
||
"lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=5,\n",
|
||
" num_leaves=31, random_state=RANDOM_STATE, n_jobs=-1, verbose=-1)\n",
|
||
"lgb_metrics, lgb_trained = evaluate_model_phase2(lgb_model, X_train_scaled, y_train, X_val_scaled, y_val, \"LightGBM\")\n",
|
||
"results_phase2.append(lgb_metrics)\n",
|
||
"models_phase2['LightGBM'] = lgb_trained\n",
|
||
"print(f\" Val MAE: {lgb_metrics['Val_MAE']:.4f}, R²: {lgb_metrics['Val_R2']:.4f}\")\n",
|
||
"\n",
|
||
"print(\"\\n✅ All models trained!\")\n",
|
||
"\n",
|
||
"# Results DataFrame\n",
|
||
"results_df_phase2 = pd.DataFrame(results_phase2).sort_values('Val_R2', ascending=False)\n",
|
||
"\n",
|
||
"print(\"\\n\" + \"=\" * 80)\n",
|
||
"print(\"📊 PHASE 2 RESULTS (WITH TEXT FEATURES)\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(results_df_phase2.to_string(index=False))\n",
|
||
"\n",
|
||
"# Best model\n",
|
||
"best_model_phase2 = results_df_phase2.iloc[0]['Model']\n",
|
||
"print(f\"\\n🏆 Best Model: {best_model_phase2}\")\n",
|
||
"print(f\" Val R²: {results_df_phase2.iloc[0]['Val_R2']:.4f}\")\n",
|
||
"print(f\" Val MAE: {results_df_phase2.iloc[0]['Val_MAE']:.4f}\")\n",
|
||
"print(f\" Val RMSE: {results_df_phase2.iloc[0]['Val_RMSE']:.4f}\")\n",
|
||
"\n",
|
||
"# Save results\n",
|
||
"results_df_phase2.to_csv('phase2_output/data/phase2_model_results.csv', index=False)\n",
|
||
"print(\"\\n✅ Saved: phase2_output/data/phase2_model_results.csv\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "6b39614a",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 9️⃣ Phase 1 vs Phase 2 Comparison\n",
|
||
"\n",
|
||
"**Phase 1:** numeric features only (best Val R² = 0.4136, Decision Tree)  \n",
|
||
"**Phase 2:** 42 numeric + 50 text SVD features (R² = ?)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"id": "440873ac",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"📊 PHASE 1 vs PHASE 2 COMPARISON\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"📊 DETAILED COMPARISON:\n",
|
||
"================================================================================\n",
|
||
" Model Phase1_R2 Phase2_R2 R2_Improvement R2_Improvement_% Phase1_MAE Phase2_MAE MAE_Improvement MAE_Improvement_%\n",
|
||
" Decision Tree 0.4136 0.333863 -0.079737 -19.278797 3.3841 2.552472 0.831628 24.574560\n",
|
||
" Random Forest 0.3093 0.252113 -0.057187 -18.489238 3.5571 2.838840 0.718260 20.192294\n",
|
||
"Gradient Boosting 0.3981 0.191742 -0.206358 -51.835697 3.0356 2.764354 0.271246 8.935501\n",
|
||
" XGBoost 0.3225 0.154215 -0.168285 -52.181476 3.2599 2.748771 0.511129 15.679271\n",
|
||
" LightGBM 0.2948 0.000899 -0.293901 -99.695122 3.6027 3.210078 0.392622 10.898002\n",
|
||
" Baseline 0.0000 0.000000 0.000000 0.000000 5.1082 3.713249 1.394951 27.308068\n",
|
||
"Linear Regression -0.0820 -0.416064 -0.334064 -407.395704 4.9313 4.348784 0.582516 11.812616\n",
|
||
"\n",
|
||
"================================================================================\n",
|
||
"📈 IMPROVEMENT SUMMARY\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"Average R² Score:\n",
|
||
" Phase 1 (Numeric only): 0.2366\n",
|
||
" Phase 2 (With text): 0.0738\n",
|
||
" Improvement: -0.1628 (-68.8%)\n",
|
||
"\n",
|
||
"Average MAE:\n",
|
||
" Phase 1 (Numeric only): 3.8398\n",
|
||
" Phase 2 (With text): 3.1681\n",
|
||
" Improvement: +0.6718 (+17.5%)\n",
|
||
"\n",
|
||
"🏆 Best Models:\n",
|
||
" Phase 1: Decision Tree (R²=0.4136, MAE=3.3841)\n",
|
||
" Phase 2: Decision Tree (R²=0.3339, MAE=2.5525)\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 1600x600 with 2 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"✅ Saved: phase2_output/plots/02_phase1_vs_phase2_comparison.png\n",
|
||
"✅ Saved: phase2_output/data/phase1_vs_phase2_comparison.csv\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"📊 PHASE 1 vs PHASE 2 COMPARISON\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Phase 1 results (from ML_Training_Pipeline.ipynb with FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx)\n",
|
||
"# ⚠️ UPDATED with correct results from the same dataset!\n",
|
||
"phase1_results = {\n",
|
||
" 'Baseline': {'Val_R2': 0.0000, 'Val_MAE': 5.1082, 'Val_RMSE': 9.7433},\n",
|
||
" 'Linear Regression': {'Val_R2': -0.0820, 'Val_MAE': 4.9313, 'Val_RMSE': 10.1314},\n",
|
||
" 'Decision Tree': {'Val_R2': 0.4136, 'Val_MAE': 3.3841, 'Val_RMSE': 7.4583},\n",
|
||
" 'Random Forest': {'Val_R2': 0.3093, 'Val_MAE': 3.5571, 'Val_RMSE': 8.0948},\n",
|
||
" 'Gradient Boosting': {'Val_R2': 0.3981, 'Val_MAE': 3.0356, 'Val_RMSE': 7.5565},\n",
|
||
" 'XGBoost': {'Val_R2': 0.3225, 'Val_MAE': 3.2599, 'Val_RMSE': 8.0168},\n",
|
||
" 'LightGBM': {'Val_R2': 0.2948, 'Val_MAE': 3.6027, 'Val_RMSE': 8.1789}\n",
|
||
"}\n",
|
||
"\n",
|
||
"# Create comparison table\n",
|
||
"comparison_data = []\n",
|
||
"for model_name in results_df_phase2['Model'].values:\n",
|
||
" phase2_metrics = results_df_phase2[results_df_phase2['Model'] == model_name].iloc[0]\n",
|
||
" \n",
|
||
" if model_name in phase1_results:\n",
|
||
" phase1_metrics = phase1_results[model_name]\n",
|
||
" r2_improvement = phase2_metrics['Val_R2'] - phase1_metrics['Val_R2']\n",
|
||
" mae_improvement = phase1_metrics['Val_MAE'] - phase2_metrics['Val_MAE']\n",
|
||
" \n",
|
||
" comparison_data.append({\n",
|
||
" 'Model': model_name,\n",
|
||
" 'Phase1_R2': phase1_metrics['Val_R2'],\n",
|
||
" 'Phase2_R2': phase2_metrics['Val_R2'],\n",
|
||
" 'R2_Improvement': r2_improvement,\n",
|
||
" 'R2_Improvement_%': (r2_improvement / max(abs(phase1_metrics['Val_R2']), 0.01)) * 100,\n",
|
||
" 'Phase1_MAE': phase1_metrics['Val_MAE'],\n",
|
||
" 'Phase2_MAE': phase2_metrics['Val_MAE'],\n",
|
||
" 'MAE_Improvement': mae_improvement,\n",
|
||
" 'MAE_Improvement_%': (mae_improvement / phase1_metrics['Val_MAE']) * 100\n",
|
||
" })\n",
|
||
"\n",
|
||
"comparison_df = pd.DataFrame(comparison_data)\n",
|
||
"\n",
|
||
"print(\"\\n📊 DETAILED COMPARISON:\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(comparison_df.to_string(index=False))\n",
|
||
"\n",
|
||
"# Summary statistics\n",
|
||
"print(\"\\n\" + \"=\" * 80)\n",
|
||
"print(\"📈 IMPROVEMENT SUMMARY\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"avg_r2_phase1 = comparison_df['Phase1_R2'].mean()\n",
|
||
"avg_r2_phase2 = comparison_df['Phase2_R2'].mean()\n",
|
||
"avg_mae_phase1 = comparison_df['Phase1_MAE'].mean()\n",
|
||
"avg_mae_phase2 = comparison_df['Phase2_MAE'].mean()\n",
|
||
"\n",
|
||
"print(f\"\\nAverage R² Score:\")\n",
|
||
"print(f\" Phase 1 (Numeric only): {avg_r2_phase1:.4f}\")\n",
|
||
"print(f\" Phase 2 (With text): {avg_r2_phase2:.4f}\")\n",
|
||
"print(f\" Improvement: {avg_r2_phase2 - avg_r2_phase1:+.4f} ({(avg_r2_phase2 - avg_r2_phase1) / max(abs(avg_r2_phase1), 0.01) * 100:+.1f}%)\")\n",
|
||
"\n",
|
||
"print(f\"\\nAverage MAE:\")\n",
|
||
"print(f\" Phase 1 (Numeric only): {avg_mae_phase1:.4f}\")\n",
|
||
"print(f\" Phase 2 (With text): {avg_mae_phase2:.4f}\")\n",
|
||
"print(f\" Improvement: {avg_mae_phase1 - avg_mae_phase2:+.4f} ({(avg_mae_phase1 - avg_mae_phase2) / avg_mae_phase1 * 100:+.1f}%)\")\n",
|
||
"\n",
|
||
"# Best models comparison\n",
|
||
"best_phase1 = max(phase1_results.items(), key=lambda x: x[1]['Val_R2'])\n",
|
||
"best_phase2_row = results_df_phase2.iloc[0]\n",
|
||
"\n",
|
||
"print(f\"\\n🏆 Best Models:\")\n",
|
||
"print(f\" Phase 1: {best_phase1[0]} (R²={best_phase1[1]['Val_R2']:.4f}, MAE={best_phase1[1]['Val_MAE']:.4f})\")\n",
|
||
"print(f\" Phase 2: {best_phase2_row['Model']} (R²={best_phase2_row['Val_R2']:.4f}, MAE={best_phase2_row['Val_MAE']:.4f})\")\n",
|
||
"\n",
|
||
"# Visualization\n",
|
||
"fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
|
||
"\n",
|
||
"# R² Comparison\n",
|
||
"models = comparison_df['Model'].values\n",
|
||
"x_pos = np.arange(len(models))\n",
|
||
"width = 0.35\n",
|
||
"\n",
|
||
"axes[0].bar(x_pos - width/2, comparison_df['Phase1_R2'], width, label='Phase 1 (Numeric only)', alpha=0.8, color='coral')\n",
|
||
"axes[0].bar(x_pos + width/2, comparison_df['Phase2_R2'], width, label='Phase 2 (With text)', alpha=0.8, color='skyblue')\n",
|
||
"axes[0].set_xlabel('Models', fontsize=12)\n",
|
||
"axes[0].set_ylabel('Validation R²', fontsize=12)\n",
|
||
"axes[0].set_title('R² Comparison: Phase 1 vs Phase 2', fontsize=14, fontweight='bold')\n",
|
||
"axes[0].set_xticks(x_pos)\n",
|
||
"axes[0].set_xticklabels(models, rotation=45, ha='right')\n",
|
||
"axes[0].axhline(y=0, color='gray', linestyle='--', linewidth=1)\n",
|
||
"axes[0].legend()\n",
|
||
"axes[0].grid(alpha=0.3, axis='y')\n",
|
||
"\n",
|
||
"# MAE Comparison\n",
|
||
"axes[1].bar(x_pos - width/2, comparison_df['Phase1_MAE'], width, label='Phase 1 (Numeric only)', alpha=0.8, color='coral')\n",
|
||
"axes[1].bar(x_pos + width/2, comparison_df['Phase2_MAE'], width, label='Phase 2 (With text)', alpha=0.8, color='skyblue')\n",
|
||
"axes[1].set_xlabel('Models', fontsize=12)\n",
|
||
"axes[1].set_ylabel('Validation MAE', fontsize=12)\n",
|
||
"axes[1].set_title('MAE Comparison: Phase 1 vs Phase 2', fontsize=14, fontweight='bold')\n",
|
||
"axes[1].set_xticks(x_pos)\n",
|
||
"axes[1].set_xticklabels(models, rotation=45, ha='right')\n",
|
||
"axes[1].legend()\n",
|
||
"axes[1].grid(alpha=0.3, axis='y')\n",
|
||
"\n",
|
||
"plt.tight_layout()\n",
|
||
"plt.savefig('phase2_output/plots/02_phase1_vs_phase2_comparison.png', dpi=300, bbox_inches='tight')\n",
|
||
"plt.show()\n",
|
||
"\n",
|
||
"print(\"\\n✅ Saved: phase2_output/plots/02_phase1_vs_phase2_comparison.png\")\n",
|
||
"\n",
|
||
"# Save comparison\n",
|
||
"comparison_df.to_csv('phase2_output/data/phase1_vs_phase2_comparison.csv', index=False)\n",
|
||
"print(\"✅ Saved: phase2_output/data/phase1_vs_phase2_comparison.csv\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9e1fe71d",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 🔟 Test Set Evaluation\n",
|
||
"\n",
|
||
"Evaluate the best model on the test set."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"id": "7d452daf",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"TEST SET EVALUATION - Decision Tree\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"📊 Results:\n",
|
||
"Metric Validation Test Difference \n",
|
||
"------------------------------------------------------------\n",
|
||
"R² 0.3339 0.2893 0.0446 \n",
|
||
"MAE 2.5525 3.5730 1.0205 \n",
|
||
"RMSE 4.2579 5.6899 1.4320 \n",
|
||
"\n",
|
||
"🎯 Consistency Check:\n",
|
||
" ✓ GOOD! Val and Test R² are reasonably consistent (13.4% difference)\n"
|
||
]
|
||
},
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 1800x500 with 3 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
},
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\n",
|
||
"✅ Saved: phase2_output/plots/03_test_set_evaluation.png\n",
|
||
"✅ Saved: phase2_output/data/test_predictions.csv\n",
|
||
"\n",
|
||
"================================================================================\n",
|
||
"🎉 PHASE 2 COMPLETE - FINAL SUMMARY\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"📊 Dataset:\n",
|
||
" Total samples: 454\n",
|
||
" Features: 92 (47 numeric + 50 text SVD)\n",
|
||
" Train/Val/Test: 289/62/62\n",
|
||
"\n",
|
||
"🏆 Best Model: Decision Tree\n",
|
||
" Validation R²: 0.3339\n",
|
||
" Test R²: 0.2893\n",
|
||
" Validation MAE: 2.5525\n",
|
||
" Test MAE: 3.5730\n",
|
||
"\n",
|
||
"📈 Improvement over Phase 1:\n",
|
||
" R² gain: -0.0797 (-19.3%)\n",
|
||
" MAE gain: +0.8316 (+24.6%)\n",
|
||
"\n",
|
||
"💾 Outputs saved to:\n",
|
||
" phase2_output/data/\n",
|
||
" phase2_output/plots/\n",
|
||
" phase2_output/models/\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Evaluate best model on test set\n",
|
||
"best_model = models_phase2[best_model_phase2]\n",
|
||
"\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(f\"TEST SET EVALUATION - {best_model_phase2}\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Predictions\n",
|
||
"y_test_pred = best_model.predict(X_test_scaled)\n",
|
||
"\n",
|
||
"# Metrics\n",
|
||
"test_r2 = r2_score(y_test, y_test_pred)\n",
|
||
"test_mae = mean_absolute_error(y_test, y_test_pred)\n",
|
||
"test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))\n",
|
||
"\n",
|
||
"val_r2 = results_df_phase2[results_df_phase2['Model'] == best_model_phase2]['Val_R2'].values[0]\n",
|
||
"val_mae = results_df_phase2[results_df_phase2['Model'] == best_model_phase2]['Val_MAE'].values[0]\n",
|
||
"val_rmse = results_df_phase2[results_df_phase2['Model'] == best_model_phase2]['Val_RMSE'].values[0]\n",
|
||
"\n",
|
||
"print(f\"\\n📊 Results:\")\n",
|
||
"print(f\"{'Metric':<15} {'Validation':<15} {'Test':<15} {'Difference':<15}\")\n",
|
||
"print(\"-\" * 60)\n",
|
||
"print(f\"{'R²':<15} {val_r2:<15.4f} {test_r2:<15.4f} {abs(test_r2 - val_r2):<15.4f}\")\n",
|
||
"print(f\"{'MAE':<15} {val_mae:<15.4f} {test_mae:<15.4f} {abs(test_mae - val_mae):<15.4f}\")\n",
|
||
"print(f\"{'RMSE':<15} {val_rmse:<15.4f} {test_rmse:<15.4f} {abs(test_rmse - val_rmse):<15.4f}\")\n",
|
||
"\n",
|
||
"# Consistency check\n",
|
||
"r2_diff_pct = abs(test_r2 - val_r2) / max(abs(val_r2), 0.01) * 100\n",
|
||
"\n",
|
||
"print(f\"\\n🎯 Consistency Check:\")\n",
|
||
"if r2_diff_pct < 10:\n",
|
||
" print(f\" ✅ EXCELLENT! Val and Test R² are very consistent ({r2_diff_pct:.1f}% difference)\")\n",
|
||
"elif r2_diff_pct < 20:\n",
|
||
" print(f\" ✓ GOOD! Val and Test R² are reasonably consistent ({r2_diff_pct:.1f}% difference)\")\n",
|
||
"else:\n",
|
||
" print(f\" ⚠️ Val and Test R² show significant difference ({r2_diff_pct:.1f}%)\")\n",
|
||
"\n",
|
||
"# Visualization\n",
|
||
"fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
|
||
"\n",
|
||
"# Scatter plot\n",
|
||
"axes[0].scatter(y_test, y_test_pred, alpha=0.6, edgecolors='k')\n",
|
||
"axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)\n",
|
||
"axes[0].set_xlabel('Actual', fontsize=12)\n",
|
||
"axes[0].set_ylabel('Predicted', fontsize=12)\n",
|
||
"axes[0].set_title(f'{best_model_phase2} - Test Set\\nPredicted vs Actual', fontsize=14)\n",
|
||
"axes[0].grid(True, alpha=0.3)\n",
|
||
"\n",
|
||
"# Residuals\n",
|
||
"residuals = y_test - y_test_pred\n",
|
||
"axes[1].scatter(y_test_pred, residuals, alpha=0.6, edgecolors='k')\n",
|
||
"axes[1].axhline(y=0, color='r', linestyle='--', lw=2)\n",
|
||
"axes[1].set_xlabel('Predicted', fontsize=12)\n",
|
||
"axes[1].set_ylabel('Residuals', fontsize=12)\n",
|
||
"axes[1].set_title('Residual Plot', fontsize=14)\n",
|
||
"axes[1].grid(True, alpha=0.3)\n",
|
||
"\n",
|
||
"# Residuals distribution\n",
|
||
"axes[2].hist(residuals, bins=30, edgecolor='black', alpha=0.7)\n",
|
||
"axes[2].axvline(x=0, color='r', linestyle='--', lw=2)\n",
|
||
"axes[2].set_xlabel('Residuals', fontsize=12)\n",
|
||
"axes[2].set_ylabel('Frequency', fontsize=12)\n",
|
||
"axes[2].set_title('Distribution of Residuals', fontsize=14)\n",
|
||
"axes[2].grid(True, alpha=0.3)\n",
|
||
"\n",
|
||
"plt.tight_layout()\n",
|
||
"plt.savefig('phase2_output/plots/03_test_set_evaluation.png', dpi=300, bbox_inches='tight')\n",
|
||
"plt.show()\n",
|
||
"\n",
|
||
"print(\"\\n✅ Saved: phase2_output/plots/03_test_set_evaluation.png\")\n",
|
||
"\n",
|
||
"# Save test predictions\n",
|
||
"test_results = pd.DataFrame({\n",
|
||
" 'actual': y_test,\n",
|
||
" 'predicted': y_test_pred,\n",
|
||
" 'error': residuals,\n",
|
||
" 'abs_error': np.abs(residuals)\n",
|
||
"}).sort_values('abs_error', ascending=False)\n",
|
||
"\n",
|
||
"test_results.to_csv('phase2_output/data/test_predictions.csv', index=False)\n",
|
||
"print(\"✅ Saved: phase2_output/data/test_predictions.csv\")\n",
|
||
"\n",
|
||
"# Final Summary\n",
|
||
"print(\"\\n\" + \"=\" * 80)\n",
|
||
"print(\"🎉 PHASE 2 COMPLETE - FINAL SUMMARY\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"print(f\"\\n📊 Dataset:\")\n",
|
||
"print(f\" Total samples: {len(df)}\")\n",
|
||
"print(f\"   Features: {X_combined.shape[1]} ({len(numeric_cols)} numeric + {n_components} text SVD)\")\n",
|
||
"print(f\" Train/Val/Test: {len(X_train)}/{len(X_val)}/{len(X_test)}\")\n",
|
||
"\n",
|
||
"print(f\"\\n🏆 Best Model: {best_model_phase2}\")\n",
|
||
"print(f\" Validation R²: {val_r2:.4f}\")\n",
|
||
"print(f\" Test R²: {test_r2:.4f}\")\n",
|
||
"print(f\" Validation MAE: {val_mae:.4f}\")\n",
|
||
"print(f\" Test MAE: {test_mae:.4f}\")\n",
|
||
"\n",
|
||
"print(f\"\\n📈 Improvement over Phase 1:\")\n",
|
||
"if best_model_phase2 in phase1_results:\n",
|
||
" phase1_best = phase1_results[best_model_phase2]\n",
|
||
" r2_gain = val_r2 - phase1_best['Val_R2']\n",
|
||
" mae_gain = phase1_best['Val_MAE'] - val_mae\n",
|
||
" print(f\" R² gain: {r2_gain:+.4f} ({r2_gain / max(abs(phase1_best['Val_R2']), 0.01) * 100:+.1f}%)\")\n",
|
||
" print(f\" MAE gain: {mae_gain:+.4f} ({mae_gain / phase1_best['Val_MAE'] * 100:+.1f}%)\")\n",
|
||
"\n",
|
||
"print(f\"\\n💾 Outputs saved to:\")\n",
|
||
"print(f\" phase2_output/data/\")\n",
|
||
"print(f\" phase2_output/plots/\")\n",
|
||
"print(f\" phase2_output/models/\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bc294335",
|
||
"metadata": {},
|
||
"source": [
|
||
"# 1️⃣1️⃣ Save Best Model\n",
|
||
"\n",
|
||
"Save the best model together with the scaler and the other artifacts needed for deployment."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"id": "b98dcbbb",
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"================================================================================\n",
|
||
"💾 SAVING BEST MODEL & ARTIFACTS\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"✅ Saved best model: phase2_output/models/best_model_Decision_Tree_20260106_011633.pkl\n",
|
||
" Model type: Decision Tree\n",
|
||
" Val R²: 0.3339\n",
|
||
" Test R²: 0.2893\n",
|
||
"\n",
|
||
"✅ Saved scaler: phase2_output/models/scaler_20260106_011633.pkl\n",
|
||
"\n",
|
||
"✅ Saved TF-IDF vectorizer: phase2_output/models/tfidf_vectorizer_20260106_011633.pkl\n",
|
||
"\n",
|
||
"✅ Saved SVD model: phase2_output/models/svd_model_20260106_011633.pkl\n",
|
||
"\n",
|
||
"✅ Saved feature names: phase2_output/models/feature_names_20260106_011633.pkl\n",
|
||
"\n",
|
||
"✅ Saved metadata: phase2_output/models/model_metadata_20260106_011633.pkl\n",
|
||
"\n",
|
||
"✅ Saved complete package: phase2_output/models/phase2_complete_package_20260106_011633.joblib\n",
|
||
" Package size: 102.67 KB\n",
|
||
"\n",
|
||
"✅ Saved README: phase2_output/models/README_20260106_011633.md\n",
|
||
"\n",
|
||
"================================================================================\n",
|
||
"✅ ALL ARTIFACTS SAVED SUCCESSFULLY!\n",
|
||
"================================================================================\n",
|
||
"\n",
|
||
"📁 Saved files:\n",
|
||
" 1. Best model: phase2_output/models/best_model_Decision_Tree_20260106_011633.pkl\n",
|
||
" 2. Scaler: phase2_output/models/scaler_20260106_011633.pkl\n",
|
||
" 3. TF-IDF: phase2_output/models/tfidf_vectorizer_20260106_011633.pkl\n",
|
||
" 4. SVD: phase2_output/models/svd_model_20260106_011633.pkl\n",
|
||
" 5. Features: phase2_output/models/feature_names_20260106_011633.pkl\n",
|
||
" 6. Metadata: phase2_output/models/model_metadata_20260106_011633.pkl\n",
|
||
" 7. Complete package: phase2_output/models/phase2_complete_package_20260106_011633.joblib\n",
|
||
" 8. README: phase2_output/models/README_20260106_011633.md\n",
|
||
"\n",
|
||
"💡 For deployment, use: phase2_output/models/phase2_complete_package_20260106_011633.joblib\n",
|
||
"================================================================================\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import os\n",
|
||
"import pickle\n",
|
||
"import joblib\n",
|
||
"from datetime import datetime\n",
|
||
"\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(\"💾 SAVING BEST MODEL & ARTIFACTS\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"\n",
|
||
"# Create timestamp\n",
|
||
"timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')\n",
|
||
"\n",
|
||
"# 1. Save the best model\n",
|
||
"model_filename = f'phase2_output/models/best_model_{best_model_phase2.replace(\" \", \"_\")}_{timestamp}.pkl'\n",
|
||
"with open(model_filename, 'wb') as f:\n",
|
||
" pickle.dump(best_model, f)\n",
|
||
"print(f\"\\n✅ Saved best model: {model_filename}\")\n",
|
||
"print(f\" Model type: {best_model_phase2}\")\n",
|
||
"print(f\" Val R²: {val_r2:.4f}\")\n",
|
||
"print(f\" Test R²: {test_r2:.4f}\")\n",
|
||
"\n",
|
||
"# 2. Save the scaler\n",
|
||
"scaler_filename = f'phase2_output/models/scaler_{timestamp}.pkl'\n",
|
||
"with open(scaler_filename, 'wb') as f:\n",
|
||
" pickle.dump(scaler, f)\n",
|
||
"print(f\"\\n✅ Saved scaler: {scaler_filename}\")\n",
|
||
"\n",
|
||
"# 3. Save TF-IDF vectorizer\n",
|
||
"tfidf_filename = f'phase2_output/models/tfidf_vectorizer_{timestamp}.pkl'\n",
|
||
"with open(tfidf_filename, 'wb') as f:\n",
|
||
" pickle.dump(tfidf, f)\n",
|
||
"print(f\"\\n✅ Saved TF-IDF vectorizer: {tfidf_filename}\")\n",
|
||
"\n",
|
||
"# 4. Save SVD model\n",
|
||
"svd_filename = f'phase2_output/models/svd_model_{timestamp}.pkl'\n",
|
||
"with open(svd_filename, 'wb') as f:\n",
|
||
" pickle.dump(svd, f)\n",
|
||
"print(f\"\\n✅ Saved SVD model: {svd_filename}\")\n",
|
||
"\n",
|
||
"# 5. Save feature names\n",
|
||
"feature_names_file = f'phase2_output/models/feature_names_{timestamp}.pkl'\n",
|
||
"with open(feature_names_file, 'wb') as f:\n",
|
||
" pickle.dump(X_combined.columns.tolist(), f)\n",
|
||
"print(f\"\\n✅ Saved feature names: {feature_names_file}\")\n",
|
||
"\n",
|
||
"# 6. Save model metadata\n",
|
||
"metadata = {\n",
|
||
" 'model_name': best_model_phase2,\n",
|
||
" 'timestamp': timestamp,\n",
|
||
" 'dataset': 'FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx',\n",
|
||
" 'n_samples': len(df),\n",
|
||
" 'n_features': X_combined.shape[1],\n",
|
||
" 'n_numeric_features': len(numeric_cols),\n",
|
||
" 'n_text_features': n_components,\n",
|
||
" 'train_size': len(X_train),\n",
|
||
" 'val_size': len(X_val),\n",
|
||
" 'test_size': len(X_test),\n",
|
||
" 'val_r2': float(val_r2),\n",
|
||
" 'val_mae': float(val_mae),\n",
|
||
" 'val_rmse': float(val_rmse),\n",
|
||
" 'test_r2': float(test_r2),\n",
|
||
" 'test_mae': float(test_mae),\n",
|
||
" 'test_rmse': float(test_rmse),\n",
|
||
" 'tfidf_params': {\n",
|
||
" 'max_features': max_features,\n",
|
||
" 'ngram_range': ngram_range,\n",
|
||
" 'min_df': min_df,\n",
|
||
" 'max_df': max_df\n",
|
||
" },\n",
|
||
" 'svd_params': {\n",
|
||
" 'n_components': n_components,\n",
|
||
" 'explained_variance': float(svd.explained_variance_ratio_.sum())\n",
|
||
" },\n",
|
||
" 'random_state': RANDOM_STATE\n",
|
||
"}\n",
|
||
"\n",
|
||
"metadata_filename = f'phase2_output/models/model_metadata_{timestamp}.pkl'\n",
|
||
"with open(metadata_filename, 'wb') as f:\n",
|
||
" pickle.dump(metadata, f)\n",
|
||
"print(f\"\\n✅ Saved metadata: {metadata_filename}\")\n",
|
||
"\n",
|
||
"# 7. Save everything as a single package (joblib serializes NumPy-heavy objects efficiently; no compression is applied by default)\n",
|
||
"package_filename = f'phase2_output/models/phase2_complete_package_{timestamp}.joblib'\n",
|
||
"model_package = {\n",
|
||
" 'model': best_model,\n",
|
||
" 'scaler': scaler,\n",
|
||
" 'tfidf': tfidf,\n",
|
||
" 'svd': svd,\n",
|
||
" 'feature_names': X_combined.columns.tolist(),\n",
|
||
" 'metadata': metadata\n",
|
||
"}\n",
|
||
"joblib.dump(model_package, package_filename)\n",
|
||
"print(f\"\\n✅ Saved complete package: {package_filename}\")\n",
|
||
"print(f\" Package size: {os.path.getsize(package_filename) / 1024:.2f} KB\")\n",
|
||
"\n",
|
||
"# 8. Create a README for the saved models\n",
|
||
"readme_content = f\"\"\"# Phase 2 Model Package - {timestamp}\n",
|
||
"\n",
|
||
"## Model Information\n",
|
||
"- **Model Type**: {best_model_phase2}\n",
|
||
"- **Dataset**: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx\n",
|
||
"- **Total Samples**: {len(df)}\n",
|
||
"- **Total Features**: {X_combined.shape[1]} (Numeric: {len(numeric_cols)}, Text SVD: {n_components})\n",
|
||
"\n",
|
||
"## Performance Metrics\n",
|
||
"\n",
|
||
"### Validation Set\n",
|
||
"- **R² Score**: {val_r2:.4f}\n",
|
||
"- **MAE**: {val_mae:.4f}\n",
|
||
"- **RMSE**: {val_rmse:.4f}\n",
|
||
"\n",
|
||
"### Test Set\n",
|
||
"- **R² Score**: {test_r2:.4f}\n",
|
||
"- **MAE**: {test_mae:.4f}\n",
|
||
"- **RMSE**: {test_rmse:.4f}\n",
|
||
"\n",
|
||
"## Files Included\n",
|
||
"\n",
|
||
"1. **best_model_{best_model_phase2.replace(\" \", \"_\")}_{timestamp}.pkl** - Trained model\n",
|
||
"2. **scaler_{timestamp}.pkl** - StandardScaler applied to the combined (numeric + text SVD) feature matrix\n",
|
||
"3. **tfidf_vectorizer_{timestamp}.pkl** - TF-IDF vectorizer for text\n",
|
||
"4. **svd_model_{timestamp}.pkl** - SVD dimensionality reduction\n",
|
||
"5. **feature_names_{timestamp}.pkl** - List of all feature names\n",
|
||
"6. **model_metadata_{timestamp}.pkl** - Complete metadata dictionary\n",
|
||
"7. **phase2_complete_package_{timestamp}.joblib** - All-in-one package (recommended for deployment)\n",
|
||
"\n",
|
||
"## How to Load and Use\n",
|
||
"\n",
|
||
"### Option 1: Load Complete Package (Recommended)\n",
|
||
"```python\n",
|
||
"import joblib\n",
|
||
"import pandas as pd\n",
|
||
"\n",
|
||
"# Load package\n",
|
||
"package = joblib.load('phase2_complete_package_{timestamp}.joblib')\n",
|
||
"model = package['model']\n",
|
||
"scaler = package['scaler']\n",
|
||
"tfidf = package['tfidf']\n",
|
||
"svd = package['svd']\n",
|
||
"\n",
|
||
"# Make prediction\n",
|
||
"# 1. Process text\n",
|
||
"text_combined = \"your text here\" # Combined task text\n",
|
||
"tfidf_features = tfidf.transform([text_combined])\n",
|
||
"text_svd = svd.transform(tfidf_features)\n",
|
||
"\n",
|
||
"# 2. Prepare numeric features\n",
|
||
"numeric_features = [...]  # 1-D list of numeric feature values, in the same column order used for training\n",
|
||
"\n",
|
||
"# 3. Combine and scale\n",
|
||
"all_features = pd.concat([\n",
|
||
"    pd.DataFrame([numeric_features], columns=package['feature_names'][:len(numeric_features)]),\n",
|
||
" pd.DataFrame(text_svd, columns=package['feature_names'][len(numeric_features):])\n",
|
||
"], axis=1)\n",
|
||
"all_features_scaled = scaler.transform(all_features)\n",
|
||
"\n",
|
||
"# 4. Predict\n",
|
||
"prediction = model.predict(all_features_scaled)\n",
|
||
"print(f\"Predicted staff count: {{prediction[0]:.0f}}\")\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Option 2: Load Individual Files\n",
|
||
"```python\n",
|
||
"import pickle\n",
|
||
"\n",
|
||
"with open('best_model_{best_model_phase2.replace(\" \", \"_\")}_{timestamp}.pkl', 'rb') as f:\n",
|
||
" model = pickle.load(f)\n",
|
||
"\n",
|
||
"with open('scaler_{timestamp}.pkl', 'rb') as f:\n",
|
||
" scaler = pickle.load(f)\n",
|
||
"\n",
|
||
"# ... (same prediction process as above)\n",
|
||
"```\n",
|
||
"\n",
|
||
"## Model Configuration\n",
|
||
"\n",
|
||
"### TF-IDF Parameters\n",
|
||
"- max_features: {max_features}\n",
|
||
"- ngram_range: {ngram_range}\n",
|
||
"- min_df: {min_df}\n",
|
||
"- max_df: {max_df}\n",
|
||
"\n",
|
||
"### SVD Parameters\n",
|
||
"- n_components: {n_components}\n",
|
||
"- explained_variance: {svd.explained_variance_ratio_.sum()*100:.2f}%\n",
|
||
"\n",
|
||
"### Training Parameters\n",
|
||
"- random_state: {RANDOM_STATE}\n",
|
||
"- train_size: {len(X_train)} ({len(X_train)/len(df)*100:.1f}%)\n",
|
||
"- val_size: {len(X_val)} ({len(X_val)/len(df)*100:.1f}%)\n",
|
||
"- test_size: {len(X_test)} ({len(X_test)/len(df)*100:.1f}%)\n",
|
||
"\n",
|
||
"## Phase 1 vs Phase 2 Comparison\n",
|
||
"\n",
|
||
"Phase 1 (Numeric only): R² = {format(phase1_results[best_model_phase2]['Val_R2'], '.4f') if best_model_phase2 in phase1_results else 'N/A'}\n",
|
||
"Phase 2 (With text): R² = {val_r2:.4f}\n",
|
||
"Improvement: {format(val_r2 - phase1_results[best_model_phase2]['Val_R2'], '+.4f') if best_model_phase2 in phase1_results else 'N/A'}\n",
|
||
"\n",
|
||
"## Notes\n",
|
||
"- This model includes text features extracted from task descriptions\n",
|
||
"- Text preprocessing: lowercase, remove special chars, combine task columns\n",
|
||
"- Feature engineering: TF-IDF → SVD → StandardScaler\n",
|
||
"- Use the same preprocessing pipeline for new predictions\n",
|
||
"\n",
|
||
"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n",
|
||
"\"\"\"\n",
|
||
"\n",
|
||
"readme_filename = f'phase2_output/models/README_{timestamp}.md'\n",
|
||
"with open(readme_filename, 'w', encoding='utf-8') as f:\n",
|
||
" f.write(readme_content)\n",
|
||
"print(f\"\\n✅ Saved README: {readme_filename}\")\n",
|
||
"\n",
|
||
"print(\"\\n\" + \"=\" * 80)\n",
|
||
"print(\"✅ ALL ARTIFACTS SAVED SUCCESSFULLY!\")\n",
|
||
"print(\"=\" * 80)\n",
|
||
"print(f\"\\n📁 Saved files:\")\n",
|
||
"print(f\" 1. Best model: {model_filename}\")\n",
|
||
"print(f\" 2. Scaler: {scaler_filename}\")\n",
|
||
"print(f\" 3. TF-IDF: {tfidf_filename}\")\n",
|
||
"print(f\" 4. SVD: {svd_filename}\")\n",
|
||
"print(f\" 5. Features: {feature_names_file}\")\n",
|
||
"print(f\" 6. Metadata: {metadata_filename}\")\n",
|
||
"print(f\" 7. Complete package: {package_filename}\")\n",
|
||
"print(f\" 8. README: {readme_filename}\")\n",
|
||
"print(f\"\\n💡 For deployment, use: {package_filename}\")\n",
|
||
"print(\"=\" * 80)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "5441ee0c",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "b6aeeb0b",
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": []
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "base",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.13.5"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|