{ "cells": [ { "cell_type": "markdown", "id": "291bb40a", "metadata": {}, "source": [ "# 🤖 ML Pipeline: Staffing Headcount Prediction\n", "\n", "**Objective:** Predict the number of staff (`so_luong`) required for each work shift\n", "\n", "**Dataset:** FINAL_DATASET_WITH_TEXT.xlsx (454 rows × 51 columns)\n", "\n", "**Features used:** 47 features (excluding the 2 text columns, which are left for a later phase)\n", "\n", "**Target:** `so_luong` (0-64 staff)\n", "\n", "---\n", "\n", "## 📋 Notebook Contents:\n", "\n", "1. ✅ Import Libraries & Load Data\n", "2. ✅ Exploratory Data Analysis (EDA)\n", "3. ✅ Data Cleaning & Preprocessing\n", "4. ✅ Feature Engineering\n", "5. ✅ Visualizations & Insights\n", "6. ✅ Train/Val/Test Split\n", "7. ✅ Baseline Models\n", "8. ✅ Advanced Models (RF, XGBoost, LightGBM)\n", "9. ✅ Model Evaluation & Comparison\n", "10. ✅ Feature Importance Analysis\n", "11. ✅ Final Results & Recommendations" ] }, { "cell_type": "markdown", "id": "d2dd2408", "metadata": {}, "source": [ "# 1️⃣ Import Libraries & Load Data" ] }, { "cell_type": "code", "execution_count": 2, "id": "9ada57ec", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "✅ Libraries imported successfully!\n", "📅 Date: 2026-01-05 23:29:39\n" ] } ], "source": [ "# Import essential libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import warnings\n", "import os\n", "from datetime import datetime\n", "\n", "# Machine Learning libraries\n", "from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n", "from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor\n", "from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score\n", "import xgboost as xgb\n", "import lightgbm as lgb\n", "\n", "# Configuration\n", "warnings.filterwarnings('ignore')\n", "plt.style.use('seaborn-v0_8-darkgrid')\n", "sns.set_palette(\"husl\")\n", "pd.set_option('display.max_columns', None)\n", "pd.set_option('display.max_rows', 100)\n", "\n", "# Random seed for reproducibility\n", "RANDOM_STATE = 42\n", "np.random.seed(RANDOM_STATE)\n", "\n", "# Create folders for outputs\n", "os.makedirs('data/cleaned', exist_ok=True)\n", "os.makedirs('data/splits', exist_ok=True)\n", "os.makedirs('models', exist_ok=True)\n", "os.makedirs('plots', exist_ok=True)\n", "\n", "print(\"✅ Libraries imported successfully!\")\n", "print(f\"📅 Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\")" ] }, { "cell_type": "code", "execution_count": 27, "id": "0350590f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📂 Loading dataset...\n", "✅ Dataset loaded successfully!\n", "📊 Shape: (454, 51)\n", "📋 Columns: ['ma_dia_diem', 'all_task_normal', 'all_task_dinhky', 'loai_ca', 'bat_dau', 'ket_thuc', 'tong_gio_lam', 'so_ca_cua_toa', 'so_luong', 'num_tasks']... (showing first 10)\n" ] }, { "data": { "text/html": [ "
|  | ma_dia_diem | all_task_normal | all_task_dinhky | loai_ca | bat_dau | ket_thuc | tong_gio_lam | so_ca_cua_toa | so_luong | num_tasks | num_cleaning_tasks | num_trash_collection_tasks | num_monitoring_tasks | num_room_cleaning_tasks | num_deep_cleaning_tasks | num_maintenance_tasks | num_support_tasks | num_other_tasks | num_wc_tasks | num_hallway_tasks | num_lobby_tasks | num_patient_room_tasks | num_clinic_room_tasks | num_surgery_room_tasks | num_outdoor_tasks | num_elevator_tasks | num_office_tasks | num_technical_room_tasks | cleaning_ratio | trash_collection_ratio | monitoring_ratio | room_cleaning_ratio | area_diversity | task_complexity_score | loai_hinh | ten_toa_thap | muc_do_luu_luong | so_tang | so_cua_thang_may | dien_tich_ngoai_canh | dien_tich_sanh | dien_tich_hanh_lang | dien_tich_wc | dien_tich_phong | dien_tich_tham | doc_ham | vien_phan_quang | op_tuong | op_chan_tuong | ranh_thoat_nuoc | dien_tich_kinh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 115-2 | Làm sạch toàn bộ phòng giao dịch tầng 1 (kể cả... | NaN | Part time | 06:30:00 | 10:30:00 | 4.0 | 1 | 1 | 7.0 | 7.0 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 1.0 | 0.0 | 1.000 | 0.1429 | 0.2857 | 0.0000 | 4.0 | 1.38 | 0 | AGRIBANK CHI NHÁNH MỸ ĐÌNH | Trung bình (11–20 người) | 4 | 0 | 0.0 | 0.0 | 0.0 | 15.0 | 290.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 20.0 |
| 1 | 115-5 | NaN | NaN | Hành chính | 06:30:00 | 16:00:00 | 8.0 | 1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | AGRIBANK PGD SỐ 05 | Trung bình (11–20 người) | 1 | 0 | 0.0 | 0.0 | 0.0 | 15.0 | 300.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 30.0 |
| 2 | 101-1 | Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... | Lau bảng biển, bình cứu hỏa , cây nước hành la... | Hành chính | 06:30:00 | 16:00:00 | 7.5 | 6 | 24 | 441.0 | 258.0 | 145.0 | 134.0 | 65.0 | 75.0 | 62.0 | 57.0 | 45.0 | 89.0 | 90.0 | 5.0 | 41.0 | 25.0 | 30.0 | 4.0 | 12.0 | 39.0 | 16.0 | 0.585 | 0.3288 | 0.3039 | 0.1474 | 10.0 | 10.00 | 0 | Tòa 5 tầng | Rất cao (Trên 40 người) | 10 | 18 | 1700.0 | 0.0 | 2600.0 | 348.0 | 6825.0 | 0.0 | 70 | 0 | 9176.0 | 89.0 | 25 | 894.0 |
| 3 | 101-1 | Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... | Lau bảng biển, bình cứu hỏa , cây nước hành la... | Ca sáng | 06:00:00 | 14:00:00 | 8.0 | 6 | 3 | 441.0 | 258.0 | 145.0 | 134.0 | 65.0 | 75.0 | 62.0 | 57.0 | 45.0 | 89.0 | 90.0 | 5.0 | 41.0 | 25.0 | 30.0 | 4.0 | 12.0 | 39.0 | 16.0 | 0.585 | 0.3288 | 0.3039 | 0.1474 | 10.0 | 10.00 | 0 | Tòa 5 tầng | Rất cao (Trên 40 người) | 10 | 18 | 1700.0 | 0.0 | 2600.0 | 348.0 | 6825.0 | 0.0 | 70 | 0 | 9176.0 | 89.0 | 25 | 894.0 |
| 4 | 101-1 | Kiểm tra nhân sự các vị trí Thay rác, đẩy khô,... | Lau bảng biển, bình cứu hỏa , cây nước hành la... | Ca chiều | 14:00:00 | 22:00:00 | 8.0 | 6 | 5 | 441.0 | 258.0 | 145.0 | 134.0 | 65.0 | 75.0 | 62.0 | 57.0 | 45.0 | 89.0 | 90.0 | 5.0 | 41.0 | 25.0 | 30.0 | 4.0 | 12.0 | 39.0 | 16.0 | 0.585 | 0.3288 | 0.3039 | 0.1474 | 10.0 | 10.00 | 0 | Tòa 5 tầng | Rất cao (Trên 40 người) | 10 | 18 | 1700.0 | 0.0 | 2600.0 | 348.0 | 6825.0 | 0.0 | 70 | 0 | 9176.0 | 89.0 | 25 | 894.0 |