predict_caLamviec_nhansu/ML_PIPELINE_PLAN.md

# 📋 DATA PREPROCESSING & MODEL TRAINING PLAN

**Date:** January 5, 2026
**Target Variable:** `so_luong` (number of staff)
**Dataset:** FINAL_DATASET_WITH_TEXT.xlsx (454 rows × 51 columns)

---

## 🎯 GOAL

Predict the number of staff needed for each work shift from:
- ✅ Task features (25 features derived from text)
- ✅ Shift features (5 features describing the shift)
- ✅ Building features (17 features describing the building)
- ⚪ Text columns (2 columns) - **HELD BACK FOR LATER**

**Total features used:** 47 (excluding the 2 text columns and `ma_dia_diem`)

---

## 📊 CURRENT DATASET ANALYSIS

### Basic information:
- **Total samples:** 454 shifts
- **Target:** `so_luong` (range 0-64, mean=4.64, median=4.0)
- **Features:** 47 (numeric and categorical)
- **Missing values:** To be checked during EDA

### Feature breakdown:

#### 1. Shift Features (5):
- `loai_ca` - Categorical shift type (Part time, Ca sáng [morning], Ca chiều [afternoon], Hành chính [office hours], etc.)
- `bat_dau` - Start time (needs parsing)
- `ket_thuc` - End time (needs parsing)
- `tong_gio_lam` - Total working hours, numeric or time-formatted (needs parsing)
- `so_ca_cua_toa` - Numeric (1-41)

#### 2. Task Features (25):
- Counts (9): `num_tasks`, `num_cleaning_tasks`, etc.
- Areas (10): `num_wc_tasks`, `num_hallway_tasks`, etc.
- Ratios (4): `cleaning_ratio`, `trash_collection_ratio`, etc.
- Complexity (2): `area_diversity`, `task_complexity_score`

#### 3. Building Features (17):
- Categorical (3): `loai_hinh`, `ten_toa_thap`, `muc_do_luu_luong`
- Numeric (14): `so_tang`, `dien_tich_*`, binary features

---

## 🔧 PHASE 1: DATA PREPROCESSING

### Step 1.1: Exploratory Data Analysis (EDA)

**Script:** `01_eda_analysis.py`

```python
Tasks:
1. Load the dataset and check shape and dtypes
2. Analyze missing values
3. Compute descriptive statistics (describe) for all features
4. Analyze the target variable (so_luong):
   - Distribution (histogram, boxplot)
   - Outlier detection
   - Skewness and kurtosis
5. Correlation matrix with the target
6. Identify zero-variance features
7. Save the EDA report
```

**Output:**
- `EDA_REPORT.md` - Detailed report
- `eda_plots/` - Analysis plots

**Time:** ~30 minutes
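
The EDA tasks above can be sketched with pandas as follows. This is a minimal illustration on a made-up six-row sample, not the real dataset; the column names `constant_flag` and `dien_tich_san` are placeholders, and the 1.5 × IQR outlier rule is an assumption rather than something the plan prescribes.

```python
import pandas as pd

# Hypothetical mini-sample; the real script would load FINAL_DATASET_WITH_TEXT.xlsx.
df = pd.DataFrame({
    "so_luong": [2, 4, 4, 0, 6, 64],
    "num_tasks": [5, 10, 9, 1, 14, 40],
    "so_tang": [3, 5, 5, 2, 8, 20],
    "constant_flag": [1, 1, 1, 1, 1, 1],                     # placeholder zero-variance column
    "dien_tich_san": [100.0, None, 250.0, 80.0, 400.0, 900.0],  # placeholder area column
})

missing = df.isna().sum()                    # missing values per column
target_stats = df["so_luong"].describe()     # count, mean, std, quartiles
skewness = df["so_luong"].skew()             # positive means a long right tail
kurtosis = df["so_luong"].kurt()

# Outliers in the target using the 1.5 * IQR rule (an assumed convention).
q1, q3 = df["so_luong"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df[df["so_luong"] > upper_fence]

# Zero-variance columns carry no signal and would produce NaN correlations.
zero_variance = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
corr_with_target = (
    df.drop(columns=zero_variance).corr()["so_luong"].drop("so_luong")
)
```

On the real data, `missing`, `target_stats`, and `corr_with_target` would feed the tables in `EDA_REPORT.md`.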

---

### Step 1.2: Data Cleaning

**Script:** `02_data_cleaning.py`

```python
Tasks:
1. Handle missing values:
   - Task features: fill with 0 (the shift has no such task)
   - Building features: fill with median or mode
   - Shift features: strategy to be decided

2. Handle outliers in the target:
   - Inspect so_luong = 0 (16 shifts)
   - Decide whether to keep, remove, or cap

3. Remove duplicate rows (if any)

4. Drop columns that are not needed:
   - ma_dia_diem (identifier, not used for training)
   - all_task_normal (held back for later)
   - all_task_dinhky (held back for later)
   - ten_toa_thap (possibly redundant with ma_dia_diem; note it is still counted among the 17 building features)

5. Validate data quality after cleaning
```

**Output:**
- `CLEANED_DATA.csv` - Dataset after cleaning
- `CLEANING_REPORT.md` - Detailed report

**Time:** ~20 minutes
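
A possible shape for the cleaning step, shown on a made-up four-row frame. The column groupings and the `dien_tich_san` name are placeholders. Rows with `so_luong = 0` are only counted here, since the plan leaves the keep/remove/cap decision open.

```python
import pandas as pd

# Hypothetical rows; the first two are exact duplicates.
df = pd.DataFrame({
    "ma_dia_diem": ["A1", "A1", "B2", "C3"],
    "all_task_normal": ["lau san", "lau san", None, "thu rac"],
    "num_wc_tasks": [2.0, 2.0, None, 1.0],
    "dien_tich_san": [100.0, 100.0, None, 300.0],
    "so_luong": [3, 3, 0, 5],
})

TASK_COLS = ["num_wc_tasks"]                    # missing means "no such task"
BUILDING_NUMERIC_COLS = ["dien_tich_san"]       # placeholder building column
DROP_COLS = ["ma_dia_diem", "all_task_normal"]  # identifier and held-back text


def clean(frame):
    """Return (cleaned frame, small report dict) without mutating the input."""
    out = frame.drop_duplicates().copy()
    out[TASK_COLS] = out[TASK_COLS].fillna(0)
    medians = out[BUILDING_NUMERIC_COLS].median()
    out[BUILDING_NUMERIC_COLS] = out[BUILDING_NUMERIC_COLS].fillna(medians)
    report = {
        "duplicates_removed": len(frame) - len(out),
        "zero_target_rows": int((out["so_luong"] == 0).sum()),
    }
    return out.drop(columns=DROP_COLS).reset_index(drop=True), report


cleaned, report = clean(df)
```

The report dictionary is one way to collect the numbers that `CLEANING_REPORT.md` would present.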

---

### Step 1.3: Feature Engineering

**Script:** `03_feature_engineering.py`

```python
Tasks:

A. TIME FEATURES (from bat_dau, ket_thuc, tong_gio_lam):
   - hour_start (start hour: 0-23)
   - hour_end (end hour: 0-23)
   - work_hours_numeric (hours worked: float)
   - is_morning_shift (6-12h: 1/0)
   - is_afternoon_shift (12-18h: 1/0)
   - is_evening_shift (18-24h: 1/0)
   - is_night_shift (0-6h: 1/0)
   - is_cross_day (overnight shift: 1/0)

B. INTERACTION FEATURES:
   - tasks_per_hour = num_tasks / work_hours_numeric
   - tasks_per_floor = num_tasks / so_tang
   - wc_per_floor = num_wc_tasks / so_tang
   - cleaning_workload = num_cleaning_tasks * area_diversity

C. AGGREGATION FEATURES:
   - total_area = sum(dien_tich_*)
   - area_per_floor = total_area / so_tang
   - has_special_areas = (num_patient_room_tasks +
                           num_surgery_room_tasks +
                           num_clinic_room_tasks) > 0

D. CATEGORICAL ENCODING:
   - loai_ca: One-hot encoding
   - loai_hinh: Label encoding (ordinal)
   - muc_do_luu_luong: Label encoding (ordinal)

E. FEATURE SELECTION (optional):
   - Remove zero-variance features
   - Remove highly correlated features (>0.95)
```
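
The time features in group A could be derived along these lines. The sketch assumes `bat_dau` and `ket_thuc` arrive as `HH:MM` or `HH:MM:SS` strings, that the shift-period flags are based on the start hour, and that an end time at or before the start time means the shift crosses midnight; none of these are confirmed by the dataset description.

```python
from datetime import datetime


def parse_hhmm(value: str) -> float:
    """Parse 'HH:MM' or 'HH:MM:SS' into fractional hours (assumed format)."""
    fmt = "%H:%M:%S" if value.count(":") == 2 else "%H:%M"
    t = datetime.strptime(value.strip(), fmt)
    return t.hour + t.minute / 60 + t.second / 3600


def time_features(bat_dau: str, ket_thuc: str) -> dict:
    start, end = parse_hhmm(bat_dau), parse_hhmm(ket_thuc)
    is_cross_day = end <= start  # assumption: the shift runs past midnight
    hours = end - start + (24 if is_cross_day else 0)
    hour_start = int(start)
    return {
        "hour_start": hour_start,
        "hour_end": int(end),
        "work_hours_numeric": hours,
        "is_morning_shift": int(6 <= hour_start < 12),
        "is_afternoon_shift": int(12 <= hour_start < 18),
        "is_evening_shift": int(18 <= hour_start < 24),
        "is_night_shift": int(0 <= hour_start < 6),
        "is_cross_day": int(is_cross_day),
    }


morning = time_features("06:00", "14:30")
overnight = time_features("22:00", "06:00")
```

Here `morning["work_hours_numeric"]` is 8.5 and `overnight["work_hours_numeric"]` is 8.0, with `is_cross_day` set only for the second shift.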

**Output:**
- `ENGINEERED_DATA.csv` - Dataset with the new features
- `FEATURE_ENGINEERING_REPORT.md` - Report

**Time:** ~40 minutes
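
For the interaction features and encodings, one thing to watch is division by zero (for example a shift with 0 recorded hours). The sketch below guards the ratios and shows both encoding styles. The traffic labels `low`/`medium`/`high` and their order are placeholders, since the actual values of `muc_do_luu_luong` are not listed in this plan.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num_tasks": [12, 8, 20],
    "num_wc_tasks": [4, 0, 6],
    "work_hours_numeric": [8.0, 4.0, 0.0],          # 0 hours: ratio is undefined
    "so_tang": [4, 2, 10],
    "loai_ca": ["Ca sáng", "Part time", "Ca sáng"],
    "muc_do_luu_luong": ["low", "high", "medium"],  # placeholder labels
})


def safe_ratio(num, den):
    """Element-wise division that yields NaN where the denominator is 0."""
    return num / den.replace(0, np.nan)


df["tasks_per_hour"] = safe_ratio(df["num_tasks"], df["work_hours_numeric"])
df["tasks_per_floor"] = safe_ratio(df["num_tasks"], df["so_tang"])
df["wc_per_floor"] = safe_ratio(df["num_wc_tasks"], df["so_tang"])

# Ordinal encoding with an explicit order instead of alphabetical codes.
TRAFFIC_ORDER = {"low": 0, "medium": 1, "high": 2}
df["muc_do_luu_luong_enc"] = df["muc_do_luu_luong"].map(TRAFFIC_ORDER)

# One-hot encoding for the nominal shift type.
df = pd.get_dummies(df, columns=["loai_ca"], prefix="loai_ca", dtype=int)
```

An explicit mapping keeps the ordinal codes stable between training and inference, which an automatic label encoder does not guarantee when new categories appear.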

---

### Step 1.4: Feature Scaling

**Script:** `04_feature_scaling.py`

```python
Tasks:
1. Separate features by type:
   - Numeric features that need scaling
   - Binary features (0/1) - no scaling needed
   - One-hot encoded features - no scaling needed

2. Apply a scaling method:
   Option A: StandardScaler (mean=0, std=1) - Recommended
   Option B: MinMaxScaler (0-1)
   Option C: RobustScaler (robust to outliers)

3. Save the scaler object for use at inference

4. Validate scaling: check mean and std after scaling
```

**Output:**
- `SCALED_DATA.csv` - Scaled dataset
- `scaler.pkl` - Scaler object
- `SCALING_REPORT.md` - Report

**Time:** ~15 minutes
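
A minimal sketch of Option A using scikit-learn's `ColumnTransformer`, scaling only the numeric columns and passing binary columns through. It fits the scaler on training rows only and reuses it for new rows; this is an assumption on my part to avoid leaking validation/test statistics, and it implies the scaler would be fitted after the split in Step 1.5. The `has_wc` column is a placeholder, and the pickle round trip stands in for writing `scaler.pkl`.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({
    "so_tang": [2, 4, 6, 8],
    "num_tasks": [5, 10, 15, 30],
    "has_wc": [0, 1, 1, 0],        # placeholder binary feature
})
new_rows = pd.DataFrame({"so_tang": [5], "num_tasks": [15], "has_wc": [1]})

NUMERIC = ["so_tang", "num_tasks"]

scaler = ColumnTransformer(
    [("num", StandardScaler(), NUMERIC)],
    remainder="passthrough",       # binary / one-hot columns are left as-is
)
train_scaled = scaler.fit_transform(train)      # fit on training rows only

# Round trip through pickle, as would happen when saving and loading scaler.pkl.
restored = pickle.loads(pickle.dumps(scaler))
new_scaled = restored.transform(new_rows)

# Validation: scaled numeric columns should have mean 0 and std 1 on train.
means = train_scaled[:, :2].mean(axis=0)
stds = train_scaled[:, :2].std(axis=0)
```

The output column order is the scaled numeric columns first, then the passthrough columns.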

---

### Step 1.5: Train/Validation/Test Split

**Script:** `05_train_test_split.py`

```python
Tasks:
1. Split strategy:
   - Train: 70% (318 samples)
   - Validation: 15% (68 samples)
   - Test: 15% (68 samples)

2. Stratified split (if needed):
   - Stratify by loai_ca or binned so_luong

3. Random state = 42 (reproducibility)

4. Validate split:
   - Check the distribution of so_luong in each set
   - Check that there is no data leakage

5. Save splits
```

**Output:**
- `train.csv` (318 rows)
- `val.csv` (68 rows)
- `test.csv` (68 rows)
- `SPLIT_REPORT.md`

**Time:** ~10 minutes
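
The 318/68/68 split can be produced with two calls to `train_test_split`, passing integer sizes so the counts match the plan exactly. The data below is synthetic, and stratifying on quartile bins of `so_luong` is just one of the two options the plan mentions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the real dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "num_tasks": rng.integers(1, 40, size=454),
    "so_luong": rng.poisson(4.6, size=454),
})

# Bin the target into (up to) 4 quantile groups for stratification.
bins = pd.qcut(df["so_luong"], q=4, labels=False, duplicates="drop")

# First carve out the test set, then the validation set from the remainder.
rest, test = train_test_split(
    df, test_size=68, random_state=42, stratify=bins
)
train, val = train_test_split(
    rest, test_size=68, random_state=42, stratify=bins.loc[rest.index]
)

# Leakage check: no row index may appear in more than one split.
overlap = (
    set(train.index) & set(val.index)
    | set(train.index) & set(test.index)
    | set(val.index) & set(test.index)
)
```

Comparing `train["so_luong"].describe()` with the same summary for `val` and `test` covers the distribution check in task 4.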

---

## 🤖 PHASE 2: MODEL TRAINING

### Step 2.1: Baseline Model

**Script:** `06_baseline_model.py`

```python
Purpose: build simple baselines for comparison

Models:
1. Mean Baseline: predict mean(so_luong) for every shift
2. Linear Regression: Simple linear model
3. Decision Tree (max_depth=5): Simple tree

Metrics:
- MAE (Mean Absolute Error)
- RMSE (Root Mean Squared Error)
- R² Score
- MAPE (Mean Absolute Percentage Error)

Output:
- Baseline scores
- Simple visualizations
```

**Output:**
- `BASELINE_RESULTS.md`
- `baseline_models/` - Saved models

**Time:** ~15 minutes
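
The four metrics and the mean baseline are simple enough to write out directly. One caveat: MAPE is undefined when the true value is 0, and the dataset has shifts with `so_luong = 0`. The sketch below skips those rows when computing MAPE, which is an assumed convention, not one stated in the plan. The numbers are made up.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mape(y_true, y_pred):
    """MAPE in percent over rows with a non-zero target (assumed convention)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        return float("nan")
    return 100 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Mean baseline: the mean is learned from training targets only.
train_y = [2, 4, 6]
val_y = [0, 2, 4, 6, 8]
baseline_pred = [sum(train_y) / len(train_y)] * len(val_y)

scores = {
    "MAE": mae(val_y, baseline_pred),
    "RMSE": rmse(val_y, baseline_pred),
    "R2": r2(val_y, baseline_pred),
    "MAPE": mape(val_y, baseline_pred),
}
```

A mean predictor scores R² = 0 when the training mean equals the validation mean, as it does in this toy example, which is the floor any real model should beat.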

---

### Step 2.2: Advanced Models Training

**Script:** `07_train_models.py`

```python
Models to train:

1. Random Forest Regressor
   - n_estimators: [100, 200, 300]
   - max_depth: [10, 20, 30, None]
   - min_samples_split: [2, 5, 10]

2. Gradient Boosting Regressor
   - n_estimators: [100, 200]
   - learning_rate: [0.01, 0.05, 0.1]
   - max_depth: [3, 5, 7]

3. XGBoost Regressor
   - n_estimators: [100, 200]
   - learning_rate: [0.01, 0.05, 0.1]
   - max_depth: [3, 5, 7]
   - subsample: [0.8, 1.0]

4. LightGBM Regressor
   - n_estimators: [100, 200]
   - learning_rate: [0.01, 0.05, 0.1]
   - num_leaves: [31, 50, 70]

Training approach:
- Cross-validation (5-fold) on the train set
- Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
- Evaluate on the validation set
- Save best models
```

**Output:**
- `trained_models/` - Saved models
- `MODEL_TRAINING_REPORT.md`
- `training_logs/` - Training logs

**Time:** ~1-2 hours (depending on hyperparameter tuning)
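
As an illustration of the training approach, here is a reduced grid search for the Random Forest using 5-fold cross-validation and MAE as the selection metric. The data is synthetic and the grid is deliberately smaller than the one listed above so the example runs in seconds; choosing MAE for `scoring` is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the training split.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# Reduced grid (8 combinations); the plan's full grid has 36.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, None],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mae = -search.best_score_       # flip the sign back to a plain MAE
best_model = search.best_estimator_     # refit on all of X, y by default
```

With the full grids, `RandomizedSearchCV` with a fixed `n_iter` is one way to keep the tuning time within the estimate above.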

---

### Step 2.3: Model Evaluation

**Script:** `08_model_evaluation.py`

```python
Tasks:

1. Load all trained models

2. Evaluate on the validation set:
   - MAE, RMSE, R², MAPE
   - Residual plots
   - Actual vs Predicted plots

3. Compare models:
   - Performance comparison table
   - Visual comparison (bar charts)

4. Analyze errors:
   - Where models fail (high errors)
   - Error distribution by loai_ca, so_tang, etc.

5. Feature importance:
   - Top 20 important features
   - SHAP values (if possible)
```

**Output:**
- `EVALUATION_REPORT.md`
- `evaluation_plots/`
- `feature_importance.csv`

**Time:** ~30 minutes
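
The error breakdown in task 4 amounts to a group-by over absolute errors. The sketch below uses invented validation rows and a hypothetical `predicted` column holding model output.

```python
import pandas as pd

# Invented validation rows with a hypothetical `predicted` column.
val = pd.DataFrame({
    "loai_ca": ["Ca sáng", "Ca sáng", "Ca chiều", "Ca chiều", "Part time"],
    "so_luong": [4, 6, 3, 5, 1],
    "predicted": [5.0, 5.0, 3.0, 9.0, 1.5],
})
val["abs_error"] = (val["so_luong"] - val["predicted"]).abs()

# Mean / max error and sample count per shift type, worst group first.
by_shift = (
    val.groupby("loai_ca")["abs_error"]
    .agg(["mean", "max", "count"])
    .sort_values("mean", ascending=False)
)

# The single worst prediction, for manual inspection.
worst = val.nlargest(1, "abs_error")
```

The same pattern works for `so_tang` after binning it, for example with `pd.cut`.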

---

### Step 2.4: Final Model Selection & Test

**Script:** `09_final_evaluation.py`

```python
Tasks:

1. Select best model based on validation performance

2. Re-train best model on train+val combined (optional)

3. Final evaluation on test set:
   - MAE, RMSE, R², MAPE
   - Confidence intervals
   - Error analysis

4. Create production-ready pipeline:
   - Preprocessing + Model
   - Save as single pickle file

5. Generate final report
```
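
The plan does not say how the confidence intervals in task 3 are obtained. One common choice for a small test set is a percentile bootstrap over the per-row errors, sketched here for MAE with invented predictions; treat the method and its parameters as assumptions.

```python
import random


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Return (MAE, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    # Resample the errors with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(errors, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(errors) / n, lower, upper


# Invented test-set values for illustration.
y_true = [4, 6, 3, 5, 1, 8, 2, 4, 7, 5]
y_pred = [5, 5, 3, 7, 1, 6, 2, 5, 7, 4]
point, lower, upper = bootstrap_mae_ci(y_true, y_pred)
```

With only 68 test rows the interval will be fairly wide, which is worth stating next to the headline MAE in the final report.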

**Output:**
- `FINAL_MODEL.pkl` - Production model
- `FINAL_EVALUATION_REPORT.md`
- `final_plots/`

**Time:** ~20 minutes
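
Task 4 (preprocessing and model in a single pickle) maps naturally onto a scikit-learn `Pipeline`. The following is a sketch on a tiny invented frame with three features. The Random Forest is a stand-in for whichever model wins, and `pickle.dumps`/`pickle.loads` stands in for writing and reading `FINAL_MODEL.pkl`.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "loai_ca": ["Ca sáng", "Ca chiều", "Part time", "Ca sáng", "Ca chiều", "Part time"],
    "num_tasks": [12, 9, 3, 15, 10, 4],
    "so_tang": [5, 5, 2, 8, 6, 2],
    "so_luong": [5, 4, 1, 7, 4, 1],
})
features = ["loai_ca", "num_tasks", "so_tang"]

pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        # Unseen shift types at inference time are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["loai_ca"]),
        ("num", StandardScaler(), ["num_tasks", "so_tang"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
pipeline.fit(train[features], train["so_luong"])

# Serialize and restore the whole pipeline as one object.
blob = pickle.dumps(pipeline)
loaded = pickle.loads(blob)

# Raw, unscaled input goes straight in; "Hành chính" was not seen in training.
new_shift = pd.DataFrame({"loai_ca": ["Hành chính"], "num_tasks": [11], "so_tang": [5]})
prediction = loaded.predict(new_shift)
```

Bundling the steps this way means inference code never has to load a separate scaler or re-implement the encoding.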

---

## 📊 PHASE 3: ANALYSIS & INSIGHTS

### Step 3.1: Feature Importance Analysis

**Script:** `10_feature_analysis.py`

```python
Tasks:
1. Feature importance from best model
2. SHAP values analysis (detailed)
3. Partial dependence plots
4. Feature interactions
5. Recommendations for feature engineering v2
```

**Output:**
- `FEATURE_IMPORTANCE_REPORT.md`
- `shap_plots/`

**Time:** ~30 minutes
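
SHAP requires an extra dependency, so as a lighter-weight complement this sketch uses scikit-learn's `permutation_importance`, a different technique from SHAP that measures how much the score drops when one feature is shuffled. The data is synthetic, with a deliberately uninformative `noise` feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the first feature dominates, the third is pure noise.
rng = np.random.default_rng(42)
feature_names = ["num_tasks", "so_tang", "noise"]
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + 1 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Ideally this is computed on held-out data; the training data is reused here
# only to keep the example short.
result = permutation_importance(
    model, X, y,
    n_repeats=5,
    random_state=42,
    scoring="neg_mean_absolute_error",
)

# Features sorted from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
```

The sorted `ranking` list can be written out directly as `feature_importance.csv`.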

---

### Step 3.2: Business Insights

**Script:** `11_business_insights.py`

```python
Tasks:
1. Analysis by shift type:
   - Which shift types need the most staff?
   - Important features for each shift type

2. Analysis by building type:
   - Which building type is the most complex?
   - Correlation with staff count

3. Task features impact:
   - Which tasks have the largest effect?
   - Recommendations for task planning

4. Optimization opportunities:
   - Ways to reduce staff count while staying effective
   - Resource allocation recommendations
```

**Output:**
- `BUSINESS_INSIGHTS_REPORT.md`
- `insights_plots/`

**Time:** ~30 minutes

---

## 📁 FOLDER STRUCTURE

```
Predict_calamviecHM/
├── data/
│   ├── raw/
│   │   └── FINAL_DATASET_WITH_TEXT.xlsx
│   ├── cleaned/
│   │   ├── CLEANED_DATA.csv
│   │   ├── ENGINEERED_DATA.csv
│   │   └── SCALED_DATA.csv
│   └── splits/
│       ├── train.csv
│       ├── val.csv
│       └── test.csv
│
├── models/
│   ├── baseline_models/
│   ├── trained_models/
│   │   ├── random_forest.pkl
│   │   ├── xgboost.pkl
│   │   ├── lightgbm.pkl
│   │   └── gradient_boosting.pkl
│   ├── scaler.pkl
│   └── FINAL_MODEL.pkl
│
├── scripts/
│   ├── 01_eda_analysis.py
│   ├── 02_data_cleaning.py
│   ├── 03_feature_engineering.py
│   ├── 04_feature_scaling.py
│   ├── 05_train_test_split.py
│   ├── 06_baseline_model.py
│   ├── 07_train_models.py
│   ├── 08_model_evaluation.py
│   ├── 09_final_evaluation.py
│   ├── 10_feature_analysis.py
│   └── 11_business_insights.py
│
├── reports/
│   ├── EDA_REPORT.md
│   ├── CLEANING_REPORT.md
│   ├── FEATURE_ENGINEERING_REPORT.md
│   ├── MODEL_TRAINING_REPORT.md
│   ├── EVALUATION_REPORT.md
│   ├── FINAL_EVALUATION_REPORT.md
│   ├── FEATURE_IMPORTANCE_REPORT.md
│   └── BUSINESS_INSIGHTS_REPORT.md
│
├── plots/
│   ├── eda_plots/
│   ├── evaluation_plots/
│   ├── final_plots/
│   ├── shap_plots/
│   └── insights_plots/
│
└── notebooks/
    ├── EDA_Notebook.ipynb
    └── Model_Comparison.ipynb
```

---

## 📊 EXPECTED PERFORMANCE

### Target Metrics (Realistic):

| Metric | Baseline | Target | Stretch |
|--------|----------|--------|---------|
| MAE | ~2.5 | <2.0 | <1.5 |
| RMSE | ~3.5 | <3.0 | <2.5 |
| R² | 0.30 | >0.60 | >0.75 |
| MAPE | ~50% | <30% | <20% |

**Notes:**
- Baseline: Simple mean/linear model
- Target: A reasonable goal for this dataset
- Stretch: Ideal goal (may be hard to reach)

---

## ⏱️ TIMELINE OVERVIEW

| Phase | Time | Status |
|-------|------|--------|
| **Phase 1: Preprocessing** | ~2 hours | 📋 Planned |
| - EDA | 30 min | |
| - Cleaning | 20 min | |
| - Feature Engineering | 40 min | |
| - Scaling | 15 min | |
| - Split | 10 min | |
| **Phase 2: Training** | ~2-3 hours | 📋 Planned |
| - Baseline | 15 min | |
| - Advanced Models | 1-2 h | |
| - Evaluation | 30 min | |
| - Final Selection | 20 min | |
| **Phase 3: Analysis** | ~1 hour | 📋 Planned |
| - Feature Analysis | 30 min | |
| - Business Insights | 30 min | |
| **TOTAL** | **~5-6 hours** | 📋 Planned |

---

## 🎯 FEATURES USED (47 features)

### ✅ Used immediately:

**Shift Features (5):**
1. loai_ca (encoded)
2. hour_start (engineered)
3. hour_end (engineered)
4. work_hours_numeric (engineered)
5. so_ca_cua_toa

**Task Features (25):**
- 6-14: Task counts (9)
- 15-24: Area coverage (10)
- 25-28: Ratios (4)
- 29-30: Diversity & Complexity (2)

**Building Features (17):**
- 31-33: Categorical (3, encoded)
- 34-47: Numeric (14)

### ⚪ Held back for a later phase (text features):
- `all_task_normal` - Text column (TF-IDF/BERT)
- `all_task_dinhky` - Text column (TF-IDF/BERT)

### ❌ Not used:
- `ma_dia_diem` - Identifier
- `bat_dau` - Raw time (already engineered)
- `ket_thuc` - Raw time (already engineered)
- `tong_gio_lam` - Raw time (already engineered)

---

## 🚀 WHERE TO START?

### Option 1: Fully automated pipeline
```python
# Script: run_full_pipeline.py
# Runs all steps 1-11 automatically
```
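
A runner for Option 1 could be as small as the sketch below. It assumes the eleven scripts live in a `scripts/` folder as in the folder structure above and that each can be run standalone; the `runner` parameter is a hypothetical hook that makes a dry run possible without executing anything.

```python
import subprocess
import sys
from pathlib import Path

# Script names taken from the plan, in execution order.
SCRIPTS = [
    "01_eda_analysis.py",
    "02_data_cleaning.py",
    "03_feature_engineering.py",
    "04_feature_scaling.py",
    "05_train_test_split.py",
    "06_baseline_model.py",
    "07_train_models.py",
    "08_model_evaluation.py",
    "09_final_evaluation.py",
    "10_feature_analysis.py",
    "11_business_insights.py",
]


def run_pipeline(scripts_dir="scripts", runner=subprocess.run):
    """Run every step in order; check=True stops at the first failing script."""
    executed = []
    for name in SCRIPTS:
        path = Path(scripts_dir) / name
        runner([sys.executable, str(path)], check=True)
        executed.append(name)
    return executed


# Dry run: record the commands instead of executing them.
calls = []
executed = run_pipeline(runner=lambda cmd, check: calls.append(cmd))
```

Calling `run_pipeline()` with the default runner would execute the real scripts, so it is left out of the example.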

### Option 2: Step by step (Recommended)
```bash
# Start with EDA
python scripts/01_eda_analysis.py
# Then continue with 02, 03, ...
```

### Option 3: Interactive notebook
```python
# Use Jupyter Notebook
# EDA_Notebook.ipynb for exploration and experiments
```

---

## 📝 IMPORTANT NOTES

### 1. Data Quality:
- ⚠️ 27 rows have no task text (5.9%)
- ⚠️ 16 shifts have so_luong = 0
- ✅ 429/454 shifts have complete data (94.5%)

### 2. Feature Engineering:
- Time features are expected to be very important (start hour in particular)
- Interaction features may boost performance
- The task complexity score is already computed and can be used directly

### 3. Model Selection:
- Tree-based models (RF, XGB, LGBM) often perform best on tabular data
- Ensemble methods can combine several models
- Consider model interpretability for business users

### 4. Text Features (later phase):
- Start once a baseline with the 47 features is in place
- Add TF-IDF vectors built from the text columns
- Compare the performance improvement
- Could raise R² by roughly 5-10% (an estimate, not yet measured)
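
For the later text phase, combining TF-IDF vectors with the numeric features might look like this. The task strings are invented examples, the vectorizer settings are assumptions, and the empty string represents one of the rows without task text (after filling missing values with `""`).

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented task descriptions; the last row has no task text.
task_text = ["lau sàn hành lang, thu gom rác", "vệ sinh wc, lau kính", ""]
numeric = np.array([[12, 5], [8, 3], [0, 2]], dtype=float)

# Word unigrams and bigrams, capped vocabulary (assumed settings).
vectorizer = TfidfVectorizer(max_features=200, ngram_range=(1, 2))
text_matrix = vectorizer.fit_transform(task_text)

# Numeric features first, then the sparse TF-IDF columns.
combined = hstack([csr_matrix(numeric), text_matrix]).tocsr()
```

As with the scaler, the vectorizer should be fitted on training rows only and then reused for validation and test rows, so that the vocabulary does not leak information from held-out text.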

---

## ✅ PRE-START CHECKLIST

- [ ] Dataset ready: `FINAL_DATASET_WITH_TEXT.xlsx`
- [ ] Python environment set up (pandas, scikit-learn, xgboost, lightgbm; openpyxl for reading the `.xlsx`)
- [ ] Create the folder structure
- [ ] Back up the original data
- [ ] Set random seed = 42 (reproducibility)
- [ ] Prepare a notebook/IDE for coding

---

## 🎯 SUCCESS CRITERIA

### Minimum Viable Model:
- ✅ MAE < 2.0
- ✅ R² > 0.60
- ✅ Model is explainable (feature importance)
- ✅ Reproducible (scripts + saved models)

### Stretch Goals:
- 🎯 MAE < 1.5
- 🎯 R² > 0.75
- 🎯 Complete SHAP analysis
- 🎯 Actionable business insights

---

**Plan created by:** GitHub Copilot
**Date:** January 5, 2026
**Status:** 📋 READY TO START
**Next Step:** Start with `01_eda_analysis.py`
