2026-01-18 02:07:32 +07:00 · 2026-01-18 02:07:32 +07:00 · bde7ccd980
parent 919d419c39
commit bde7ccd980
29 changed files with 457400 additions and 9708 deletions
--- a/COMPLETE_25_FEATURES.md
+++ b/COMPLETE_25_FEATURES.md
@@ -1,304 +0,0 @@
-# ✅ COMPLETED: EXTRACTION OF 25 KEYWORD FEATURES
-
-**Completion date:** January 5, 2026  
-**Status:** ✅ COMPLETED
-
---
-
-## 📊 RESULTS
-
-### **Files created:**
-
-1. ✅ **`extract_25_features.py`** (411 lines)
-   - Complete extraction function covering all 25 features
-   - Batch script that runs over 302 buildings
-   - Full set of helper functions
-   - Main script with detailed statistics
-
-2. ✅ **`features_25_keywords.csv`** (302 rows × 26 columns)
-   - 302 buildings with 25 features plus the location code
-   - Ready to join with building features
-   - Encoding: UTF-8-sig (supports Vietnamese text)
-
-3. ✅ **`analyze_task_keywords.py`** (208 lines)
-   - Analyzes 30,917 tasks from 302 buildings
-   - Most common verbs and areas
-   - Detailed statistics for each task type
-
-4. ✅ **`RA_SOAT_PHAN_LOAI_FEATURES.md`** (267-line report)
-   - Detailed analysis of the real data
-   - Comparison of the old and new classifications
-   - Rationale for each feature change
-   - Implementation plan
-
-5. ✅ **`TASK_FEATURES.ipynb`** (updated)
-   - Full documentation of the 25 features
-   - Code cells to run the extraction
-   - Visualization and analysis
-   - Integration with the workflow
-
---
-
-## 🎯 THE 25 IMPLEMENTED FEATURES
-
-### **🔹 GROUP 1: TASK COUNTS (9 features)**
-
-| # | Feature | Mean | Description |
-|:-:|:--------|:----------:|:------|
-| 1 | `num_tasks` | 102.87 | Total number of tasks |
-| 2 | `num_cleaning_tasks` | 59.75 | Routine daily cleaning (55.9%) |
-| 3 | `num_trash_collection_tasks` ⭐ | 20.01 | Trash collection (7.9%) |
-| 4 | `num_monitoring_tasks` ⭐ | 18.73 | On-duty/inspection (16.1%) |
-| 5 | `num_room_cleaning_tasks` ⭐ | 0.82 | Room cleaning, HEALTHCARE (0.4%) |
-| 6 | `num_deep_cleaning_tasks` ⭐ | 7.54 | Deep cleaning (4.5%) |
-| 7 | `num_maintenance_tasks` | 0.98 | Maintenance/repair (0.6%) |
-| 8 | `num_support_tasks` ⭐ | 6.50 | Support (5.8%) |
-| 9 | `num_other_tasks` | 22.15 | Other tasks |
-
-### **🔹 GROUP 2: AREA COVERAGE (10 features)**
-
-| # | Feature | Mean | Description |
-|:-:|:--------|:----------:|:------|
-| 10 | `num_wc_tasks` | 20.90 | Restrooms (20.4%) |
-| 11 | `num_hallway_tasks` | 14.69 | Hallways (13.7%) |
-| 12 | `num_lobby_tasks` | 7.74 | Lobby (7.6%) |
-| 13 | `num_patient_room_tasks` ⭐ | 1.55 | Patient rooms (1.5%) |
-| 14 | `num_clinic_room_tasks` ⭐ | 0.32 | Clinic rooms (0.3%) |
-| 15 | `num_surgery_room_tasks` ⭐ | 0.36 | Operating rooms (0.4%) |
-| 16 | `num_outdoor_tasks` | 4.37 | Outdoor areas (4.3%) |
-| 17 | `num_elevator_tasks` | 13.58 | Elevators (10.6%) |
-| 18 | `num_office_tasks` ⭐ | 6.13 | Offices (4.4%) |
-| 19 | `num_technical_room_tasks` ⭐ | 0.75 | Technical rooms (0.2%) |
-
-### **🔹 GROUP 3: RATIOS & COMPLEXITY (6 features)**
-
-| # | Feature | Mean | Description |
-|:-:|:--------|:----------:|:------|
-| 20 | `cleaning_ratio` | 0.6148 | Cleaning ratio (61.48%) |
-| 21 | `trash_collection_ratio` ⭐ | 0.1951 | Trash collection ratio (19.51%) |
-| 22 | `monitoring_ratio` ⭐ | 0.1824 | On-duty/inspection ratio (18.24%) |
-| 23 | `room_cleaning_ratio` ⭐ | 0.0079 | Room cleaning ratio (0.79%) |
-| 24 | `area_diversity` | 4.14 | Area diversity (4.14/10) |
-| 25 | `task_complexity_score` ⭐ | 4.47 | Complexity score (4.47/10) |
-
-**⭐ = the 10 features that are new relative to the original classification**
-
---
-
-## 🏆 TOP PERFORMERS
-
-### **Top 5 buildings by number of tasks:**
-
-| Building | Tasks | Cleaning ratio | Complexity score |
-|:---:|:------------:|:--------------:|:----------------:|
-| 79-1 | 1,009 | 70.7% | 9.0/10 |
-| 288-1 | 971 | 47.6% | 10.0/10 |
-| 105-1 | 775 | 57.4% | 9.2/10 |
-| 105-2 | 775 | 57.4% | 9.2/10 |
-| 55-1 | 772 | 55.3% | 9.6/10 |
-
-### **Most complex buildings (complexity score = 10; 4 buildings):**
-
-| Building | Tasks | Area diversity | Cleaning ratio |
-|:---:|:------------:|:--------------:|:--------------:|
-| 101-1 | 441 | 10/10 | 58.5% |
-| 114-1 | 593 | 10/10 | 56.3% |
-| 288-1 | 971 | 10/10 | 47.6% |
-| 81-1 | 159 | 10/10 | 51.6% |
-
---
-
-## 📈 DETAILED STATISTICS
-
-### **Task distribution:**
-```
-Routine daily cleaning: 55.9% (dominant)
-On-duty/inspection:     16.1% (significant)
-Trash collection:        7.9% (important)
-Deep cleaning:           4.5%
-Support:                 5.8%
-Other:                  ~10%
-```
-
-### **Most important areas:**
-```
-1. Restrooms (WC):      20.4% (highest)
-2. Hallways:            13.7%
-3. Elevators:           10.6%
-4. Lobby:                7.6%
-5. Offices:              4.4%
-```
-
-### **Complexity:**
-- **Mean:** 4.47/10 (medium complexity)
-- **Max:** 10.0/10 (4 buildings)
-- **Min:** 0.0/10 (4 buildings with missing data)
-- **75th percentile:** 7.0/10
-
---
-
-## 💡 KEY INSIGHTS
-
-### **1. Routine daily cleaning is dominant (55.9%)**
-→ The prediction model should weight this feature heavily
-
-### **2. "Trực phát sinh" (ad hoc on-call duty) is very common (16.1%)**
-→ Characteristic of hospitals, so it needs its own feature
-→ Different from ordinary "inspection" tasks
-
-### **3. Trash collection is an important task (7.9%)**
-→ It cannot be merged into cleaning
-→ It may affect headcount differently
-
-### **4. Restrooms account for 20% of tasks**
-→ The most important feature in the area coverage group
-→ High correlation with headcount
-
-### **5. HEALTHCARE buildings have their own characteristics**
-→ Operating-room cleanup, delivery-room cleanup, patient rooms
-→ They need dedicated features despite their low share
-
-### **6. Complexity score varies widely (0-10)**
-→ Helps separate simple buildings from complex ones
-→ Combines several factors: text length, tasks, areas, technical keywords
-
---
-
-## 🔍 VALIDATION & QUALITY CHECK
-
-### **✅ Data Quality:**
-- [x] 302/302 buildings processed successfully (100%)
-- [x] No missing values in the output
-- [x] All features have valid values
-- [x] Ratios fall within [0, 1]
-- [x] Area diversity falls within [0, 10]
-- [x] Complexity score falls within [0, 10]
-
-### **✅ Logical Consistency:**
-- [x] num_tasks is consistent with the per-type counts (types may overlap, so the per-type counts can sum to more than num_tasks)
-- [x] Ratios follow the intended formulas
-- [x] Area diversity = number of areas with tasks > 0
-- [x] Complexity score increases with: length, tasks, technical, diversity
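
The consistency checks above can be illustrated with a small sketch. It assumes the per-building counts are available as a plain dictionary keyed by the feature names from the tables; the helper names `safe_ratio` and `derive_ratio_features` are hypothetical and are not taken from `extract_25_features.py`, whose actual formulas are not shown in this report.

```python
from typing import Dict

# Area-coverage columns as listed in Group 2 of the feature tables.
AREA_COLUMNS = [
    "num_wc_tasks", "num_hallway_tasks", "num_lobby_tasks",
    "num_patient_room_tasks", "num_clinic_room_tasks",
    "num_surgery_room_tasks", "num_outdoor_tasks", "num_elevator_tasks",
    "num_office_tasks", "num_technical_room_tasks",
]


def safe_ratio(part: float, total: float) -> float:
    """Return part / total, or 0.0 when there are no tasks at all."""
    return part / total if total > 0 else 0.0


def derive_ratio_features(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive ratio and diversity features from raw task counts (assumed formulas)."""
    total = counts.get("num_tasks", 0)
    return {
        "cleaning_ratio": safe_ratio(counts.get("num_cleaning_tasks", 0), total),
        "trash_collection_ratio": safe_ratio(
            counts.get("num_trash_collection_tasks", 0), total
        ),
        "monitoring_ratio": safe_ratio(counts.get("num_monitoring_tasks", 0), total),
        "room_cleaning_ratio": safe_ratio(
            counts.get("num_room_cleaning_tasks", 0), total
        ),
        # Number of areas that have at least one task, so the value is in [0, 10].
        "area_diversity": sum(1 for col in AREA_COLUMNS if counts.get(col, 0) > 0),
    }


# Illustrative counts only, not taken from the dataset.
example = {
    "num_tasks": 10,
    "num_cleaning_tasks": 6,
    "num_trash_collection_tasks": 2,
    "num_wc_tasks": 3,
    "num_lobby_tasks": 1,
}
features = derive_ratio_features(example)
```

Guarding the division keeps the ratios at 0.0 for buildings without task data, which would be one way to explain the four buildings reported with a minimum score of 0.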
-
-### **✅ Sample Validation:**
-Building 101-1 (test case):
-- ✅ num_tasks = 441 (plausible for a large hospital)
-- ✅ cleaning_ratio = 58.5% (in line with the overall distribution)
-- ✅ area_diversity = 10/10 (the hospital covers every area type)
-- ✅ complexity_score = 10.0 (among the most complex buildings)
-
---
-
-## 📂 FILE STRUCTURE
-
-```
-Predict_calamviecHM/
-├── ket_qua_cong_viec_full.xlsx        [INPUT] 302 buildings × 3 columns
-├── extract_25_features.py             [CODE] Main extraction script
-├── analyze_task_keywords.py           [CODE] Analysis script
-├── features_25_keywords.csv           [OUTPUT] 302 buildings × 26 columns ✅
-├── RA_SOAT_PHAN_LOAI_FEATURES.md     [DOC] Detailed analysis
-├── TASK_FEATURES.ipynb                [NOTEBOOK] Interactive analysis ✅
-└── COMPLETE_25_FEATURES.md            [DOC] This report
-```
-
---
-
-## 🎯 NEXT STEPS
-
-### **✅ COMPLETED (Phase 1):**
-- [x] Analyze 30,917 tasks from 302 buildings
-- [x] Design 25 features based on the real data
-- [x] Implement extraction function
-- [x] Test and validate
-- [x] Run the batch over 302 buildings
-- [x] Generate output CSV
-- [x] Complete the documentation
-
-### **⏳ TODO (Phase 2): TF-IDF Features**
-- [ ] Implement TfidfVectorizer (500 features)
-- [ ] Apply SVD dimensionality reduction (500 → 10)
-- [ ] Validate 10 TF-IDF features
-- [ ] Save to `features_10_tfidf.csv`
-
-### **⏳ TODO (Phase 3): Integration**
-- [ ] Join keyword + TF-IDF features (35 features)
-- [ ] Join with `Du_Lieu_Toa_Nha_Aggregate.xlsx` (18 features)
-- [ ] Add shift features (3-4 features)
-- [ ] Final dataset: ~56 features
-
-### **⏳ TODO (Phase 4): Modeling**
-- [ ] Obtain target variable (`so_nhan_su`)
-- [ ] Expand to per-shift samples (×3)
-- [ ] Train/validation/test split
-- [ ] Train models (RF, GB, XGB, LGBM)
-- [ ] Hyperparameter tuning
-- [ ] Model evaluation & comparison
-
---
-
-## 📊 EXPECTED FINAL DATASET
-
-```
-FEATURES BREAKDOWN:
-├── 25 Keyword features (from tasks text)      ✅ DONE
-├── 10 TF-IDF features (from tasks text)       ⏳ TODO
-├── 18 Building features (from aggregate)      ⏳ TODO
-└── 3-4 Shift features (time-based)            ⏳ TODO
-    ─────────────────────────────────────────
-    TOTAL: ~56 features
-
-SAMPLES:
-├── 302 buildings
-├── × 3 shifts/day (morning, afternoon, evening)
-    ─────────────────────────────────────────
-    TOTAL: ~900 samples
-
-TARGET:
-└── so_nhan_su (headcount per shift)          ⚠️ STILL TO BE COLLECTED
-```
-
---
-
-## 🎉 ACHIEVEMENTS
-
-✅ **Analysis of real data** - Not based on assumptions  
-✅ **Well-matched feature design** - Based on observed frequency  
-✅ **HEALTHCARE characteristics separated out** - Operating rooms, delivery-room cleanup, patient rooms  
-✅ **Complexity score added** - Measures complexity along several dimensions  
-✅ **High code quality** - Type hints, docstrings, error handling  
-✅ **Complete documentation** - Markdown reports, notebook cells  
-✅ **Reproducible** - The script can be rerun at any time  
-
---
-
-## 📞 SUMMARY FOR USER
-
-**🎯 Completed:**
-- ✅ Reclassified features based on an analysis of 30,917 real tasks
-- ✅ Increased from 15 → 25 features (10 important features added)
-- ✅ Dropped rarely occurring features and added common ones (trash collection, ad hoc on-call duty)
-- ✅ Separated out HEALTHCARE areas (patient rooms, operating rooms, clinic rooms)
-- ✅ Added task_complexity_score to measure complexity
-- ✅ Implemented and tested successfully on 302 buildings
-- ✅ Produced the output file `features_25_keywords.csv` (302 × 26)
-
-**🎯 Key files:**
-1. `extract_25_features.py` - Extraction script
-2. `features_25_keywords.csv` - Output data
-3. `RA_SOAT_PHAN_LOAI_FEATURES.md` - Analysis report
-4. `TASK_FEATURES.ipynb` - Interactive notebook
-
-**🎯 Next steps:**
-You can:
-- **A)** Continue with the TF-IDF features (10 features)
-- **B)** Join with the building features right away
-- **C)** Analyze correlations among the 25 features
-- **D)** Visualize the feature distributions
-
-What would you like to do next? 😊
-
----
-
-_Completed: January 5, 2026_  
-_Status: ✅ PHASE 1 COMPLETED - READY FOR PHASE 2_
--- a/FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
+++ b/FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
--- a/ML_PIPELINE_PLAN.md
+++ b/ML_PIPELINE_PLAN.md
@@ -1,647 +0,0 @@
-# 📋 DATA PREPROCESSING & MODEL TRAINING PLAN
-
-**Date:** January 5, 2026  
-**Target Variable:** `so_luong` (headcount)  
-**Dataset:** FINAL_DATASET_WITH_TEXT.xlsx (454 rows × 51 columns)
-
---
-
-## 🎯 OBJECTIVE
-
-Predict the headcount required for each work shift based on:
-- ✅ Task features (25 features derived from text)
-- ✅ Shift features (5 features describing the shift)
-- ✅ Building features (17 features describing the building)
-- ⚪ Text columns (2 columns) - **KEPT FOR LATER**
-
-**Total features used:** 47 features (excluding the 2 text columns and ma_dia_diem)
-
---
-
-## 📊 ANALYSIS OF THE CURRENT DATASET
-
-### Basic information:
-- **Total samples:** 454 shifts
-- **Target:** `so_luong` (0-64, mean=4.64, median=4.0)
-- **Features:** 47 numeric + categorical
-- **Missing values:** To be checked
-
-### Feature breakdown:
-
-#### 1. Shift Features (5):
-- `loai_ca` - Categorical (Part time, Ca sáng (morning), Ca chiều (afternoon), Hành chính (office hours), etc.)
-- `bat_dau` - Time (needs parsing)
-- `ket_thuc` - Time (needs parsing)
-- `tong_gio_lam` - Numeric/Time (needs parsing)
-- `so_ca_cua_toa` - Numeric (1-41)
-
-#### 2. Task Features (25):
-- Counts (9): `num_tasks`, `num_cleaning_tasks`, etc.
-- Areas (10): `num_wc_tasks`, `num_hallway_tasks`, etc.
-- Ratios (4): `cleaning_ratio`, `trash_collection_ratio`, etc.
-- Complexity (2): `area_diversity`, `task_complexity_score`
-
-#### 3. Building Features (17):
-- Categorical (3): `loai_hinh`, `ten_toa_thap`, `muc_do_luu_luong`
-- Numeric (14): `so_tang`, `dien_tich_*`, binary features
-
---
-
-## 🔧 PHASE 1: DATA PREPROCESSING
-
-### Step 1.1: Exploratory Data Analysis (EDA)
-
-**Script:** `01_eda_analysis.py`
-
-```python
-Tasks:
-1. Load the dataset and check shape and dtypes
-2. Analyze missing values
-3. Descriptive statistics (describe) for all features
-4. Analyze the target variable (so_luong):
-   - Distribution (histogram, boxplot)
-   - Outliers detection
-   - Skewness and kurtosis
-5. Correlation matrix with the target
-6. Identify zero-variance features
-7. Save EDA report
-```
-
-**Output:**
-- `EDA_REPORT.md` - Detailed report
-- `eda_plots/` - Analysis plots
-
-**Estimated time:** ~30 minutes
-
---
-
-### Step 1.2: Data Cleaning
-
-**Script:** `02_data_cleaning.py`
-
-```python
-Tasks:
-1. Handle missing values:
-   - Task features: Fill 0 (the row has no such task)
-   - Building features: Fill median or mode
-   - Shift features: Handling to be decided
-   
-2. Handle outliers in the target:
-   - Check so_luong = 0 (16 shifts)
-   - Decide whether to keep/remove/cap
-   
-3. Remove duplicate rows (if any)
-
-4. Drop unneeded features:
-   - ma_dia_diem (identifier, not used for training)
-   - all_task_normal (kept for later)
-   - all_task_dinhky (kept for later)
-   - ten_toa_thap (possibly redundant with ma_dia_diem)
-   
-5. Validate data quality after cleaning
-```
-
-**Output:**
-- `CLEANED_DATA.csv` - Dataset after cleaning
-- `CLEANING_REPORT.md` - Detailed report
-
-**Estimated time:** ~20 minutes
-
---
-
-### Step 1.3: Feature Engineering
-
-**Script:** `03_feature_engineering.py`
-
-```python
-Tasks:
-
-A. TIME FEATURES (from bat_dau, ket_thuc, tong_gio_lam):
-   - hour_start (start hour: 0-23)
-   - hour_end (end hour: 0-23)
-   - work_hours_numeric (hours worked: float)
-   - is_morning_shift (6-12h: 1/0)
-   - is_afternoon_shift (12-18h: 1/0)
-   - is_evening_shift (18-24h: 1/0)
-   - is_night_shift (0-6h: 1/0)
-   - is_cross_day (overnight shift: 1/0)
-
-B. INTERACTION FEATURES:
-   - tasks_per_hour = num_tasks / work_hours_numeric
-   - tasks_per_floor = num_tasks / so_tang
-   - wc_per_floor = num_wc_tasks / so_tang
-   - cleaning_workload = num_cleaning_tasks * area_diversity
-   
-C. AGGREGATION FEATURES:
-   - total_area = sum(dien_tich_*)
-   - area_per_floor = total_area / so_tang
-   - has_special_areas = (num_patient_room_tasks + 
-                           num_surgery_room_tasks + 
-                           num_clinic_room_tasks) > 0
-
-D. CATEGORICAL ENCODING:
-   - loai_ca: One-hot encoding
-   - loai_hinh: Label encoding (ordinal)
-   - muc_do_luu_luong: Label encoding (ordinal)
-   
-E. FEATURE SELECTION (optional):
-   - Remove zero-variance features
-   - Remove highly correlated features (>0.95)
-```
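
The time features in group A can be sketched as follows. This is a minimal illustration, assuming `bat_dau` and `ket_thuc` arrive as `HH:MM:SS` strings, that the shift-period flags are assigned from the start hour using half-open intervals, and that a shift whose end time is earlier than its start time crosses midnight. The function name `shift_time_features` is hypothetical; the plan itself does not fix these details.

```python
from datetime import datetime, timedelta
from typing import Dict, Union

Number = Union[int, float]


def shift_time_features(bat_dau: str, ket_thuc: str) -> Dict[str, Number]:
    """Derive hour, duration, and shift-period flags from two HH:MM:SS strings."""
    start = datetime.strptime(bat_dau, "%H:%M:%S")
    end = datetime.strptime(ket_thuc, "%H:%M:%S")

    # Assumption: an end time earlier than the start means the shift runs overnight.
    is_cross_day = int(end < start)
    if is_cross_day:
        end += timedelta(days=1)

    hour = start.hour
    return {
        "hour_start": hour,
        "hour_end": end.hour,
        "work_hours_numeric": (end - start).total_seconds() / 3600,
        "is_morning_shift": int(6 <= hour < 12),
        "is_afternoon_shift": int(12 <= hour < 18),
        "is_evening_shift": int(18 <= hour < 24),
        "is_night_shift": int(0 <= hour < 6),
        "is_cross_day": is_cross_day,
    }


day_shift = shift_time_features("07:00:00", "15:00:00")
overnight_shift = shift_time_features("22:00:00", "06:00:00")
```

Deriving `work_hours_numeric` from the two timestamps also gives a cross-check against the raw `tong_gio_lam` column when both are present.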
-
-**Output:**
-- `ENGINEERED_DATA.csv` - Dataset with the new features
-- `FEATURE_ENGINEERING_REPORT.md` - Report
-
-**Estimated time:** ~40 minutes
-
---
-
-### Step 1.4: Feature Scaling
-
-**Script:** `04_feature_scaling.py`
-
-```python
-Tasks:
-1. Separate features by type:
-   - Numeric features that need scaling
-   - Binary features (0/1) - no scaling needed
-   - One-hot encoded features - no scaling needed
-
-2. Apply scaling methods:
-   Option A: StandardScaler (mean=0, std=1) - Recommended
-   Option B: MinMaxScaler (0-1)
-   Option C: RobustScaler (robust to outliers)
-
-3. Save the scaler object for use at inference time
-
-4. Validate scaling: check mean and std after scaling
-```
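
The standardization in Option A can be sketched without any third-party dependency. This is an illustration only, assuming rows are plain dictionaries and using hypothetical helper names (`fit_standardizer`, `apply_standardizer`); a real run would more likely use scikit-learn's `StandardScaler`, which applies the same formula with the population standard deviation. Fitting the parameters on training rows only, then reusing them for validation and test rows, is a common precaution against leakage.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]
Params = Dict[str, Tuple[float, float]]


def fit_standardizer(rows: Sequence[Row], numeric_columns: Sequence[str]) -> Params:
    """Compute (mean, std) per numeric column; constant columns get std = 1.0."""
    params: Params = {}
    for column in numeric_columns:
        values = [row[column] for row in rows]
        std = pstdev(values)  # population std, the convention StandardScaler uses
        params[column] = (fmean(values), std if std > 0 else 1.0)
    return params


def apply_standardizer(rows: Sequence[Row], params: Params) -> List[Row]:
    """Scale the fitted columns; binary and one-hot columns pass through unchanged."""
    return [
        {
            key: (value - params[key][0]) / params[key][1] if key in params else value
            for key, value in row.items()
        }
        for row in rows
    ]


# Illustrative rows only: so_tang is numeric, doc_ham is a binary flag.
train_rows = [{"so_tang": 2.0, "doc_ham": 1}, {"so_tang": 4.0, "doc_ham": 0}]
params = fit_standardizer(train_rows, ["so_tang"])
scaled = apply_standardizer(train_rows, params)
```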
-
-**Output:**
-- `SCALED_DATA.csv` - Scaled dataset
-- `scaler.pkl` - Scaler object
-- `SCALING_REPORT.md` - Report
-
-**Estimated time:** ~15 minutes
-
---
-
-### Step 1.5: Train/Validation/Test Split
-
-**Script:** `05_train_test_split.py`
-
-```python
-Tasks:
-1. Split strategy:
-   - Train: 70% (318 samples)
-   - Validation: 15% (68 samples)
-   - Test: 15% (68 samples)
-
-2. Stratified split (if needed):
-   - Stratify by loai_ca or binned so_luong
-   
-3. Random state = 42 (reproducibility)
-
-4. Validate split:
-   - Check the distribution of so_luong in each set
-   - Check no data leakage
-
-5. Save splits
-```
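
The split sizes above follow directly from the 454 samples, as this small sketch shows. It is a plain seeded shuffle, not the stratified variant mentioned in step 2, and the function name `split_indices` is hypothetical.

```python
import random
from typing import List, Tuple


def split_indices(
    n_samples: int,
    train_frac: float = 0.70,
    val_frac: float = 0.15,
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices reproducibly and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(454)
```

Because several shifts can belong to the same building, splitting by `ma_dia_diem` groups rather than by row may be worth considering when checking for leakage; that is a suggestion, not something the plan specifies.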
-
-**Output:**
- `train.csv` (318 rows)
- `val.csv` (68 rows)
- `test.csv` (68 rows)
- `SPLIT_REPORT.md`
-
-**Estimated time:** ~10 minutes
-
---
-
-## 🤖 PHASE 2: MODEL TRAINING
-
-### Step 2.1: Baseline Model
-
-**Script:** `06_baseline_model.py`
-
-```python
-Purpose: Build a simple baseline for comparison
-
-Models:
-1. Mean Baseline: Predict mean(so_luong) for every sample
-2. Linear Regression: Simple linear model
-3. Decision Tree (max_depth=5): Simple tree
-
-Metrics:
-- MAE (Mean Absolute Error)
-- RMSE (Root Mean Squared Error)
-- R² Score
-- MAPE (Mean Absolute Percentage Error)
-
-Output:
-- Baseline scores
-- Simple visualizations
-```
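
The four metrics can be sketched in plain Python. One detail matters for this dataset: the plan notes 16 shifts with `so_luong = 0`, and MAPE is undefined for a zero target. The sketch below handles that by computing MAPE over non-zero targets only, which is one possible convention and an assumption here, not a decision taken in the plan. The function name `regression_metrics` is hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute MAE, RMSE, R² and MAPE (MAPE skips rows whose target is 0)."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # Assumption: rows with a zero target are excluded from MAPE.
    nonzero = [(true, pred) for true, pred in zip(y_true, y_pred) if true != 0]
    mape = (
        100 * sum(abs((pred - true) / true) for true, pred in nonzero) / len(nonzero)
        if nonzero
        else float("nan")
    )
    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


# Mean baseline on illustrative targets: every prediction equals the mean (3.0).
targets = [2.0, 4.0, 6.0, 0.0]
baseline_pred = [sum(targets) / len(targets)] * len(targets)
scores = regression_metrics(targets, baseline_pred)
```

By construction the mean baseline scores R² = 0 on the data it was fitted to, which makes it a useful floor for the other models.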
-
-**Output:**
- `BASELINE_RESULTS.md`
- `baseline_models/` - Saved models
-
-**Estimated time:** ~15 minutes
-
---
-
-### Step 2.2: Advanced Models Training
-
-**Script:** `07_train_models.py`
-
-```python
-Models to train:
-
-1. Random Forest Regressor
-   - n_estimators: [100, 200, 300]
-   - max_depth: [10, 20, 30, None]
-   - min_samples_split: [2, 5, 10]
-   
-2. Gradient Boosting Regressor
-   - n_estimators: [100, 200]
-   - learning_rate: [0.01, 0.05, 0.1]
-   - max_depth: [3, 5, 7]
-   
-3. XGBoost Regressor
-   - n_estimators: [100, 200]
-   - learning_rate: [0.01, 0.05, 0.1]
-   - max_depth: [3, 5, 7]
-   - subsample: [0.8, 1.0]
-   
-4. LightGBM Regressor
-   - n_estimators: [100, 200]
-   - learning_rate: [0.01, 0.05, 0.1]
-   - num_leaves: [31, 50, 70]
-
-Training approach:
-- Cross-validation (5-fold) on the train set
-- Hyperparameter tuning with GridSearchCV/RandomizedSearchCV
-- Evaluate on the validation set
-- Save best models
-```
-
-**Output:**
-- `trained_models/` - Saved models
-- `MODEL_TRAINING_REPORT.md`
-- `training_logs/` - Training logs
-
-**Estimated time:** ~1-2 hours (depending on hyperparameter tuning)
-
---
-
-### Step 2.3: Model Evaluation
-
-**Script:** `08_model_evaluation.py`
-
-```python
-Tasks:
-
-1. Load all trained models
-
-2. Evaluate on the validation set:
-   - MAE, RMSE, R², MAPE
-   - Residual plots
-   - Actual vs Predicted plots
-   
-3. Compare models:
-   - Performance comparison table
-   - Visual comparison (bar charts)
-   
-4. Analyze errors:
-   - Where models fail (high errors)
-   - Error distribution by loai_ca, so_tang, etc.
-   
-5. Feature importance:
-   - Top 20 important features
-   - SHAP values (if possible)
-```
-
-**Output:**
-- `EVALUATION_REPORT.md`
-- `evaluation_plots/`
-- `feature_importance.csv`
-
-**Estimated time:** ~30 minutes
-
---
-
-### Step 2.4: Final Model Selection & Test
-
-**Script:** `09_final_evaluation.py`
-
-```python
-Tasks:
-
-1. Select best model based on validation performance
-
-2. Re-train best model on train+val combined (optional)
-
-3. Final evaluation on test set:
-   - MAE, RMSE, R², MAPE
-   - Confidence intervals
-   - Error analysis
-   
-4. Create production-ready pipeline:
-   - Preprocessing + Model
-   - Save as single pickle file
-   
-5. Generate final report
-```
-
-**Output:**
-- `FINAL_MODEL.pkl` - Production model
-- `FINAL_EVALUATION_REPORT.md`
-- `final_plots/`
-
-**Estimated time:** ~20 minutes
-
---
-
-## 📊 PHASE 3: ANALYSIS & INSIGHTS
-
-### Step 3.1: Feature Importance Analysis
-
-**Script:** `10_feature_analysis.py`
-
-```python
-Tasks:
-1. Feature importance from best model
-2. SHAP values analysis (detailed)
-3. Partial dependence plots
-4. Feature interactions
-5. Recommendations for feature engineering v2
-```
-
-**Output:**
-- `FEATURE_IMPORTANCE_REPORT.md`
-- `shap_plots/`
-
-**Estimated time:** ~30 minutes
-
---
-
-### Step 3.2: Business Insights
-
-**Script:** `11_business_insights.py`
-
-```python
-Tasks:
-1. Analysis by shift type:
-   - Which shifts need the most staff?
-   - Important features for each shift type
-   
-2. Analysis by building type:
-   - Which building type is the most complex?
-   - Correlation with headcount
-   
-3. Task features impact:
-   - Which tasks have the largest effect?
-   - Recommendations for task planning
-   
-4. Optimization opportunities:
-   - Ways to reduce headcount while staying effective
-   - Resource allocation recommendations
-```
-
-**Output:**
-- `BUSINESS_INSIGHTS_REPORT.md`
-- `insights_plots/`
-
-**Estimated time:** ~30 minutes
-
---
-
-## 📁 FOLDER STRUCTURE
-
-```
-Predict_calamviecHM/
-├── data/
-│   ├── raw/
-│   │   └── FINAL_DATASET_WITH_TEXT.xlsx
-│   ├── cleaned/
-│   │   ├── CLEANED_DATA.csv
-│   │   ├── ENGINEERED_DATA.csv
-│   │   └── SCALED_DATA.csv
-│   └── splits/
-│       ├── train.csv
-│       ├── val.csv
-│       └── test.csv
-│
-├── models/
-│   ├── baseline_models/
-│   ├── trained_models/
-│   │   ├── random_forest.pkl
-│   │   ├── xgboost.pkl
-│   │   ├── lightgbm.pkl
-│   │   └── gradient_boosting.pkl
-│   ├── scaler.pkl
-│   └── FINAL_MODEL.pkl
-│
-├── scripts/
-│   ├── 01_eda_analysis.py
-│   ├── 02_data_cleaning.py
-│   ├── 03_feature_engineering.py
-│   ├── 04_feature_scaling.py
-│   ├── 05_train_test_split.py
-│   ├── 06_baseline_model.py
-│   ├── 07_train_models.py
-│   ├── 08_model_evaluation.py
-│   ├── 09_final_evaluation.py
-│   ├── 10_feature_analysis.py
-│   └── 11_business_insights.py
-│
-├── reports/
-│   ├── EDA_REPORT.md
-│   ├── CLEANING_REPORT.md
-│   ├── FEATURE_ENGINEERING_REPORT.md
-│   ├── MODEL_TRAINING_REPORT.md
-│   ├── EVALUATION_REPORT.md
-│   ├── FINAL_EVALUATION_REPORT.md
-│   ├── FEATURE_IMPORTANCE_REPORT.md
-│   └── BUSINESS_INSIGHTS_REPORT.md
-│
-├── plots/
-│   ├── eda_plots/
-│   ├── evaluation_plots/
-│   ├── final_plots/
-│   ├── shap_plots/
-│   └── insights_plots/
-│
-└── notebooks/
-    ├── EDA_Notebook.ipynb
-    └── Model_Comparison.ipynb
-```
-
---
-
-## 📊 EXPECTED PERFORMANCE
-
-### Target Metrics (Realistic):
-
-| Metric | Baseline | Target | Stretch |
-|--------|----------|--------|---------|
-| MAE | ~2.5 | <2.0 | <1.5 |
-| RMSE | ~3.5 | <3.0 | <2.5 |
-| R² | 0.30 | >0.60 | >0.75 |
-| MAPE | ~50% | <30% | <20% |
-
-**Explanation:**
-- Baseline: Simple mean/linear model
-- Target: A reasonable goal for this dataset
-- Stretch: The ideal goal (may be hard to reach)
-
---
-
-## ⏱️ TIMELINE OVERVIEW
-
-| Phase | Time | Status |
-|-------|------|--------|
-| **Phase 1: Preprocessing** | ~2 hours | 📋 Planned |
-| - EDA | 30 min | |
-| - Cleaning | 20 min | |
-| - Feature Engineering | 40 min | |
-| - Scaling | 15 min | |
-| - Split | 10 min | |
-| **Phase 2: Training** | ~2-3 hours | 📋 Planned |
-| - Baseline | 15 min | |
-| - Advanced Models | 1-2h | |
-| - Evaluation | 30 min | |
-| - Final Selection | 20 min | |
-| **Phase 3: Analysis** | ~1 hour | 📋 Planned |
-| - Feature Analysis | 30 min | |
-| - Business Insights | 30 min | |
-| **TOTAL** | **~5-6 hours** | 📋 Planned |
-
---
-
-## 🎯 FEATURES USED (47 features)
-
-### ✅ Used immediately:
-
-**Shift Features (5):**
-1. loai_ca (encoded)
-2. hour_start (engineered)
-3. hour_end (engineered)
-4. work_hours_numeric (engineered)
-5. so_ca_cua_toa
-
-**Task Features (25):**
-6-14. Task counts (9)
-15-24. Area coverage (10)
-25-28. Ratios (4)
-29-30. Diversity & Complexity (2)
-
-**Building Features (17):**
-31-33. Categorical (3 - encoded)
-34-47. Numeric (14)
-
-### ⚪ Kept for Phase 2 (later):
-- `all_task_normal` - Text column (TF-IDF/BERT)
-- `all_task_dinhky` - Text column (TF-IDF/BERT)
-
-### ❌ Not used:
-- `ma_dia_diem` - Identifier
-- `bat_dau` - Raw time (already engineered)
-- `ket_thuc` - Raw time (already engineered)
-- `tong_gio_lam` - Raw time (already engineered)
-
---
-
-## 🚀 WHERE TO START?
-
-### Option 1: Fully automated pipeline
-```python
-# Script: run_full_pipeline.py
-# Runs all steps 1-11 automatically
-```
-
-### Option 2: Step by step (Recommended)
-```bash
-# Start with EDA
-python scripts/01_eda_analysis.py
-# Then continue with 02, 03, ...
-```
-
-### Option 3: Interactive notebook
-```python
-# Use Jupyter Notebook
-# EDA_Notebook.ipynb for exploration and experimentation
-```
-
---
-
-## 📝 IMPORTANT NOTES
-
-### 1. Data Quality:
-- ⚠️ 27 rows have no task text (5.9%)
-- ⚠️ 16 shifts have so_luong = 0
-- ✅ 429/454 shifts have complete data (94.5%)
-
-### 2. Feature Engineering:
-- Time features are very important (the start hour has a large effect)
-- Interaction features may boost performance
-- The task complexity score is precomputed and can be used directly
-
-### 3. Model Selection:
-- Tree-based models (RF, XGB, LGBM) usually perform best on tabular data
-- Ensemble methods can combine several models
-- Consider model interpretability for the business
-
-### 4. Text Features (Phase 2):
-- After a baseline with the 47 features is in place
-- Add TF-IDF vectors from the text
-- Compare performance improvement
-- May raise R² by a further 5-10%
-
---
-
-## ✅ CHECKLIST BEFORE STARTING
-
-- [ ] Dataset ready: `FINAL_DATASET_WITH_TEXT.xlsx`
-- [ ] Python environment setup (pandas, sklearn, xgboost, lightgbm)
-- [ ] Create the folder structure
-- [ ] Back up the original data
-- [ ] Set random seed = 42 (reproducibility)
-- [ ] Prepare a notebook/IDE for coding
-
---
-
-## 🎯 SUCCESS CRITERIA
-
-### Minimum Viable Model:
-- ✅ MAE < 2.0
-- ✅ R² > 0.60
-- ✅ The model is explainable (feature importance)
-- ✅ Reproducible (scripts + saved models available)
-
-### Stretch Goals:
-- 🎯 MAE < 1.5
-- 🎯 R² > 0.75
-- 🎯 Complete SHAP analysis
-- 🎯 Actionable business insights
-
---
-
-**Plan created by:** GitHub Copilot  
-**Date:** January 5, 2026  
-**Status:** 📋 READY TO START  
-**Next Step:** Start with `01_eda_analysis.py`
-
----
-
-## ❓ WHICH STEP DO YOU WANT TO START WITH?
-
-1. **Create all scripts** (11 files) - Recommended
-2. **Create only the EDA script** to explore the data first
-3. **Create a full pipeline script** that runs in one go
-4. **Create a Jupyter Notebook** for interactive work
-
-Let me know! 🚀
--- a/ML_Phase2_Text_Features.ipynb
+++ b/ML_Phase2_Text_Features.ipynb
--- a/ML_Training_Pipeline.ipynb
+++ b/ML_Training_Pipeline.ipynb
--- a/ML_Training_Pipeline_Complete.ipynb
+++ b/ML_Training_Pipeline_Complete.ipynb
--- a/README.md
+++ b/README.md
@@ -0,0 +1,491 @@
+# 🏢 Shift Staffing Prediction
+
+> **A Machine Learning system that predicts the headcount required for each work shift from building and task characteristics**
+
+---
+
+## 📋 Table of Contents
+
+- [Overview](#-tổng-quan)
+- [Data Structure](#-cấu-trúc-dữ-liệu)
+- [Main Files](#-các-file-chính)
+- [Prediction Pipeline](#-pipeline-dự-đoán)
+- [Usage Guide](#-hướng-dẫn-sử-dụng)
+
+---
+
+## 🎯 Tổng Quan
+
+This project uses Machine Learning to automatically predict the headcount required for each work shift in a building, based on:
+- Physical characteristics of the building (floor area, number of floors, number of restrooms, etc.)
+- Task types and task descriptions (text features)
+- Shift information (morning, afternoon, evening, night)
+
+---
+
+## 📊 Cấu Trúc Dữ Liệu
+
+### 🗂️ Raw Data (`data_raw/`)
+
+#### 1. **`Link LLV 1_5_2025.json`**
+📌 **Purpose:** Holds the work shift information
+
+**Contents:**
+- List of work shifts for each building
+- Working hours (morning, afternoon, evening, night shifts)
+- Regular tasks (`all_task_normal`)
+- Periodic tasks (`all_task_dinhky`)
+- Actual headcount for each shift
+
+**Role:** Provides the **shift schedule** and **task descriptions** for the model
+
+---
+
+#### 2. **`Link LLV 2025_clean.json`**
+📌 **Purpose:** Holds the physical characteristics of the buildings
+
+**Contents:**
+- Floor area of each floor in every building
+- Number of restrooms, rooms, etc.
+- Infrastructure information
+
+**⚠️ IMPORTANT NOTE:**
+> **The area fields in `final_2.csv` are the TOTAL area across ALL floors of each building**, not an average and not the area of a single floor!
+> 
+> Example: If a building has 10 floors of 500m² each, then `tong_dien_tich = 5000m²`
+
+---
+
+### 📈 Training Data
+
+#### **`final_2.csv`** (or `final_2.xlsx`)
+📌 **Purpose:** The main dataset used to train the model
+
+**Data sources:**
+- **Shift information** ← from `data_raw/Link LLV 1_5_2025.json`
+- **Building area information** ← from `data_raw/Link LLV 2025_clean.json` (TOTAL area across floors)
+
+**Main feature groups:**
+
+| Feature group | Source | Examples |
+|---------------|-------|-------|
+| **Shift information** | `Link LLV 1_5_2025.json` | `ca_sang`, `ca_chieu`, `ca_toi`, `ca_dem` |
+| **Area** | `Link LLV 2025_clean.json` | `tong_dien_tich`, `dien_tich_wc`, `dien_tich_hanh_lang` |
+| **Facilities** | `Link LLV 2025_clean.json` | `so_tang`, `so_wc`, `so_phong` |
+| **Text features** | Derived from tasks | `num_cleaning_tasks`, `cleaning_ratio`, `area_diversity` |
+| **Target** | `Link LLV 1_5_2025.json` | `so_nhan_su` (actual headcount) |
+
+---
+
+## 📁 Các File Chính
+
+### 1. 📓 **`All_feature_Readme.ipynb`**
+> **Notebook that explains the text features in detail**
+
+**Contents:**
+- 📖 Explanation of each text feature extracted from the task descriptions
+- 🔍 The keywords used to recognize each kind of task
+- 📊 Worked examples for each feature
+- 📈 Analysis of feature importance
+
+**When to read it:** When you want to understand how the system extracts information from the task text
+
+---
+
+### 2. 🤖 **`artifacts/`**
+> **Directory containing the trained models**
+
+**Files in the directory:**
+
+```
+artifacts/
+├── extratrees_log1p.joblib           # Main model (Extra Trees with log transform)
+├── extratrees_staff_model.joblib     # Staffing prediction model
+├── model_meta.joblib                 # Model metadata (params, metrics)
+└── X_proc_columns.joblib             # Feature list and column order
+```
+
+**Format:** Joblib (pickle-based serialization, optionally compressed)
+
+**How to load:**
+```python
+import joblib
+model = joblib.load('artifacts/extratrees_log1p.joblib')
+```
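
The file name `extratrees_log1p.joblib` suggests that the model is trained on a `log1p`-transformed target. If that is the case, raw predictions live on the log scale and have to be mapped back with `expm1` before being read as a headcount. The sketch below shows only that round trip; it is an assumption about the training setup, and the helper names `to_log_target` and `from_log_prediction` are hypothetical rather than part of this repository.

```python
import math


def to_log_target(staff_count: float) -> float:
    """Transform a headcount to the log scale; log1p handles a count of 0."""
    return math.log1p(staff_count)


def from_log_prediction(predicted_log: float) -> float:
    """Map a log-scale prediction back to a headcount, clipped at zero."""
    return max(0.0, math.expm1(predicted_log))


# Round trip on illustrative headcounts.
recovered = [from_log_prediction(to_log_target(count)) for count in (0, 4, 64)]
```

Whether `all_predict.py` already applies this inverse step is not stated here, so it is worth checking before interpreting raw model output.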
+
+---
+
+### 3. 📄 **`input.json`**
+> **Sample input file for predicting ONE shift of ONE building**
+
+**Structure:**
+```json
+{
+  "ma_dia_diem": "TD-001",
+  "loai_ca": "Hành chính",
+  "bat_dau": "07:00:00",
+  "ket_thuc": "15:00:00",
+  "tong_gio_lam": 8.0,
+  "so_ca_cua_toa": 3,
+  "all_task_normal": "Lau sàn hành lang; Thu gom rác WC; Vệ sinh sảnh chính; Lau kính thang máy",
+  "all_task_dinhky": "Cọ bồn cầu WC tầng 2; Đánh sàn lobby; Trực phát sinh",
+  "so_tang": 12,
+  "so_cua_thang_may": 4,
+  "dien_tich_ngoai_canh": 350.0,
+  "dien_tich_sanh": 220.0,
+  "dien_tich_hanh_lang": 1800.0,
+  "dien_tich_wc": 420.0,
+  "dien_tich_phong": 2600.0,
+  "dien_tich_tham": 800.0,
+  "dien_tich_kinh": 560.0,
+  "doc_ham": 1,
+  "vien_phan_quang": 0,
+  "op_tuong": 1,
+  "op_chan_tuong": 1,
+  "ranh_thoat_nuoc": 1
+}
+```
+
+#### 📝 Field-by-Field Explanation:
+
+**🏢 Building & shift information:**
+- `ma_dia_diem`: Building identifier
+- `loai_ca`: Shift type (Hành chính (office hours) / Ca sáng (morning) / Ca chiều (afternoon) / Ca tối (evening) / Ca đêm (night))
+- `bat_dau`, `ket_thuc`: Shift start and end times (HH:MM:SS)
+- `tong_gio_lam`: Total working hours in the shift
+- `so_ca_cua_toa`: Total number of shifts per day for the building
+
+**📋 Tasks - 2 important types:**
+
+> **⭐ `all_task_normal` - DAILY TASKS:**
+> - Tasks carried out **EVERY DAY** in this shift
+> - Examples: mopping floors, collecting trash, cleaning restrooms (required every day)
+> - **Frequency:** Done every day, in every shift
+> - **Format:** Tasks separated by semicolons (`;`)
+
+> **⭐ `all_task_dinhky` - PERIODIC TASKS (WEEKLY/MONTHLY):**
+> - Tasks carried out **WEEKLY** or **MONTHLY**
+> - Examples: scrubbing toilets (once a week), machine-scrubbing floors (once a month), ad hoc on-call duty
+> - **Frequency:** Not done daily, only on a schedule
+> - **Format:** Tasks separated by semicolons (`;`)
+
+**Differences:**
+| | `all_task_normal` | `all_task_dinhky` |
+|---|---|---|
+| **Frequency** | Daily | Weekly/monthly |
+| **Examples** | Mopping floors, collecting trash | Scrubbing, machine-scrubbing floors |
+| **Workload** | Light, repetitive work | Heavy work, done infrequently |
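
Since both task fields use semicolons as separators, turning them into task lists is a one-line operation. The sketch below is an illustration with a hypothetical helper name, `split_tasks`; it is not the code in `predict.py`. The Vietnamese strings are the sample values used in this README.

```python
from typing import List


def split_tasks(task_text: str) -> List[str]:
    """Split a semicolon-separated task string into trimmed, non-empty tasks."""
    return [part.strip() for part in task_text.split(";") if part.strip()]


daily_tasks = split_tasks("Lau sàn WC; Thu gom rác; Vệ sinh sảnh")
periodic_tasks = split_tasks("Cọ bồn cầu; Đánh sàn; Trực phát sinh")
total_tasks = len(daily_tasks) + len(periodic_tasks)
```

Dropping empty fragments means a trailing semicolon or an empty field does not inflate the task count.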
+
+**🏗️ Physical characteristics:**
+- `so_tang`: Total number of floors
+- `so_cua_thang_may`: Number of elevator doors
+
+**📏 Area (m²) - ⚠️ TOTAL ACROSS ALL FLOORS:**
+- `dien_tich_ngoai_canh`: Outdoor areas (yard, sidewalk)
+- `dien_tich_sanh`: Lobby
+- `dien_tich_hanh_lang`: Hallways, walkways
+- `dien_tich_wc`: Restrooms
+- `dien_tich_phong`: Rooms (offices, meeting rooms)
+- `dien_tich_tham`: Carpeted floor
+- `dien_tich_kinh`: Glass to be cleaned
+
+> **⚠️ NOTE:** Every area value is the TOTAL across ALL floors!
+> 
+> Example: A 12-floor building with 150m² of hallway per floor → `dien_tich_hanh_lang = 1800m²`
+
+**🔧 Surface characteristics (1=Yes, 0=No):**
+- `doc_ham`: Has a basement
+- `vien_phan_quang`: Has reflective edging
+- `op_tuong`: Has wall cladding
+- `op_chan_tuong`: Has skirting boards
+- `ranh_thoat_nuoc`: Has drainage channels
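
Before running a prediction it can help to sanity-check `input.json` against the field list above. The sketch below is a hedged example: the field names come from the sample file, while the function names `validate_input` and `load_and_validate` and the specific checks (required keys, 0/1 flags, non-negative areas) are assumptions made for illustration, not behavior of `all_predict.py`.

```python
import json
from typing import Any, Dict, List

# Field names as they appear in the sample input.json.
REQUIRED_KEYS = [
    "ma_dia_diem", "loai_ca", "bat_dau", "ket_thuc", "tong_gio_lam",
    "so_ca_cua_toa", "all_task_normal", "all_task_dinhky", "so_tang",
    "so_cua_thang_may", "dien_tich_ngoai_canh", "dien_tich_sanh",
    "dien_tich_hanh_lang", "dien_tich_wc", "dien_tich_phong",
    "dien_tich_tham", "dien_tich_kinh", "doc_ham", "vien_phan_quang",
    "op_tuong", "op_chan_tuong", "ranh_thoat_nuoc",
]
BINARY_KEYS = ["doc_ham", "vien_phan_quang", "op_tuong", "op_chan_tuong", "ranh_thoat_nuoc"]
AREA_KEYS = [key for key in REQUIRED_KEYS if key.startswith("dien_tich_")]


def validate_input(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in record]
    for key in BINARY_KEYS:
        if key in record and record[key] not in (0, 1):
            problems.append(f"{key} must be 0 or 1")
    for key in AREA_KEYS:
        value = record.get(key)
        if isinstance(value, (int, float)) and value < 0:
            problems.append(f"{key} must be non-negative")
    return problems


def load_and_validate(path: str) -> Dict[str, Any]:
    """Load a JSON input file and raise ValueError if any check fails."""
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    problems = validate_input(record)
    if problems:
        raise ValueError("; ".join(problems))
    return record
```

A typical call would be `load_and_validate("input.json")`, run before the record is handed to the feature extraction step.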
+
+---
+
+### 4. 🎯 **`all_predict.py`**
+> **Main script for running a prediction**
+
+**What it does:**
+1. 📖 Reads `input.json`
+2. 🔄 Extracts text features from the task descriptions
+3. 🏗️ Builds the full feature vector
+4. 🤖 Loads the model from `artifacts/`
+5. 📊 Predicts the required headcount
+6. 💾 Writes out the result
+
+**How to run:**
+```bash
+python all_predict.py
+```
+
+**Output:** The predicted headcount for the shift described in `input.json`
+
+---
+
+### 5. 🔧 **`predict.py`**
+> **Module containing the function that extracts text features from tasks**
+
+**Main function:**
+```python
+extract_text_features_to_json(task_normal: str, task_dinhky: str) -> str
+```
+
+**Input:**
+- `task_normal`: Text of the **DAILY** tasks (done every day)
+- `task_dinhky`: Text of the **PERIODIC (WEEKLY/MONTHLY)** tasks
+
+**Output:** JSON string containing 18 text features:
+- `task_counts` (7 features): Number of tasks of each type
+- `area_coverage` (7 features): Number of tasks per area
+- `ratios_and_diversity` (4 features): Ratios and diversity
+
+**Usage example:**
+```python
+from predict import extract_text_features_to_json
+
+# Daily tasks
+task_normal = "Lau sàn WC; Thu gom rác; Vệ sinh sảnh"
+
+# Periodic (weekly/monthly) tasks
+task_dinhky = "Cọ bồn cầu; Đánh sàn; Trực phát sinh"
+
+# Extract the features
+json_features = extract_text_features_to_json(task_normal, task_dinhky)
+print(json_features)
+```
+
+**Returned result:**
+```json
+{
+  "task_counts": {
+    "num_tasks": 6,
+    "num_cleaning_tasks": 3,
+    "num_trash_collection_tasks": 1,
+    "num_monitoring_tasks": 1,
+    "num_deep_cleaning_tasks": 2,
+    "num_support_tasks": 0,
+    "num_other_tasks": 0
+  },
+  "area_coverage": {
+    "num_wc_tasks": 2,
+    "num_hallway_tasks": 0,
+    "num_lobby_tasks": 1,
+    ...
+  },
+  ...
+}
+```
+
+---
+
+## 🔄 Prediction Pipeline
+
+```mermaid
+graph LR
+    A[input.json] --> B[all_predict.py]
+    B --> C[predict.py<br/>Text Features]
+    C --> D[Feature Vector]
+    D --> E[Model<br/>artifacts/]
+    E --> F[Prediction<br/>Headcount]
+```
+
+### Step-by-step details:
+
+1. **Input** 📥
+   - Read `input.json`, which describes one shift
+   - It includes both task types: `all_task_normal` (daily) and `all_task_dinhky` (periodic)
+   
+2. **Text Feature Extraction** 🔤
+   - `predict.py` extracts 18 features from the task text
+   - **Both task types are merged** (normal + dinhky) before analysis
+   - Task categories: cleaning, trash, monitoring, deep cleaning, support, other
+   - Area categories: WC, hallway, lobby, outdoor, elevator, medical, office
+   
+3. **Feature Engineering** 🏗️
+   - Combine text features with building features (areas, number of floors, etc.)
+   - Preprocess the data the same way as the training set (shift-time features, one-hot encoding, alignment to the training columns)
+   
+4. **Model Prediction** 🤖
+   - Load the trained model from `artifacts/`
+   - Predict the headcount from the full feature set
+   
+5. **Output** 📤
+   - Return the headcount required for that shift
+
+---
+
+## 🚀 Usage Guide
+
+### System requirements
+```bash
+pip install pandas numpy scikit-learn joblib openpyxl
+```
+
+### 1. Predict for ONE shift
+
+**Step 1:** Prepare `input.json`
+```json
+{
+  "ma_dia_diem": "TD-001",
+  "loai_ca": "Hành chính",
+  "bat_dau": "07:00:00",
+  "ket_thuc": "15:00:00",
+  "tong_gio_lam": 8.0,
+  "so_ca_cua_toa": 3,
+  "all_task_normal": "Lau sàn hành lang; Thu gom rác WC; Vệ sinh sảnh",
+  "all_task_dinhky": "Cọ bồn cầu WC; Đánh sàn lobby; Trực phát sinh",
+  "so_tang": 12,
+  "so_cua_thang_may": 4,
+  "dien_tich_hanh_lang": 1800.0,
+  "dien_tich_wc": 420.0,
+  ...
+}
+```
+
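+> **💡 Note on the two task fields:**
+> - `all_task_normal`: tasks done **EVERY DAY** (daily mopping, daily trash collection)
+> - `all_task_dinhky`: tasks done **WEEKLY/MONTHLY** (periodic scrubbing, periodic floor machine-scrubbing)
+
+**Step 2:** Run the prediction
+```bash
+python all_predict.py
+```
+
+**Step 3:** Check the result (illustrative summary; `all_predict.py` itself prints a JSON object with the keys `so_luong_du_doan_raw` and `so_luong_du_doan_round`)
+```
+📊 Predicted headcount: 8 people
+```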
+
+---
+
+### 2. Extract Text Features on their own
+
+```python
+import json
+
+from predict import extract_text_features_to_json
+
+# Daily tasks
+task_normal = "Lau sàn WC tầng 1; Thu gom rác hành lang; Vệ sinh sảnh"
+
+# Periodic (weekly/monthly) tasks
+task_dinhky = "Cọ bồn cầu WC; Đánh sàn lobby; Trực phát sinh"
+
+# Extract the features (both task types are merged before analysis)
+features_json = extract_text_features_to_json(task_normal, task_dinhky)
+
+# Parse the JSON string into a dict
+features = json.loads(features_json)
+
+print(f"Total tasks (normal + dinhky): {features['task_counts']['num_tasks']}")
+print(f"Cleaning tasks: {features['task_counts']['num_cleaning_tasks']}")
+print(f"Deep-cleaning tasks: {features['task_counts']['num_deep_cleaning_tasks']}")
+print(f"Cleaning ratio: {features['ratios_and_diversity']['cleaning_ratio']}")
+```
+
+---
+
+### 3. Retrain the model (if needed)
+
+**Step 1:** Prepare the data
+- Make sure `final_2.csv` contains all required features
+- Make sure the areas are TOTALS across all floors
+
+**Step 2:** Run the training notebook
+```bash
+jupyter notebook ML_Training_Pipeline_Complete.ipynb
+```
+
+**Step 3:** The new model is saved to `artifacts/`
+
+---
+
+## 📈 Text Features in Detail
+
+### Group 1: Task Counts (7 features)
+| Feature | Description | Keywords |
+|---------|-------------|----------|
+| `num_tasks` | Total number of tasks | - |
+| `num_cleaning_tasks` | Routine cleaning | vệ sinh, lau, chùi, quét, hút |
+| `num_trash_collection_tasks` | Trash collection | thu gom rác, thay rác, đổ rác |
+| `num_monitoring_tasks` | On-duty/monitoring | trực, kiểm tra, check, giám sát |
+| `num_deep_cleaning_tasks` | Deep cleaning | cọ rửa, cọ bồn cầu, đánh sàn |
+| `num_support_tasks` | Support | giao ca, bàn giao, chụp ảnh, chuẩn bị |
+| `num_other_tasks` | Other tasks | (matches none of the keywords above) |
+
+### Group 2: Area Coverage (7 features)
+| Feature | Description | Keywords |
+|---------|-------------|----------|
+| `num_wc_tasks` | WC/Toilet | wc, toilet, nhà vệ sinh, bồn cầu |
+| `num_hallway_tasks` | Hallway | hành lang, corridor, lối đi |
+| `num_lobby_tasks` | Lobby | sảnh, lobby, tiền sảnh |
+| `num_outdoor_tasks` | Outdoor | ngoại cảnh, sân, vỉa hè, khuôn viên |
+| `num_elevator_tasks` | Elevator | thang máy, elevator, cầu thang |
+| `num_medical_tasks_total` | Medical areas | phòng bệnh, phòng khám, phòng mổ, xét nghiệm |
+| `num_indoor_room_tasks` | Office rooms | phòng họp, văn phòng, phòng giám đốc |
+
+### Group 3: Ratios & Diversity (4 features)
+| Feature | Description | Formula |
+|---------|-------------|---------|
+| `cleaning_ratio` | Share of cleaning tasks | cleaning_tasks / total_tasks |
+| `trash_collection_ratio` | Share of trash-collection tasks | trash_tasks / total_tasks |
+| `monitoring_ratio` | Share of monitoring tasks | monitoring_tasks / total_tasks |
+| `area_diversity` | Area diversity | Number of areas with at least 1 task |
+
+---
+
+## 📝 Important Notes
+
+### ⚠️ About the Two Task Types
+
+> **📋 `all_task_normal` - DAILY TASKS:**
+> - Tasks performed **EVERY DAY** in the shift
+> - Examples: Lau sàn (mop floors), Thu rác (collect trash), Vệ sinh WC (clean restrooms)
+> - Characteristics: light work, repeated daily
+> - Frequency: **every day**
+
+> **📋 `all_task_dinhky` - PERIODIC TASKS:**
+> - Tasks performed **WEEKLY** or **MONTHLY**
+> - Examples: Cọ bồn cầu (scrub toilets, once a week), Đánh sàn (machine-scrub floors, once a month)
+> - Characteristics: heavier work, not done daily
+> - Frequency: **periodic (weekly/monthly)**
+
+**How the model handles them:**
+- The system **MERGES BOTH TYPES** of tasks before analysis
+- Features are extracted from the combined text: `all_task_normal + all_task_dinhky`
+- This lets the model see the **ENTIRE WORKLOAD** of the shift
+
+---
+
+### ⚠️ About Areas
+> **Every area value in `final_2.csv` and `input.json` is the total over ALL floors of the building!**
+> 
+> Example:
+> - A 15-floor building with 300 m² of hallway per floor
+> - → `dien_tich_hanh_lang = 15 × 300 = 4500` (m²)
+> - NOT 300 m²!
+> 
+> Formula (when every floor has the same area): `dien_tich_X = so_tang × dien_tich_X_moi_tang`; otherwise, sum the per-floor areas
+
+---
+
+### 📌 About Shifts
+- Each `input.json` represents **ONE shift** of **ONE building**
+- Shift types: Hành chính (office hours), Ca sáng (morning), Ca chiều (afternoon), Ca tối (evening), Ca đêm (night)
+- It includes the timing fields: `bat_dau`, `ket_thuc`, `tong_gio_lam`
+
+---
+
+### 🔤 About Text Format
+- Task text must be in **Vietnamese** (with or without diacritics)
+- Separate tasks with a **semicolon (`;`)**, a `|`, or a line break
+- The system lowercases and normalizes the text automatically
+- **Write clearly:** "Lau sàn hành lang" (mop the hallway floor) rather than "Lau sàn" (mop the floor)
+- **Include the area:** "Vệ sinh WC tầng 2" (clean the 2nd-floor restroom) rather than just "Vệ sinh" (clean)
+
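Reviewer note on the README's text-format rules (separators `;`, `|`, or line breaks, lowercasing, keyword matching): the sketch below shows one way such a splitter and keyword counter could look. It is an illustration under stated assumptions, not the code in `predict.py`. The names `split_tasks`, `count_by_category`, and `TASK_KEYWORDS` are hypothetical, the keyword lists are copied from the README tables, and the sketch does not handle text written without diacritics.

```python
import re
import unicodedata

# Hypothetical keyword map copied from the README tables; the real predict.py
# may use different lists and different matching rules.
TASK_KEYWORDS = {
    "cleaning": ["vệ sinh", "lau", "chùi", "quét", "hút"],
    "trash": ["thu gom rác", "thay rác", "đổ rác"],
    "deep_cleaning": ["cọ rửa", "cọ bồn cầu", "đánh sàn"],
}


def _norm(text):
    """Lowercase and NFC-normalize so composed/decomposed accents compare equal."""
    return unicodedata.normalize("NFC", text.strip().lower())


def split_tasks(text):
    """Split on ';', '|' or line breaks and drop empty entries."""
    if not text:
        return []
    return [_norm(p) for p in re.split(r"[;|\n]+", text) if p.strip()]


def count_by_category(task_normal, task_dinhky):
    """Merge both task fields, then count tasks matching each keyword group."""
    tasks = split_tasks(task_normal) + split_tasks(task_dinhky)
    counts = {"num_tasks": len(tasks)}
    for category, keywords in TASK_KEYWORDS.items():
        normalized = [_norm(k) for k in keywords]
        counts[category] = sum(any(k in t for k in normalized) for t in tasks)
    return counts
```

Substring matching keeps the sketch short, but it means a task can fall into more than one category, so per-category counts need not add up to `num_tasks`.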
--- a/pycache/predict.cpython-313.pyc
+++ b/pycache/predict.cpython-313.pyc
--- a/all_predict.py
+++ b/all_predict.py
@ -0,0 +1,228 @@
+"""
+ALL_PREDICT MODULE
+End-to-end inference pipeline:
+JSON input -> preprocess -> model -> prediction
+
+Created: January 2026
+"""
+
+from __future__ import annotations
+
+import json
+import joblib
+import numpy as np
+import pandas as pd
+from typing import Dict, Any, List
+
+# =========================================================
+# GLOBAL CONFIG (TRAINING CONTRACT)
+# =========================================================
+
+# Columns that must NEVER be fed to the model
+DEFAULT_DROP_COLS = {
+    "ma_dia_diem",
+    "all_task_normal",
+    "all_task_dinhky",
+    "is_tasks_text_missing",
+    "so_luong",  # target column, in case it is present
+}
+
+TASK_NORMAL_COL = "all_task_normal"
+TASK_DINHKY_COL = "all_task_dinhky"
+
+
+# =========================================================
+# IMPORT KEYWORD FEATURE EXTRACTOR
+# =========================================================
+from predict import extract_keyword_features_reduced_from_two_texts
+
+
+# =========================================================
+# 1) KEYWORD FEATURES
+# =========================================================
+def add_keyword_features_to_df(df: pd.DataFrame) -> pd.DataFrame:
+    """
+    Add keyword-based features from all_task_normal + all_task_dinhky
+    """
+    df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
+    for col in (TASK_NORMAL_COL, TASK_DINHKY_COL):
+        if col not in df.columns:
+            df[col] = None
+
+    feats = df.apply(
+        lambda r: extract_keyword_features_reduced_from_two_texts(
+            r.get(TASK_NORMAL_COL),
+            r.get(TASK_DINHKY_COL),
+        ),
+        axis=1,
+    )
+    feats_df = pd.DataFrame(list(feats))
+    return pd.concat([df.reset_index(drop=True), feats_df.reset_index(drop=True)], axis=1)
+
+
+# =========================================================
+# 2) PREPROCESSING (MATCH TRAINING)
+# =========================================================
+def time_to_hour(x) -> float:
+    if pd.isna(x):
+        return np.nan
+
+    if hasattr(x, "hour"):
+        return float(x.hour) + float(getattr(x, "minute", 0)) / 60.0
+
+    s = str(x).strip()
+    if " " in s and ":" in s:
+        s = s.split(" ", 1)[1].strip()
+
+    if ":" in s:
+        try:
+            h, m = s.split(":")[0:2]
+            return float(h) + float(m) / 60.0
+        except Exception:
+            return np.nan
+
+    try:
+        return float(s)
+    except Exception:
+        return np.nan
+
+
+def build_X_from_raw_df(df_raw: pd.DataFrame) -> pd.DataFrame:
+    """
+    Feature selection: drop the columns that the model does not use
+    """
+    df = df_raw.copy()
+    drop_cols = [c for c in df.columns if c in DEFAULT_DROP_COLS]
+    if drop_cols:
+        df = df.drop(columns=drop_cols)
+    return df
+
+
+def preprocess_like_training(X: pd.DataFrame) -> pd.DataFrame:
+    """
+    Same logic as training CELL 3
+    """
+    X_proc = X.copy()
+
+    if "bat_dau" in X_proc.columns:
+        X_proc["hour_start"] = X_proc["bat_dau"].apply(time_to_hour)
+    if "ket_thuc" in X_proc.columns:
+        X_proc["hour_end"] = X_proc["ket_thuc"].apply(time_to_hour)
+
+    if "hour_start" in X_proc.columns and "hour_end" in X_proc.columns:
+        end_adj = X_proc["hour_end"].copy()
+        cross = (
+            X_proc["hour_start"].notna()
+            & X_proc["hour_end"].notna()
+            & (X_proc["hour_end"] < X_proc["hour_start"])
+        )
+        end_adj.loc[cross] += 24
+        X_proc["shift_length"] = (end_adj - X_proc["hour_start"]).clip(lower=0)
+        X_proc["is_cross_day"] = cross.astype(int)
+
+    for c in ["bat_dau", "ket_thuc"]:
+        if c in X_proc.columns:
+            X_proc = X_proc.drop(columns=[c])
+
+    cat_cols = [c for c in X_proc.columns if X_proc[c].dtype == "object"]
+    if cat_cols:
+        X_proc = pd.get_dummies(X_proc, columns=cat_cols, dummy_na=True)
+
+    X_proc = X_proc.replace([np.inf, -np.inf], np.nan).fillna(0)
+    return X_proc
+
+
+# =========================================================
+# 3) ALIGN TO TRAINING SCHEMA
+# =========================================================
+def load_schema_columns(columns_joblib_path: str) -> List[str]:
+    cols = joblib.load(columns_joblib_path)
+    if not isinstance(cols, list):
+        raise ValueError("Invalid schema columns file")
+    return cols
+
+
+def align_to_schema(X_proc: pd.DataFrame, schema_columns: List[str]) -> pd.DataFrame:
+    X = X_proc.copy()
+
+    for c in schema_columns:
+        if c not in X.columns:
+            X[c] = 0
+
+    extra_cols = [c for c in X.columns if c not in schema_columns]
+    if extra_cols:
+        X = X.drop(columns=extra_cols)
+
+    X = X[schema_columns]
+    X = X.apply(pd.to_numeric, errors="coerce").fillna(0)
+    return X
+
+
+# =========================================================
+# 4) FULL PIPELINE: RECORD -> MODEL INPUT
+# =========================================================
+def record_to_model_input_df(
+    record: Dict[str, Any],
+    schema_columns: List[str],
+) -> pd.DataFrame:
+    df_raw = pd.DataFrame([record])
+    df_kw = add_keyword_features_to_df(df_raw)
+    X_raw = build_X_from_raw_df(df_kw)
+    X_proc = preprocess_like_training(X_raw)
+    X_aligned = align_to_schema(X_proc, schema_columns)
+    return X_aligned
+
+
+# =========================================================
+# 5) PREDICT
+# =========================================================
+def predict_from_record(
+    record: Dict[str, Any],
+    model_joblib_path: str,
+    columns_joblib_path: str,
+) -> Dict[str, Any]:
+    schema_columns = load_schema_columns(columns_joblib_path)
+    X = record_to_model_input_df(record, schema_columns)
+
+    model = joblib.load(model_joblib_path)
+
+    pred_log = model.predict(X.values)[0]
+    pred_raw = float(np.maximum(0.0, np.expm1(pred_log)))
+
+    return {
+        "so_luong_du_doan_raw": round(pred_raw, 2),
+        "so_luong_du_doan_round": int(np.rint(pred_raw)),
+    }
+
+
+def predict_from_json_file(
+    json_path: str,
+    model_joblib_path: str,
+    columns_joblib_path: str,
+) -> str:
+    with open(json_path, "r", encoding="utf-8") as f:
+        record = json.load(f)
+
+    result = predict_from_record(
+        record,
+        model_joblib_path,
+        columns_joblib_path,
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+# =========================================================
+# 6) MAIN
+# =========================================================
+if __name__ == "__main__":
+    MODEL_JOBLIB = "./artifacts/extratrees_staff_model.joblib"
+    COLUMNS_JOBLIB = "./artifacts/X_proc_columns.joblib"
+    INPUT_JSON = "./input.json"
+
+    print(
+        predict_from_json_file(
+            json_path=INPUT_JSON,
+            model_joblib_path=MODEL_JOBLIB,
+            columns_joblib_path=COLUMNS_JOBLIB,
+        )
+    )
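Reviewer note on `preprocess_like_training`: the overnight-shift handling (add 24 hours to the end time when it is earlier than the start time) is easy to misread in the vectorized pandas form. The scalar sketch below mirrors that rule for a single shift using only the standard library. The helper names `to_hour` and `shift_length` are hypothetical and are not part of `all_predict.py`. Unlike the module, which later fills missing values with 0, this sketch returns NaN when a time cannot be parsed.

```python
import math


def to_hour(value):
    """Convert 'HH:MM' or 'HH:MM:SS' text to fractional hours; NaN if unparseable."""
    try:
        hours, minutes = str(value).strip().split(":")[:2]
        return float(hours) + float(minutes) / 60.0
    except ValueError:
        return math.nan


def shift_length(start, end):
    """Return (length_in_hours, is_cross_day) for one shift."""
    h_start, h_end = to_hour(start), to_hour(end)
    if math.isnan(h_start) or math.isnan(h_end):
        return math.nan, 0
    cross = h_end < h_start  # the shift ends on the following day
    if cross:
        h_end += 24
    return h_end - h_start, int(cross)


day_shift = shift_length("07:00:00", "15:30:00")
night_shift = shift_length("22:00:00", "06:00:00")
```

With these inputs, `day_shift` is `(8.5, 0)` and `night_shift` is `(8.0, 1)`; seconds are ignored, as in `time_to_hour`.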
--- a/artifacts/X_proc_columns.joblib
+++ b/artifacts/X_proc_columns.joblib
--- a/artifacts/extratrees_log1p.joblib
+++ b/artifacts/extratrees_log1p.joblib
--- a/artifacts/extratrees_staff_model.joblib
+++ b/artifacts/extratrees_staff_model.joblib
--- a/artifacts/model_meta.joblib
+++ b/artifacts/model_meta.joblib
--- a/create_aggregate_buildings.py
+++ b/create_aggregate_buildings.py
@ -1,295 +0,0 @@
-import json
-import pandas as pd
-from pathlib import Path
-from collections import defaultdict
-
-def main():
-    # Đọc file JSON
-    json_file = Path(r'c:\Users\SLG PC\Documents\Predict_calamviecHM\Link LLV 2025_clean.json')
-    print("Đang đọc file JSON...")
-    with open(json_file, 'r', encoding='utf-8') as f:
-        json_data = json.load(f)
-    
-    print(f"✓ Đã đọc {len(json_data)} records từ JSON")
-    
-    # Danh sách các trường cần aggregate (khả năng ≥50%)
-    # Nhóm 1: Trường định danh (không aggregate, lấy giá trị đầu tiên)
-    identity_fields = ['Mã địa điểm', 'Loại hình']
-    
-    # Nhóm 2: Các trường diện tích/số lượng cần SUM
-    sum_fields = [
-        'Sàn ngoại cảnh',
-        'Sàn Sảnh', 
-        'Sàn Hành lang',
-        'Sàn WC',
-        'Sàn Phòng',
-        'Thảm',
-        'Dốc hầm',
-        'Viền phản quang',
-        'Ốp tường',
-        'Ốp chân tường',
-        'Rãnh thoát nước',
-        'Kính'
-    ]
-    
-    # Nhóm 3: Các trường cần xử lý đặc biệt
-    special_fields = {
-        'Tên Tòa Tháp': 'first',  # Lấy tên tòa đầu tiên
-        'Tầng': 'count_distinct',  # Đếm số tầng khác nhau
-        'Cửa thang máy': 'sum',  # Tổng số cửa thang máy
-        'Mức độ Lưu lượng KH hoạt động trên mặt sàn giờ': 'first'  # Lấy giá trị đầu tiên
-    }
-    
-    # Aggregate dữ liệu theo Mã địa điểm
-    print("\nĐang aggregate dữ liệu theo Mã địa điểm...")
-    aggregated_data = defaultdict(lambda: {
-        'Mã địa điểm': '',
-        'Loại hình': '',
-        'Tên Tòa Tháp': '',
-        'Số tầng': 0,
-        'Tổng số cửa thang máy': 0,
-        'Mức độ Lưu lượng KH': '',
-        'Diện tích ngoại cảnh Tòa tháp (m2)': 0,
-        'Sàn Sảnh (m2)': 0,
-        'Sàn Hành lang (m2)': 0,
-        'Sàn WC (m2)': 0,
-        'Sàn Phòng (m2)': 0,
-        'Thảm (m2)': 0,
-        'Dốc hầm (m)': 0,
-        'Viền phản quang (m)': 0,
-        'Ốp tường (m2)': 0,
-        'Ốp chân tường (m2)': 0,
-        'Rãnh thoát nước (m)': 0,
-        'Kính (m2)': 0,
-        'tangs': set(),  # Tập hợp các tầng (để đếm)
-    })
-    
-    # Xử lý từng record
-    for record in json_data:
-        ma_dia_diem = record.get('Mã địa điểm', '')
-        if not ma_dia_diem:
-            continue
-            
-        agg = aggregated_data[ma_dia_diem]
-        
-        # Lấy thông tin định danh (lần đầu tiên)
-        if not agg['Mã địa điểm']:
-            agg['Mã địa điểm'] = ma_dia_diem
-            agg['Loại hình'] = record.get('Loại hình', '')
-            agg['Tên Tòa Tháp'] = record.get('Tên Tòa Tháp', '')
-            agg['Mức độ Lưu lượng KH'] = record.get('Mức độ Lưu lượng KH hoạt động trên mặt sàn giờ', '')
-        
-        # Tập hợp các tầng
-        tang = record.get('Tầng', '')
-        if tang:
-            agg['tangs'].add(str(tang))
-        
-        # SUM các trường diện tích - Hàm helper để convert an toàn
-        def safe_float(value):
-            try:
-                return float(value) if value else 0
-            except (ValueError, TypeError):
-                return 0
-        
-        agg['Diện tích ngoại cảnh Tòa tháp (m2)'] += safe_float(record.get('Sàn ngoại cảnh', 0))
-        agg['Sàn Sảnh (m2)'] += safe_float(record.get('Sàn Sảnh', 0))
-        agg['Sàn Hành lang (m2)'] += safe_float(record.get('Sàn Hành lang', 0))
-        agg['Sàn WC (m2)'] += safe_float(record.get('Sàn WC', 0))
-        agg['Sàn Phòng (m2)'] += safe_float(record.get('Sàn Phòng', 0))
-        agg['Thảm (m2)'] += safe_float(record.get('Thảm', 0))
-        agg['Dốc hầm (m)'] += safe_float(record.get('Dốc hầm', 0))
-        agg['Viền phản quang (m)'] += safe_float(record.get('Viền phản quang', 0))
-        agg['Ốp tường (m2)'] += safe_float(record.get('Ốp tường', 0))
-        agg['Ốp chân tường (m2)'] += safe_float(record.get('Ốp chân tường', 0))
-        agg['Rãnh thoát nước (m)'] += safe_float(record.get('Rãnh thoát nước', 0))
-        agg['Kính (m2)'] += safe_float(record.get('Kính', 0))
-        agg['Tổng số cửa thang máy'] += safe_float(record.get('Cửa thang máy', 0))
-    
-    # Tính số tầng và làm sạch dữ liệu
-    print("Đang xử lý dữ liệu cuối cùng...")
-    final_data = []
-    for ma_dia_diem, agg in aggregated_data.items():
-        agg['Số tầng'] = len(agg['tangs'])
-        del agg['tangs']  # Xóa trường tạm
-        final_data.append(agg)
-    
-    # Tạo DataFrame
-    df = pd.DataFrame(final_data)
-    
-    # Sắp xếp theo Mã địa điểm
-    df = df.sort_values('Mã địa điểm')
-    
-    # Sắp xếp lại thứ tự cột
-    column_order = [
-        'Mã địa điểm',
-        'Loại hình',
-        'Tên Tòa Tháp',
-        'Mức độ Lưu lượng KH',
-        'Số tầng',
-        'Tổng số cửa thang máy',
-        'Diện tích ngoại cảnh Tòa tháp (m2)',
-        'Sàn Sảnh (m2)',
-        'Sàn Hành lang (m2)',
-        'Sàn WC (m2)',
-        'Sàn Phòng (m2)',
-        'Thảm (m2)',
-        'Dốc hầm (m)',
-        'Viền phản quang (m)',
-        'Ốp tường (m2)',
-        'Ốp chân tường (m2)',
-        'Rãnh thoát nước (m)',
-        'Kính (m2)',
-    ]
-    
-    df = df[column_order]
-    
-    # Làm tròn số
-    numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
-    for col in numeric_columns:
-        if col not in ['Mã địa điểm', 'Loại hình']:
-            df[col] = df[col].round(2)
-    
-    print(f"\n✓ Đã aggregate thành {len(df)} tòa nhà")
-    
-    # Lọc bỏ các tòa có tất cả các trường số bằng 0
-    print("Đang lọc bỏ các tòa nhà có tất cả trường số = 0...")
-    numeric_cols_to_check = [col for col in numeric_columns if col not in ['Mã địa điểm', 'Loại hình']]
-    
-    # Tìm các dòng có ít nhất một trường số khác 0
-    df_filtered = df[(df[numeric_cols_to_check] != 0).any(axis=1)]
-    
-    removed_count = len(df) - len(df_filtered)
-    print(f"✓ Đã loại bỏ {removed_count} tòa nhà có tất cả trường số = 0")
-    print(f"✓ Còn lại {len(df_filtered)} tòa nhà có dữ liệu hợp lệ")
-    
-    # Cập nhật df
-    df = df_filtered
-    
-    # Xuất ra Excel
-    output_file = 'Du_Lieu_Toa_Nha_Aggregate.xlsx'
-    print(f"\nĐang xuất dữ liệu ra file Excel...")
-    
-    with pd.ExcelWriter(output_file, engine='openpyxl') as writer:
-        # Sheet 1: Dữ liệu đầy đủ
-        df.to_excel(writer, sheet_name='Dữ liệu tòa nhà', index=False)
-        
-        # Sheet 2: Thống kê
-        stats_data = {
-            'Chỉ số': [
-                'Tổng số tòa nhà',
-                'Tổng diện tích Sàn Sảnh (m2)',
-                'Tổng diện tích Sàn Hành lang (m2)',
-                'Tổng diện tích Sàn WC (m2)',
-                'Tổng diện tích Sàn Phòng (m2)',
-                'Tổng diện tích Thảm (m2)',
-                'Tổng diện tích Kính (m2)',
-                'Trung bình số tầng/tòa',
-                'Trung bình số cửa thang máy/tòa',
-            ],
-            'Giá trị': [
-                len(df),
-                df['Sàn Sảnh (m2)'].sum(),
-                df['Sàn Hành lang (m2)'].sum(),
-                df['Sàn WC (m2)'].sum(),
-                df['Sàn Phòng (m2)'].sum(),
-                df['Thảm (m2)'].sum(),
-                df['Kính (m2)'].sum(),
-                df['Số tầng'].mean(),
-                df['Tổng số cửa thang máy'].mean(),
-            ]
-        }
-        df_stats = pd.DataFrame(stats_data)
-        df_stats['Giá trị'] = df_stats['Giá trị'].round(2)
-        df_stats.to_excel(writer, sheet_name='Thống kê', index=False)
-        
-        # Sheet 3: Danh sách các trường đã aggregate
-        field_info = {
-            'Tên trường': column_order,
-            'Cách xử lý': [
-                'Khóa chính',
-                'Giá trị đầu tiên',
-                'Giá trị đầu tiên',
-                'Giá trị đầu tiên',
-                'COUNT DISTINCT các tầng',
-                'SUM tất cả cửa thang máy',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả độ dài',
-                'SUM tất cả độ dài',
-                'SUM tất cả diện tích',
-                'SUM tất cả diện tích',
-                'SUM tất cả độ dài',
-                'SUM tất cả diện tích',
-            ],
-            'Nguồn JSON': [
-                'Mã địa điểm',
-                'Loại hình',
-                'Tên Tòa Tháp',
-                'Mức độ Lưu lượng KH hoạt động trên mặt sàn giờ',
-                'Tầng',
-                'Cửa thang máy',
-                'Sàn ngoại cảnh',
-                'Sàn Sảnh',
-                'Sàn Hành lang',
-                'Sàn WC',
-                'Sàn Phòng',
-                'Thảm',
-                'Dốc hầm',
-                'Viền phản quang',
-                'Ốp tường',
-                'Ốp chân tường',
-                'Rãnh thoát nước',
-                'Kính',
-            ],
-            'Khả năng đáp ứng CSV': [
-                '✅ 100%',
-                '✅ 100%',
-                '🔶 40%',
-                '⚠️ 50%',
-                '🔶 40%',
-                '🔶 30%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-                '✅ 100%',
-            ]
-        }
-        df_fields = pd.DataFrame(field_info)
-        df_fields.to_excel(writer, sheet_name='Thông tin các trường', index=False)
-    
-    print(f"✅ ĐÃ HOÀN THÀNH!")
-    print("="*100)
-    print(f"\n📁 File xuất ra: {output_file}")
-    print(f"📊 Tổng số tòa nhà: {len(df)}")
-    print(f"📋 Số trường dữ liệu: {len(column_order)}")
-    print("\n📄 Nội dung file gồm 3 sheets:")
-    print("   1. Dữ liệu tòa nhà        - Bảng dữ liệu đầy đủ của tất cả tòa nhà")
-    print("   2. Thống kê               - Các chỉ số thống kê tổng hợp")
-    print("   3. Thông tin các trường   - Giải thích cách xử lý từng trường")
-    
-    print("\n" + "="*100)
-    print("📊 MẪU DỮ LIỆU (5 tòa nhà đầu tiên):")
-    print("="*100)
-    print(df.head().to_string())
-    
-    print("\n" + "="*100)
-    print("📈 THỐNG KÊ TỔNG QUAN:")
-    print("="*100)
-    print(df.describe().round(2).to_string())
-    
-    print("\n" + "="*100)
-
-if __name__ == "__main__":
-    main()
--- a/1_5_2025.json
+++ b/1_5_2025.json
--- a/2025_clean.json
+++ b/2025_clean.json
--- a/example_text_features.py
+++ b/example_text_features.py
@ -1,367 +0,0 @@
-"""
-Example: Using TextFeatureExtractor for Staff Prediction
-=========================================================
-
-This script demonstrates how to use extract_text_features.py
-to extract TF-IDF+SVD features and train a prediction model.
-
-Run this script to see a complete example workflow.
-"""
-
-import pandas as pd
-import numpy as np
-from extract_text_features import TextFeatureExtractor, extract_features_from_dataframe
-from sklearn.model_selection import train_test_split
-from sklearn.preprocessing import StandardScaler
-from sklearn.tree import DecisionTreeRegressor
-from sklearn.metrics import r2_score, mean_absolute_error
-import pickle
-import os
-
-
-def example_1_basic_usage():
-    """Example 1: Basic text feature extraction"""
-    print("=" * 80)
-    print("EXAMPLE 1: BASIC USAGE")
-    print("=" * 80)
-    
-    # Sample Vietnamese task texts
-    texts = [
-        "Kiểm tra hệ thống điện tòa nhà A định kỳ",
-        "Bảo trì thang máy tầng 5 và kiểm tra an toàn",
-        "Sửa chữa điều hòa phòng họp B tầng 3",
-        "Vệ sinh kính và kiểm tra hệ thống chiếu sáng",
-        "Bảo trì máy phát điện dự phòng"
-    ]
-    
-    print(f"\nInput: {len(texts)} task descriptions")
-    print(f"Sample: '{texts[0]}'")
-    
-    # Initialize extractor
-    extractor = TextFeatureExtractor(
-        max_features=20,  # Small for demo
-        n_components=5    # Small for demo
-    )
-    
-    # Fit and transform
-    features = extractor.fit_transform(texts)
-    
-    print(f"\nOutput shape: {features.shape}")
-    print(f"Feature names: {extractor.get_feature_names()}")
-    print(f"\nFirst sample features:")
-    print(features[0])
-    
-    # Show summary
-    print(f"\nExtractor summary:")
-    for key, value in extractor.get_summary().items():
-        print(f"  {key}: {value}")
-    
-    print("\n✅ Example 1 complete!\n")
-
-
-def example_2_dataframe_extraction():
-    """Example 2: Extract features from DataFrame"""
-    print("=" * 80)
-    print("EXAMPLE 2: DATAFRAME EXTRACTION")
-    print("=" * 80)
-    
-    # Create sample DataFrame
-    df = pd.DataFrame({
-        'ma_dia_diem': ['A01', 'A02', 'A03', 'A04', 'A05'],
-        'all_task_normal': [
-            'Kiểm tra điện',
-            'Bảo trì thang máy',
-            'Sửa điều hòa',
-            'Vệ sinh kính',
-            'Bảo trì máy phát'
-        ],
-        'all_task_dinhky': [
-            'Định kỳ hàng tháng',
-            'Định kỳ hàng tuần',
-            'Khẩn cấp',
-            'Hàng ngày',
-            'Định kỳ quý'
-        ],
-        'so_luong': [5, 3, 2, 8, 4]
-    })
-    
-    print(f"\nInput DataFrame shape: {df.shape}")
-    print(df)
-    
-    # Extract features
-    text_features_df, extractor = extract_features_from_dataframe(
-        df,
-        text_columns=['all_task_normal', 'all_task_dinhky'],
-        fit=True
-    )
-    
-    print(f"\nExtracted features shape: {text_features_df.shape}")
-    print(f"\nSample features:")
-    print(text_features_df.head())
-    
-    print("\n✅ Example 2 complete!\n")
-
-
-def example_3_save_and_load():
-    """Example 3: Save and load extractor"""
-    print("=" * 80)
-    print("EXAMPLE 3: SAVE AND LOAD")
-    print("=" * 80)
-    
-    # Training data
-    train_texts = [
-        "Kiểm tra hệ thống điện",
-        "Bảo trì thang máy",
-        "Sửa chữa điều hòa"
-    ]
-    
-    # Fit extractor
-    print("\n1. Training extractor...")
-    extractor = TextFeatureExtractor(max_features=10, n_components=3)
-    train_features = extractor.fit_transform(train_texts)
-    print(f"   Train features shape: {train_features.shape}")
-    
-    # Save
-    save_path = 'example_extractor.pkl'
-    extractor.save(save_path)
-    print(f"   Saved to: {save_path}")
-    
-    # Load
-    print("\n2. Loading extractor...")
-    loaded_extractor = TextFeatureExtractor.load(save_path)
-    
-    # Use loaded extractor on new data
-    print("\n3. Using loaded extractor on new data...")
-    new_texts = ["Vệ sinh kính tầng 5", "Kiểm tra máy phát điện"]
-    new_features = loaded_extractor.transform(new_texts)
-    print(f"   New features shape: {new_features.shape}")
-    print(f"   Features:\n{new_features}")
-    
-    # Cleanup
-    if os.path.exists(save_path):
-        os.remove(save_path)
-        print(f"\n   Cleaned up: {save_path}")
-    
-    print("\n✅ Example 3 complete!\n")
-
-
-def example_4_full_pipeline():
-    """Example 4: Complete ML pipeline with text features"""
-    print("=" * 80)
-    print("EXAMPLE 4: FULL ML PIPELINE")
-    print("=" * 80)
-    
-    # Create sample dataset
-    np.random.seed(42)
-    n_samples = 100
-    
-    tasks_pool = [
-        "Kiểm tra hệ thống điện",
-        "Bảo trì thang máy",
-        "Sửa chữa điều hòa",
-        "Vệ sinh kính",
-        "Bảo trì máy phát điện",
-        "Kiểm tra an toàn",
-        "Sửa chữa ống nước",
-        "Bảo trì hệ thống PCCC"
-    ]
-    
-    df = pd.DataFrame({
-        'all_task_normal': [np.random.choice(tasks_pool) for _ in range(n_samples)],
-        'all_task_dinhky': [np.random.choice(['Hàng ngày', 'Hàng tuần', 'Hàng tháng', 'Quý']) 
-                            for _ in range(n_samples)],
-        'dien_tich': np.random.uniform(100, 500, n_samples),
-        'so_tang': np.random.randint(5, 30, n_samples),
-        'so_luong': np.random.randint(1, 10, n_samples)
-    })
-    
-    print(f"\n📊 Dataset: {df.shape}")
-    print(f"   Target (so_luong): mean={df['so_luong'].mean():.2f}, std={df['so_luong'].std():.2f}")
-    
-    # === TRAINING PHASE ===
-    print("\n1️⃣ TRAINING PHASE")
-    print("-" * 80)
-    
-    # Split data
-    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
-    print(f"\n   Train: {len(train_df)}, Test: {len(test_df)}")
-    
-    # Extract text features
-    print("\n   Extracting text features...")
-    text_features_train, extractor = extract_features_from_dataframe(
-        train_df,
-        text_columns=['all_task_normal', 'all_task_dinhky'],
-        fit=True
-    )
-    
-    # Prepare numeric features
-    numeric_cols = ['dien_tich', 'so_tang']
-    X_numeric_train = train_df[numeric_cols].reset_index(drop=True)
-    
-    # Combine features
-    X_train = pd.concat([X_numeric_train, text_features_train], axis=1)
-    y_train = train_df['so_luong'].values
-    
-    print(f"\n   Combined features: {X_train.shape}")
-    print(f"     - Numeric: {len(numeric_cols)}")
-    print(f"     - Text SVD: {text_features_train.shape[1]}")
-    
-    # Scale features
-    scaler = StandardScaler()
-    X_train_scaled = scaler.fit_transform(X_train)
-    
-    # Train model
-    print("\n   Training model...")
-    model = DecisionTreeRegressor(max_depth=5, random_state=42)
-    model.fit(X_train_scaled, y_train)
-    
-    # Evaluate on training set
-    y_train_pred = model.predict(X_train_scaled)
-    train_r2 = r2_score(y_train, y_train_pred)
-    train_mae = mean_absolute_error(y_train, y_train_pred)
-    
-    print(f"\n   Training metrics:")
-    print(f"     R²:  {train_r2:.4f}")
-    print(f"     MAE: {train_mae:.4f}")
-    
-    # === INFERENCE PHASE ===
-    print("\n2️⃣ INFERENCE PHASE")
-    print("-" * 80)
-    
-    # Extract text features (transform only, no fitting!)
-    print("\n   Extracting text features (transform only)...")
-    text_features_test, _ = extract_features_from_dataframe(
-        test_df,
-        text_columns=['all_task_normal', 'all_task_dinhky'],
-        extractor=extractor,
-        fit=False  # Important!
-    )
-    
-    # Prepare numeric features
-    X_numeric_test = test_df[numeric_cols].reset_index(drop=True)
-    
-    # Combine features
-    X_test = pd.concat([X_numeric_test, text_features_test], axis=1)
-    y_test = test_df['so_luong'].values
-    
-    # Scale features
-    X_test_scaled = scaler.transform(X_test)
-    
-    # Predict
-    print("\n   Making predictions...")
-    y_test_pred = model.predict(X_test_scaled)
-    
-    # Evaluate
-    test_r2 = r2_score(y_test, y_test_pred)
-    test_mae = mean_absolute_error(y_test, y_test_pred)
-    
-    print(f"\n   Test metrics:")
-    print(f"     R²:  {test_r2:.4f}")
-    print(f"     MAE: {test_mae:.4f}")
-    
-    # Show sample predictions
-    print(f"\n   Sample predictions:")
-    results_df = pd.DataFrame({
-        'actual': y_test[:5],
-        'predicted': y_test_pred[:5],
-        'error': y_test[:5] - y_test_pred[:5]
-    })
-    print(results_df.to_string(index=False))
-    
-    # === FEATURE IMPORTANCE ===
-    print("\n3️⃣ FEATURE IMPORTANCE")
-    print("-" * 80)
-    
-    importances = model.feature_importances_
-    feature_names = X_train.columns.tolist()
-    
-    importance_df = pd.DataFrame({
-        'feature': feature_names,
-        'importance': importances
-    }).sort_values('importance', ascending=False)
-    
-    print("\n   Top 10 important features:")
-    print(importance_df.head(10).to_string(index=False))
-    
-    # Aggregate by feature type
-    n_numeric = len(numeric_cols)
-    text_importance = importances[n_numeric:].sum()
-    numeric_importance = importances[:n_numeric].sum()
-    
-    print(f"\n   Feature type contribution:")
-    print(f"     Numeric features: {numeric_importance:.4f} ({numeric_importance/(text_importance+numeric_importance)*100:.1f}%)")
-    print(f"     Text features:    {text_importance:.4f} ({text_importance/(text_importance+numeric_importance)*100:.1f}%)")
-    
-    print("\n✅ Example 4 complete!\n")
-
-
-def example_5_top_tfidf_terms():
-    """Example 5: Analyze top TF-IDF terms"""
-    print("=" * 80)
-    print("EXAMPLE 5: TOP TF-IDF TERMS ANALYSIS")
-    print("=" * 80)
-    
-    # Sample task texts
-    texts = [
-        "Kiểm tra hệ thống điện tòa nhà",
-        "Bảo trì thang máy và kiểm tra an toàn",
-        "Sửa chữa hệ thống điều hòa không khí",
-        "Kiểm tra và vệ sinh kính tòa nhà",
-        "Bảo trì máy phát điện dự phòng",
-        "Kiểm tra hệ thống PCCC định kỳ",
-        "Sửa chữa ống nước và hệ thống cấp thoát",
-        "Bảo trì hệ thống thang máy tòa nhà"
-    ]
-    
-    print(f"\nInput: {len(texts)} task descriptions")
-    
-    # Fit extractor
-    extractor = TextFeatureExtractor(max_features=50, n_components=10)
-    extractor.fit(texts)
-    
-    # Get top TF-IDF features
-    print("\n📋 Top 20 TF-IDF terms (by document frequency):")
-    top_features = extractor.get_top_tfidf_features(top_n=20)
-    print(top_features.to_string(index=False))
-    
-    # Get summary
-    summary = extractor.get_summary()
-    print(f"\n📊 Summary:")
-    print(f"   Actual TF-IDF features: {summary['actual_tfidf_features']}")
-    print(f"   SVD components: {summary['n_components']}")
-    print(f"   Explained variance: {summary['explained_variance']*100:.2f}%")
-    
-    print("\n✅ Example 5 complete!\n")
-
-
-def main():
-    """Run all examples"""
-    print("\n" + "=" * 80)
-    print("TEXT FEATURE EXTRACTION - EXAMPLES")
-    print("=" * 80 + "\n")
-    
-    try:
-        example_1_basic_usage()
-        example_2_dataframe_extraction()
-        example_3_save_and_load()
-        example_4_full_pipeline()
-        example_5_top_tfidf_terms()
-        
-        print("\n" + "=" * 80)
-        print("✅ ALL EXAMPLES COMPLETED SUCCESSFULLY!")
-        print("=" * 80 + "\n")
-        
-        print("Next steps:")
-        print("  1. Try with your own dataset: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx")
-        print("  2. Adjust hyperparameters: max_features, n_components")
-        print("  3. Integrate into your ML pipeline")
-        print("  4. Save extractor for production use")
-        
-    except Exception as e:
-        print(f"\n❌ Error: {e}")
-        import traceback
-        traceback.print_exc()
-
-
-if __name__ == '__main__':
-    main()
--- a/extract_25_features.py
+++ b/extract_25_features.py
@ -1,439 +0,0 @@
-"""
-EXTRACTION FUNCTION: 25 KEYWORD FEATURES FROM TASK TEXT
-Based on an analysis of 30,917 tasks from 302 buildings
-Updated: January 5, 2026
-"""
-
-import pandas as pd
-import re
-from typing import Dict, List
-
-def extract_25_keyword_features(tasks_text: str) -> Dict[str, float]:
-    """
-    Extract 25 keyword-based features from the task text.
-    
-    Args:
-        tasks_text: String containing all tasks (all_task_normal + all_task_dinhky)
-        
-    Returns:
-        Dict with the 25 features
-    """
-    
-    # Handle missing values
-    if pd.isna(tasks_text) or str(tasks_text).strip() == '':
-        return _get_empty_features()
-    
-    # Lowercase the text before splitting it into tasks
-    tasks_text = str(tasks_text).lower()
-    
-    # Split tasks on the delimiters ; | or newline
-    tasks = re.split(r'[;|\n]+', tasks_text)
-    tasks = [t.strip() for t in tasks if t.strip()]
-    
-    # =================================================================
-    # GROUP 1: TASK COUNTS BY TASK TYPE (9 features)
-    # =================================================================
-    
-    # 1. Total number of tasks
-    num_tasks = len(tasks)
-    
-    # 2. Routine daily cleaning (55.9% of the data)
-    cleaning_keywords = [
-        'vệ sinh', 'tvs', 'tổng vệ sinh', 'lau', 'chùi', 'quét', 'hút', 
-        'đẩy khô', 'lau ẩm', 'đẩy ẩm', 'làm sạch', 'lau bụi', 'quét bụi',
-        'lau kính', 'lau sàn', 'quét sàn', 'hút bụi', 'lau nền'
-    ]
-    num_cleaning_tasks = _count_tasks_with_keywords(tasks, cleaning_keywords)
-    
-    # 3. Trash collection/bag replacement (7.9% of the data) - NEW ⭐
-    trash_keywords = [
-        'thu gom rác', 'thay rác', 'vận chuyển rác', 'tua rác', 'đổ rác',
-        'thu rác', 'gom rác', 'chuyển rác', 'bỏ rác', 'đẩy rác',
-        'quét rác nổi', 'trực rác', 'rác nổi'
-    ]
-    num_trash_collection_tasks = _count_tasks_with_keywords(tasks, trash_keywords)
-    
-    # 4. Standby duty/ad hoc checks (16.1% of the data) - NEW ⭐
-    monitoring_keywords = [
-        'trực', 'trực phát sinh', 'trực lại', 'trực ps', 'trực tua',
-        'kiểm tra', 'check', 'giám sát', 'theo dõi', 'tuần tra'
-    ]
-    num_monitoring_tasks = _count_tasks_with_keywords(tasks, monitoring_keywords)
-    
-    # 5. MEDICAL room cleanup (0.4%, but distinctive) - NEW ⭐
-    room_cleaning_keywords = [
-        'dọn mổ', 'dọn đẻ', 'dọn can thiệp', 'ra viện', 'dọn phòng',
-        'bệnh nhân ra viện', 'dọn khi bệnh nhân', 'dọn phòng bệnh'
-    ]
-    num_room_cleaning_tasks = _count_tasks_with_keywords(tasks, room_cleaning_keywords)
-    
-    # 6. Deep cleaning (4.5% of the data) - NEW ⭐
-    deep_cleaning_keywords = [
-        'cọ rửa', 'cọ bồn cầu', 'cọ', 'gạt kính', 'gạt', 'đánh sàn', 
-        'đánh chân tường', 'đánh cọ', 'đánh vết bẩn', 'chà tường',
-        'đánh dép', 'cọ gương', 'cọ lavabo', 'cọ thùng rác'
-    ]
-    num_deep_cleaning_tasks = _count_tasks_with_keywords(tasks, deep_cleaning_keywords)
-    
-    # 7. Maintenance/repair (0.6% of the data)
-    maintenance_keywords = [
-        'bảo dưỡng', 'sửa chữa', 'bảo trì', 'thay thế', 'sửa', 
-        'thay', 'kiểm định', 'bảo dưỡng máy'
-    ]
-    num_maintenance_tasks = _count_tasks_with_keywords(tasks, maintenance_keywords)
-    
-    # 8. Support tasks (5.8% of the data) - NEW ⭐
-    support_keywords = [
-        'giao ca', 'bàn giao', 'bàn giao ca', 'chụp ảnh', 'nhận ca',
-        'vsdc', 'vệ sinh dụng cụ', 'chuẩn bị dụng cụ', 'vệ sinh xe đồ',
-        'chuẩn bị nước', 'chuẩn bị', 'giao ban'
-    ]
-    num_support_tasks = _count_tasks_with_keywords(tasks, support_keywords)
-    
-    # 9. Other tasks (matching none of the categories above)
-    # A rough estimate would be total - (counted categories), but the categories overlap
-    # For an exact figure, count the tasks that match no keyword at all
-    all_keywords = (cleaning_keywords + trash_keywords + monitoring_keywords + 
-                   room_cleaning_keywords + deep_cleaning_keywords + 
-                   maintenance_keywords + support_keywords)
-    num_other_tasks = _count_tasks_without_keywords(tasks, all_keywords)
-    
-    # =================================================================
-    # GROUP 2: AREA COVERAGE (10 features)
-    # =================================================================
-    
-    # 10. WC/Restrooms (20.4% of the data)
-    wc_keywords = [
-        'wc', 'toilet', 'nhà vệ sinh', 'restroom', 'phòng vệ sinh',
-        'bồn cầu', 'lavabo', 'tiểu nam', 'bồn tiểu', 'wc công cộng',
-        'wc nhân viên', 'nhà wc'
-    ]
-    num_wc_tasks = _count_tasks_with_keywords(tasks, wc_keywords)
-    
-    # 11. Hallways (13.7% of the data)
-    hallway_keywords = [
-        'hành lang', 'corridor', 'lối đi', 'hall', 'hành lang tầng',
-        'hl', 'hanh lang'
-    ]
-    num_hallway_tasks = _count_tasks_with_keywords(tasks, hallway_keywords)
-    
-    # 12. Lobbies (7.6% of the data)
-    lobby_keywords = [
-        'sảnh', 'lobby', 'tiền sảnh', 'sảnh đỏ', 'sảnh chính',
-        'sảnh tầng', 'sanh'
-    ]
-    num_lobby_tasks = _count_tasks_with_keywords(tasks, lobby_keywords)
-    
-    # 13. MEDICAL patient rooms (1.5% of the data) - NEW ⭐
-    patient_room_keywords = [
-        'phòng bệnh', 'giường bệnh', 'phòng víp', 'phòng vip',
-        'phòng bệnh nhân', 'pb', 'phòng bv'
-    ]
-    num_patient_room_tasks = _count_tasks_with_keywords(tasks, patient_room_keywords)
-    
-    # 14. MEDICAL clinic/exam rooms (0.3% of the data) - NEW ⭐
-    clinic_room_keywords = [
-        'phòng khám', 'khoa khám', 'phòng nội', 'phòng sản',
-        'phòng khám bệnh', 'khu khám', 'pk'
-    ]
-    num_clinic_room_tasks = _count_tasks_with_keywords(tasks, clinic_room_keywords)
-    
-    # 15. MEDICAL operating rooms (0.4% of the data) - NEW ⭐
-    surgery_room_keywords = [
-        'phòng mổ', 'hậu phẫu', 'phòng phẫu thuật', 'khu mổ',
-        'phòng pt', 'ngoài phòng mổ', 'trong phòng mổ'
-    ]
-    num_surgery_room_tasks = _count_tasks_with_keywords(tasks, surgery_room_keywords)
-    
-    # 16. Outdoor areas (4.3% of the data)
-    outdoor_keywords = [
-        'ngoại cảnh', 'sân', 'vỉa hè', 'khuôn viên', 'cổng',
-        'outdoor', 'bãi xe', 'tầng hầm', 'sân sau', 'sân trước'
-    ]
-    num_outdoor_tasks = _count_tasks_with_keywords(tasks, outdoor_keywords)
-    
-    # 17. Elevators/Stairs (10.6% of the data)
-    elevator_keywords = [
-        'thang máy', 'elevator', 'lift', 'cầu thang', 'bậc tam cấp',
-        'thang bộ', 'cầu thang bộ', 'tay vịn', 'tam cấp'
-    ]
-    num_elevator_tasks = _count_tasks_with_keywords(tasks, elevator_keywords)
-    
-    # 18. Staff/administrative rooms (4.4% of the data) - NEW ⭐
-    office_keywords = [
-        'phòng nhân viên', 'phòng giám đốc', 'phòng họp', 'phòng hành chính',
-        'văn phòng', 'phòng gd', 'phòng pgd', 'phòng ban', 'phòng giao ban',
-        'phòng bác sĩ', 'phòng trưởng khoa', 'hội trường', 'phòng kế toán'
-    ]
-    num_office_tasks = _count_tasks_with_keywords(tasks, office_keywords)
-    
-    # 19. MEDICAL technical rooms (0.2% of the data) - NEW ⭐
-    technical_room_keywords = [
-        'phòng xét nghiệm', 'phòng chụp', 'xq', 'siêu âm', 'kho dược',
-        'phòng xn', 'labo', 'phòng thí nghiệm', 'phòng kỹ thuật',
-        'phòng điện tim', 'phòng nội soi', 'phòng cấp cứu', 'phòng hồi sức'
-    ]
-    num_technical_room_tasks = _count_tasks_with_keywords(tasks, technical_room_keywords)
-    
-    # =================================================================
-    # GROUP 3: RATIOS & COMPLEXITY (6 features)
-    # =================================================================
-    
-    # 20. Routine cleaning ratio
-    cleaning_ratio = num_cleaning_tasks / num_tasks if num_tasks > 0 else 0.0
-    
-    # 21. Trash collection ratio - NEW ⭐
-    trash_collection_ratio = num_trash_collection_tasks / num_tasks if num_tasks > 0 else 0.0
-    
-    # 22. Standby/check ratio - NEW ⭐
-    monitoring_ratio = num_monitoring_tasks / num_tasks if num_tasks > 0 else 0.0
-    
-    # 23. MEDICAL room cleanup ratio - NEW ⭐
-    room_cleaning_ratio = num_room_cleaning_tasks / num_tasks if num_tasks > 0 else 0.0
-    
-    # 24. Area diversity (0-10)
-    area_counts = [
-        num_wc_tasks, num_hallway_tasks, num_lobby_tasks,
-        num_patient_room_tasks, num_clinic_room_tasks, num_surgery_room_tasks,
-        num_outdoor_tasks, num_elevator_tasks, num_office_tasks, num_technical_room_tasks
-    ]
-    area_diversity = sum(1 for count in area_counts if count > 0)
-    
-    # 25. Complexity score (0.0 - 10.0) - NEW ⭐
-    # Based on:
-    # - Text length (longer text means more complex)
-    # - Number of tasks (more tasks means more complex)
-    # - Technical keywords (MEDICAL, machinery...) and area diversity
-    
-    technical_keywords = [
-        'bms', 'hvac', 'camera', 'access control', 'máy phát', 'máy móc',
-        'hệ thống', 'thiết bị', 'bảo dưỡng máy', 'sửa máy', 'kiểm tra máy',
-        'xét nghiệm', 'chụp chiếu', 'điện tim', 'nội soi', 'phẫu thuật'
-    ]
-    num_technical_keywords = _count_tasks_with_keywords(tasks, technical_keywords)
-    
-    # task_complexity_score formula (each component scales linearly up to a cap):
-    # - Text length: 0-3 points (chars / 2000, capped at 6000+ chars)
-    # - Num tasks: 0-3 points (num_tasks / 20, capped at 60+ tasks)
-    # - Technical keywords: 0-2 points (matching tasks / 2, capped at 4+ tasks)
-    # - Area diversity: 0-2 points (areas covered / 5, capped at 10 areas)
-    
-    text_length = len(tasks_text)
-    length_score = min(3.0, text_length / 2000)  # Max 3.0
-    
-    tasks_score = min(3.0, num_tasks / 20)  # Max 3.0
-    
-    technical_score = min(2.0, num_technical_keywords / 2)  # Max 2.0
-    
-    diversity_score = min(2.0, area_diversity / 5)  # Max 2.0
-    
-    task_complexity_score = round(length_score + tasks_score + technical_score + diversity_score, 2)
-    
-    # =================================================================
-    # RETURN A DICT WITH THE 25 FEATURES
-    # =================================================================
-    
-    return {
-        # GROUP 1: Task Counts (9 features)
-        'num_tasks': num_tasks,
-        'num_cleaning_tasks': num_cleaning_tasks,
-        'num_trash_collection_tasks': num_trash_collection_tasks,
-        'num_monitoring_tasks': num_monitoring_tasks,
-        'num_room_cleaning_tasks': num_room_cleaning_tasks,
-        'num_deep_cleaning_tasks': num_deep_cleaning_tasks,
-        'num_maintenance_tasks': num_maintenance_tasks,
-        'num_support_tasks': num_support_tasks,
-        'num_other_tasks': num_other_tasks,
-        
-        # GROUP 2: Area Coverage (10 features)
-        'num_wc_tasks': num_wc_tasks,
-        'num_hallway_tasks': num_hallway_tasks,
-        'num_lobby_tasks': num_lobby_tasks,
-        'num_patient_room_tasks': num_patient_room_tasks,
-        'num_clinic_room_tasks': num_clinic_room_tasks,
-        'num_surgery_room_tasks': num_surgery_room_tasks,
-        'num_outdoor_tasks': num_outdoor_tasks,
-        'num_elevator_tasks': num_elevator_tasks,
-        'num_office_tasks': num_office_tasks,
-        'num_technical_room_tasks': num_technical_room_tasks,
-        
-        # GROUP 3: Ratios & Complexity (6 features)
-        'cleaning_ratio': round(cleaning_ratio, 4),
-        'trash_collection_ratio': round(trash_collection_ratio, 4),
-        'monitoring_ratio': round(monitoring_ratio, 4),
-        'room_cleaning_ratio': round(room_cleaning_ratio, 4),
-        'area_diversity': area_diversity,
-        'task_complexity_score': task_complexity_score
-    }
-
-
-def _count_tasks_with_keywords(tasks: List[str], keywords: List[str]) -> int:
-    """Count the tasks that contain at least one keyword (substring match)."""
-    count = 0
-    for task in tasks:
-        task_lower = task.lower()
-        if any(keyword in task_lower for keyword in keywords):
-            count += 1
-    return count
-
-
-def _count_tasks_without_keywords(tasks: List[str], all_keywords: List[str]) -> int:
-    """Count the tasks that contain NONE of the keywords."""
-    count = 0
-    for task in tasks:
-        task_lower = task.lower()
-        if not any(keyword in task_lower for keyword in all_keywords):
-            count += 1
-    return count
-
-
-def _get_empty_features() -> Dict[str, float]:
-    """Return a dict with every feature set to 0 (for missing data)."""
-    return {
-        # GROUP 1
-        'num_tasks': 0,
-        'num_cleaning_tasks': 0,
-        'num_trash_collection_tasks': 0,
-        'num_monitoring_tasks': 0,
-        'num_room_cleaning_tasks': 0,
-        'num_deep_cleaning_tasks': 0,
-        'num_maintenance_tasks': 0,
-        'num_support_tasks': 0,
-        'num_other_tasks': 0,
-        
-        # GROUP 2
-        'num_wc_tasks': 0,
-        'num_hallway_tasks': 0,
-        'num_lobby_tasks': 0,
-        'num_patient_room_tasks': 0,
-        'num_clinic_room_tasks': 0,
-        'num_surgery_room_tasks': 0,
-        'num_outdoor_tasks': 0,
-        'num_elevator_tasks': 0,
-        'num_office_tasks': 0,
-        'num_technical_room_tasks': 0,
-        
-        # GROUP 3
-        'cleaning_ratio': 0.0,
-        'trash_collection_ratio': 0.0,
-        'monitoring_ratio': 0.0,
-        'room_cleaning_ratio': 0.0,
-        'area_diversity': 0,
-        'task_complexity_score': 0.0
-    }
-
-
-# =================================================================
-# MAIN: APPLY TO ALL BUILDINGS
-# =================================================================
-
-if __name__ == '__main__':
-    print("=" * 100)
-    print("EXTRACT 25 KEYWORD FEATURES FROM TASK TEXT")
-    print("=" * 100)
-    
-    # Read the input file
-    print("\n📂 Reading ket_qua_cong_viec_full.xlsx...")
-    df = pd.read_excel('ket_qua_cong_viec_full.xlsx')
-    print(f"   ✅ Read {len(df)} buildings")
-    
-    # Combine all_task_normal and all_task_dinhky
-    print("\n🔗 Combining all_task_normal + all_task_dinhky...")
-    df['all_tasks_combined'] = (
-        df['all_task_normal'].fillna('') + ' ; ' + df['all_task_dinhky'].fillna('')
-    )
-    
-    # Apply the extraction function to every building
-    print("\n⚙️  Extracting 25 features per building...")
-    features_list = []
-    
-    for idx, row in df.iterrows():
-        if (idx + 1) % 50 == 0:
-            print(f"   Processing... {idx + 1}/{len(df)} buildings")
-        
-        features = extract_25_keyword_features(row['all_tasks_combined'])
-        features['ma_dia_diem'] = row['ma_dia_diem']
-        features_list.append(features)
-    
-    print(f"   ✅ Finished {len(df)} buildings")
-    
-    # Build the DataFrame
-    print("\n📊 Building the DataFrame with 25 features...")
-    df_features = pd.DataFrame(features_list)
-    
-    # Reorder columns: ma_dia_diem first
-    cols = ['ma_dia_diem'] + [col for col in df_features.columns if col != 'ma_dia_diem']
-    df_features = df_features[cols]
-    
-    # Show summary statistics
-    print("\n" + "=" * 100)
-    print("📈 STATISTICS FOR THE 25 FEATURES")
-    print("=" * 100)
-    
-    print("\n🔹 GROUP 1: TASK COUNTS (9 features)")
-    group1_cols = [
-        'num_tasks', 'num_cleaning_tasks', 'num_trash_collection_tasks',
-        'num_monitoring_tasks', 'num_room_cleaning_tasks', 'num_deep_cleaning_tasks',
-        'num_maintenance_tasks', 'num_support_tasks', 'num_other_tasks'
-    ]
-    print(df_features[group1_cols].describe().round(2))
-    
-    print("\n🔹 GROUP 2: AREA COVERAGE (10 features)")
-    group2_cols = [
-        'num_wc_tasks', 'num_hallway_tasks', 'num_lobby_tasks',
-        'num_patient_room_tasks', 'num_clinic_room_tasks', 'num_surgery_room_tasks',
-        'num_outdoor_tasks', 'num_elevator_tasks', 'num_office_tasks', 'num_technical_room_tasks'
-    ]
-    print(df_features[group2_cols].describe().round(2))
-    
-    print("\n🔹 GROUP 3: RATIOS & COMPLEXITY (6 features)")
-    group3_cols = [
-        'cleaning_ratio', 'trash_collection_ratio', 'monitoring_ratio',
-        'room_cleaning_ratio', 'area_diversity', 'task_complexity_score'
-    ]
-    print(df_features[group3_cols].describe().round(4))
-    
-    # Top 10 buildings with the most tasks
-    print("\n" + "=" * 100)
-    print("🏆 TOP 10 BUILDINGS WITH THE MOST TASKS")
-    print("=" * 100)
-    top10 = df_features.nlargest(10, 'num_tasks')[
-        ['ma_dia_diem', 'num_tasks', 'cleaning_ratio', 'area_diversity', 'task_complexity_score']
-    ]
-    print(top10.to_string(index=False))
-    
-    # Top 10 most complex buildings
-    print("\n" + "=" * 100)
-    print("🎯 TOP 10 MOST COMPLEX BUILDINGS (task_complexity_score)")
-    print("=" * 100)
-    top10_complex = df_features.nlargest(10, 'task_complexity_score')[
-        ['ma_dia_diem', 'num_tasks', 'task_complexity_score', 'area_diversity']
-    ]
-    print(top10_complex.to_string(index=False))
-    
-    # Save the output file
-    output_file = 'features_25_keywords.csv'
-    print(f"\n💾 Saving features to file: {output_file}")
-    df_features.to_csv(output_file, index=False, encoding='utf-8-sig')
-    print(f"   ✅ Saved {len(df_features)} buildings with 25 features")
-    
-    # Show the first 5 buildings
-    print("\n" + "=" * 100)
-    print("📋 SAMPLE DATA (first 5 buildings)")
-    print("=" * 100)
-    print(df_features.head().to_string(index=False))
-    
-    print("\n" + "=" * 100)
-    print("✅ DONE!")
-    print("=" * 100)
-    print("\n📊 Summary:")
-    print(f"   - Buildings: {len(df_features)}")
-    print(f"   - Features: {len(df_features.columns) - 1}")  # Exclude the ma_dia_diem column
-    print(f"   - Output file: {output_file}")
-    print("\n🎯 Next steps:")
-    print(f"   1. Inspect {output_file}")
-    print("   2. Analyze correlations between the features")
-    print("   3. Visualize the distributions")
-    print("   4. Continue with the TF-IDF features (10 features)")
-    print("   5. Join with the building features (18 features)")
-    print("   → TOTAL: 25 + 10 + 18 = 53 features")
--- a/extract_25_features_new.py
+++ b/extract_25_features_new.py
@ -0,0 +1,250 @@
+"""
+EXTRACTION FUNCTION: REDUCED KEYWORD FEATURES FROM TASK TEXT
+Based on an analysis of 30,917 tasks from 302 buildings
+Updated: January 2026 (Reduced version for small dataset)
+"""
+
+import pandas as pd
+import re
+from typing import Dict, List
+
+
+# =========================================================
+# 1) HELPERS
+# =========================================================
+def _split_tasks(tasks_text: str) -> List[str]:
+    """Lowercase the text and split it into tasks on ; | or newline."""
+    tasks_text = str(tasks_text).lower()
+    tasks = re.split(r"[;|\n]+", tasks_text)
+    return [t.strip() for t in tasks if t.strip()]
+
+
+def _count_tasks_with_keywords(tasks: List[str], keywords: List[str]) -> int:
+    """Count the tasks that contain at least one keyword (substring match)."""
+    count = 0
+    for task in tasks:
+        if any(k in task for k in keywords):
+            count += 1
+    return count
+
+
+def _count_tasks_without_keywords(tasks: List[str], all_keywords: List[str]) -> int:
+    """Count the tasks that contain NONE of the keywords."""
+    count = 0
+    for task in tasks:
+        if not any(k in task for k in all_keywords):
+            count += 1
+    return count
+
+
+def _get_empty_features() -> Dict[str, float]:
+    """Return a dict with every feature set to 0 (for missing data)."""
+    return {
+        # TASK COUNTS (7)
+        "num_tasks": 0,
+        "num_cleaning_tasks": 0,
+        "num_trash_collection_tasks": 0,
+        "num_monitoring_tasks": 0,
+        "num_deep_cleaning_tasks": 0,
+        "num_support_tasks": 0,
+        "num_other_tasks": 0,
+
+        # AREA (reduced + aggregated) (7)
+        "num_wc_tasks": 0,
+        "num_hallway_tasks": 0,
+        "num_lobby_tasks": 0,
+        "num_outdoor_tasks": 0,
+        "num_elevator_tasks": 0,
+        "num_medical_tasks_total": 0,
+        "num_indoor_room_tasks": 0,
+
+        # RATIOS & DIVERSITY (4)
+        "cleaning_ratio": 0.0,
+        "trash_collection_ratio": 0.0,
+        "monitoring_ratio": 0.0,
+        "area_diversity": 0,
+    }
+
+
+# =========================================================
+# 2) MAIN EXTRACTION (NEW FEATURES)
+# =========================================================
+def extract_keyword_features_reduced(tasks_text: str) -> Dict[str, float]:
+    """
+    Extract the reduced, aggregated set of keyword-based features.
+
+    Args:
+        tasks_text: String containing all tasks (all_task_normal + all_task_dinhky)
+
+    Returns:
+        Dict with the new (reduced) features
+    """
+
+    if pd.isna(tasks_text) or str(tasks_text).strip() == "":
+        return _get_empty_features()
+
+    tasks = _split_tasks(tasks_text)
+    num_tasks = len(tasks)
+
+    # -----------------------------
+    # GROUP 1: TASK COUNTS (reduced)
+    # -----------------------------
+    cleaning_keywords = [
+        "vệ sinh", "lau", "chùi", "quét", "hút",
+        "đẩy khô", "lau ẩm", "làm sạch", "lau bụi", "lau kính", "lau sàn", "hút bụi"
+    ]
+    trash_keywords = [
+        "thu gom rác", "thay rác", "vận chuyển rác", "tua rác", "đổ rác",
+        "thu rác", "gom rác", "quét rác nổi", "trực rác", "rác nổi"
+    ]
+    monitoring_keywords = [
+        "trực", "trực phát sinh", "trực ps", "kiểm tra", "check",
+        "giám sát", "theo dõi", "tuần tra"
+    ]
+    deep_cleaning_keywords = [
+        "cọ rửa", "cọ bồn cầu", "cọ", "gạt kính", "đánh sàn",
+        "đánh chân tường", "chà tường", "cọ gương", "cọ lavabo"
+    ]
+    support_keywords = [
+        "giao ca", "bàn giao", "bàn giao ca", "chụp ảnh", "nhận ca",
+        "vsdc", "vệ sinh dụng cụ", "chuẩn bị dụng cụ", "chuẩn bị nước", "chuẩn bị", "giao ban"
+    ]
+
+    num_cleaning_tasks = _count_tasks_with_keywords(tasks, cleaning_keywords)
+    num_trash_collection_tasks = _count_tasks_with_keywords(tasks, trash_keywords)
+    num_monitoring_tasks = _count_tasks_with_keywords(tasks, monitoring_keywords)
+    num_deep_cleaning_tasks = _count_tasks_with_keywords(tasks, deep_cleaning_keywords)
+    num_support_tasks = _count_tasks_with_keywords(tasks, support_keywords)
+
+    all_keywords_for_other = (
+        cleaning_keywords + trash_keywords + monitoring_keywords + deep_cleaning_keywords + support_keywords
+    )
+    num_other_tasks = _count_tasks_without_keywords(tasks, all_keywords_for_other)
+
+    # -----------------------------
+    # GROUP 2: AREA COVERAGE (reduced + aggregated)
+    # -----------------------------
+    wc_keywords = [
+        "wc", "toilet", "nhà vệ sinh", "restroom", "phòng vệ sinh",
+        "bồn cầu", "lavabo", "tiểu nam", "bồn tiểu"
+    ]
+    hallway_keywords = ["hành lang", "corridor", "lối đi", "hall", "hl", "hanh lang"]
+    lobby_keywords = ["sảnh", "lobby", "tiền sảnh", "sảnh chính", "sanh"]
+    outdoor_keywords = ["ngoại cảnh", "sân", "vỉa hè", "khuôn viên", "cổng", "bãi xe", "tầng hầm"]
+    elevator_keywords = ["thang máy", "elevator", "lift", "cầu thang", "thang bộ", "tay vịn", "tam cấp"]
+
+    # medical detail keywords (we count but only output total)
+    patient_room_keywords = ["phòng bệnh", "giường bệnh", "phòng vip", "phòng bệnh nhân", "pb", "phòng bv"]
+    clinic_room_keywords = ["phòng khám", "khoa khám", "phòng khám bệnh", "khu khám", "pk"]
+    surgery_room_keywords = ["phòng mổ", "hậu phẫu", "phòng phẫu thuật", "khu mổ", "phòng pt"]
+    technical_room_keywords = [
+        "phòng xét nghiệm", "phòng chụp", "xq", "siêu âm", "kho dược",
+        "phòng xn", "labo", "phòng thí nghiệm", "nội soi", "cấp cứu", "hồi sức"
+    ]
+
+    office_keywords = [
+        "phòng nhân viên", "phòng giám đốc", "phòng họp", "phòng hành chính",
+        "văn phòng", "phòng ban", "phòng giao ban", "hội trường", "phòng kế toán"
+    ]
+
+    num_wc_tasks = _count_tasks_with_keywords(tasks, wc_keywords)
+    num_hallway_tasks = _count_tasks_with_keywords(tasks, hallway_keywords)
+    num_lobby_tasks = _count_tasks_with_keywords(tasks, lobby_keywords)
+    num_outdoor_tasks = _count_tasks_with_keywords(tasks, outdoor_keywords)
+    num_elevator_tasks = _count_tasks_with_keywords(tasks, elevator_keywords)
+
+    # Count each medical task once: summing the per-group counts would
+    # double-count a task that mentions more than one medical room type.
+    medical_keywords = (
+        patient_room_keywords + clinic_room_keywords
+        + surgery_room_keywords + technical_room_keywords
+    )
+
+    num_medical_tasks_total = _count_tasks_with_keywords(tasks, medical_keywords)
+
+    num_indoor_room_tasks = _count_tasks_with_keywords(tasks, office_keywords)
+
+    # -----------------------------
+    # GROUP 3: RATIOS & DIVERSITY (reduced)
+    # -----------------------------
+    cleaning_ratio = num_cleaning_tasks / num_tasks if num_tasks > 0 else 0.0
+    trash_collection_ratio = num_trash_collection_tasks / num_tasks if num_tasks > 0 else 0.0
+    monitoring_ratio = num_monitoring_tasks / num_tasks if num_tasks > 0 else 0.0
+
+    area_counts = [
+        num_wc_tasks, num_hallway_tasks, num_lobby_tasks, num_outdoor_tasks, num_elevator_tasks,
+        num_medical_tasks_total, num_indoor_room_tasks
+    ]
+    area_diversity = sum(1 for c in area_counts if c > 0)
+
+    return {
+        # TASK COUNTS (7)
+        "num_tasks": num_tasks,
+        "num_cleaning_tasks": num_cleaning_tasks,
+        "num_trash_collection_tasks": num_trash_collection_tasks,
+        "num_monitoring_tasks": num_monitoring_tasks,
+        "num_deep_cleaning_tasks": num_deep_cleaning_tasks,
+        "num_support_tasks": num_support_tasks,
+        "num_other_tasks": num_other_tasks,
+
+        # AREA (reduced + aggregated) (7)
+        "num_wc_tasks": num_wc_tasks,
+        "num_hallway_tasks": num_hallway_tasks,
+        "num_lobby_tasks": num_lobby_tasks,
+        "num_outdoor_tasks": num_outdoor_tasks,
+        "num_elevator_tasks": num_elevator_tasks,
+        "num_medical_tasks_total": num_medical_tasks_total,
+        "num_indoor_room_tasks": num_indoor_room_tasks,
+
+        # RATIOS & DIVERSITY (4)
+        "cleaning_ratio": round(cleaning_ratio, 4),
+        "trash_collection_ratio": round(trash_collection_ratio, 4),
+        "monitoring_ratio": round(monitoring_ratio, 4),
+        "area_diversity": area_diversity,
+    }
+
+
+# =========================================================
+# 3) MAIN (APPLY TO EXCEL)
+# =========================================================
+if __name__ == "__main__":
+    print("=" * 100)
+    print("EXTRACT REDUCED KEYWORD FEATURES FROM TASK TEXT")
+    print("=" * 100)
+
+    input_file = "ket_qua_cong_viec_full.xlsx"
+    output_csv = "features_keywords_reduced.csv"
+    output_xlsx = "features_keywords_reduced.xlsx"
+
+    print(f"\n📂 Reading file {input_file} ...")
+    df = pd.read_excel(input_file)
+    print(f"✅ Read {len(df)} rows successfully")
+
+    print("\n🔗 Combining all_task_normal + all_task_dinhky ...")
+    df["all_tasks_combined"] = df["all_task_normal"].fillna("") + " ; " + df["all_task_dinhky"].fillna("")
+
+    print("\n⚙️  Extracting features ...")
+    features_list = []
+    for idx, row in df.iterrows():
+        if (idx + 1) % 50 == 0:
+            print(f"   Processing... {idx + 1}/{len(df)}")
+        feats = extract_keyword_features_reduced(row["all_tasks_combined"])
+        feats["ma_dia_diem"] = row.get("ma_dia_diem", None)
+        features_list.append(feats)
+
+    df_features = pd.DataFrame(features_list)
+
+    # Move ma_dia_diem to the front
+    cols = ["ma_dia_diem"] + [c for c in df_features.columns if c != "ma_dia_diem"]
+    df_features = df_features[cols]
+
+    print("\n✅ DONE. Shape:", df_features.shape)
+
+    print(f"\n💾 Save CSV: {output_csv}")
+    df_features.to_csv(output_csv, index=False, encoding="utf-8-sig")
+
+    print(f"💾 Save XLSX: {output_xlsx}")
+    df_features.to_excel(output_xlsx, index=False, engine="openpyxl")
+
+    print("\n📋 Sample:")
+    print(df_features.head(5).to_string(index=False))
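Review note: the ratio and diversity features in the hunk above guard against division by zero and count how many area groups have at least one task. The sketch below shows that logic in isolation. It assumes a hypothetical stand-in `count_tasks_with_keywords` that does case-insensitive substring matching on `;`-separated tasks; the real `_count_tasks_with_keywords` is defined earlier in the file and may behave differently. The group names and keyword lists are illustrative only.

```python
from typing import Dict, List


def count_tasks_with_keywords(tasks: List[str], keywords: List[str]) -> int:
    """Hypothetical stand-in: count tasks that contain at least one keyword."""
    return sum(1 for task in tasks if any(kw in task.lower() for kw in keywords))


def ratio_and_diversity(task_text: str, groups: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute per-group ratios and the number of groups with at least one task."""
    tasks = [t.strip() for t in task_text.split(";") if t.strip()]
    num_tasks = len(tasks)
    counts = {name: count_tasks_with_keywords(tasks, kws) for name, kws in groups.items()}

    # Same guard as the diff: a ratio is 0.0 when there are no tasks at all.
    feats: Dict[str, float] = {
        f"{name}_ratio": round(c / num_tasks, 4) if num_tasks > 0 else 0.0
        for name, c in counts.items()
    }
    # Diversity = how many groups were matched by at least one task.
    feats["area_diversity"] = sum(1 for c in counts.values() if c > 0)
    feats["num_tasks"] = num_tasks
    return feats


groups = {"wc": ["wc"], "hallway": ["hành lang"], "outdoor": ["ngoại cảnh"]}
example = ratio_and_diversity(
    "Lau sàn hành lang; Thu gom rác WC; Vệ sinh sảnh chính; Lau kính thang máy",
    groups,
)
empty = ratio_and_diversity("", groups)
```

Because one task can match several groups, the ratios here are not guaranteed to sum to 1; the same holds for the area counts in the diff.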
--- a/extract_shift_features.py
+++ b/extract_shift_features.py
@ -1,220 +0,0 @@
-"""
-EXTRACTION: SHIFT FEATURES FROM JSON FOR STAFF PREDICTION
-Extracts work-shift features used to predict staff headcount
-Created: January 5, 2026
-"""
-
-import pandas as pd
-import json
-from datetime import datetime
-from typing import Dict, List
-
-def parse_time_string(time_str):
-    """Parse time string to extract hours"""
-    if pd.isna(time_str) or time_str == 0:
-        return 0.0
-    
-    time_str = str(time_str)
-    
-    # Handle datetime format: "2025-01-01 22:00:00"
-    if '2025' in time_str or '2024' in time_str:
-        try:
-            dt = pd.to_datetime(time_str)
-            return dt.hour + dt.minute/60.0
-        except (ValueError, TypeError):
-            pass
-    
-    # Handle time format: "HH:MM:SS"
-    try:
-        parts = time_str.split(':')
-        if len(parts) >= 2:
-            hours = float(parts[0])
-            minutes = float(parts[1])
-            return hours + minutes/60.0
-    except ValueError:
-        pass
-    
-    return 0.0
-
-def extract_shift_features(json_file_path: str, output_excel_path: str):
-    """
-    Extract shift features from a JSON file and write them to an Excel file.
-    
-    Features:
-    - ma_dia_diem: building code
-    - loai_ca: shift type
-    - bat_dau: shift start time
-    - ket_thuc: shift end time
-    - tong_gio_lam: total working hours
-    - so_ca_cua_toa: number of shifts recorded for the building in the file
-    - so_luong: target variable, the number of staff in the shift (to predict)
-    
-    Args:
-        json_file_path: path to the input JSON file
-        output_excel_path: path of the output Excel file
-    """
-    
-    print("=" * 80)
-    print("🚀 EXTRACTING SHIFT FEATURES FROM JSON")
-    print("=" * 80)
-    
-    # 1. Read the JSON file
-    print(f"\n📂 Reading file: {json_file_path}")
-    with open(json_file_path, 'r', encoding='utf-8') as f:
-        data = json.load(f)
-    
-    print(f"✅ Read successfully: {len(data)} records")
-    
-    # 2. Convert to a DataFrame
-    df = pd.DataFrame(data)
-    
-    print(f"\n📊 Raw data structure:")
-    print(f"   - Rows: {len(df)}")
-    print(f"   - Columns: {len(df.columns)}")
-    print(f"   - Column names: {list(df.columns)}")
-    
-    # 3. Count the shifts of each building
-    print(f"\n🔢 Counting shifts per building...")
-    shift_counts = df['Mã địa điểm'].value_counts().to_dict()
-    
-    # Shift-count statistics
-    unique_buildings = len(shift_counts)
-    total_shifts = sum(shift_counts.values())
-    avg_shifts = total_shifts / unique_buildings if unique_buildings > 0 else 0
-    
-    print(f"   - Unique buildings: {unique_buildings}")
-    print(f"   - Total shifts: {total_shifts}")
-    print(f"   - Average shifts per building: {avg_shifts:.2f}")
-    
-    # 4. Build the features DataFrame
-    print(f"\n🔧 Extracting features...")
-    
-    features_data = []
-    
-    for idx, row in df.iterrows():
-        ma_dia_diem = row['Mã địa điểm']
-        
-        # Feature: number of shifts for this building
-        so_ca_cua_toa = shift_counts.get(ma_dia_diem, 0)
-        
-        # Parse time strings
-        bat_dau_str = str(row['Bắt đầu'])
-        ket_thuc_str = str(row['Kết thúc'])
-        tong_gio_lam_str = str(row['Tổng giờ làm'])
-        
-        # Parse total working hours
-        try:
-            if 'day' in tong_gio_lam_str:
-                # Format: "7 days, 12:00:00"
-                parts = tong_gio_lam_str.split(',')
-                days = int(parts[0].split()[0])
-                time_parts = parts[1].strip().split(':')
-                hours = days * 24 + int(time_parts[0])
-                minutes = int(time_parts[1])
-                tong_gio_lam = hours + minutes/60.0
-            else:
-                # Format: "8:00:00"
-                time_parts = tong_gio_lam_str.split(':')
-                tong_gio_lam = float(time_parts[0]) + float(time_parts[1])/60.0
-        except (ValueError, IndexError):
-            # Unparseable durations fall back to 0.0 hours
-            tong_gio_lam = 0.0
-        
-        # Target variable
-        so_luong = row['Số lượng']
-        
-        # Append to the list
-        features_data.append({
-            'ma_dia_diem': ma_dia_diem,
-            'loai_ca': row['Loại ca'],
-            'bat_dau': bat_dau_str,
-            'ket_thuc': ket_thuc_str,
-            'tong_gio_lam': round(tong_gio_lam, 2),
-            'so_ca_cua_toa': so_ca_cua_toa,
-            'so_luong': so_luong
-        })
-    
-    # 5. Create the DataFrame
-    features_df = pd.DataFrame(features_data)
-    
-    print(f"✅ Extracted {len(features_df)} shifts successfully")
-    
-    # 6. Statistics
-    print(f"\n📈 FEATURE STATISTICS:")
-    print(f"\n   🏢 Buildings:")
-    print(f"      - Unique buildings: {features_df['ma_dia_diem'].nunique()}")
-    
-    print(f"\n   🕐 Shift types:")
-    for loai_ca, count in features_df['loai_ca'].value_counts().items():
-        print(f"      - {loai_ca}: {count} shifts ({count/len(features_df)*100:.1f}%)")
-    
-    print(f"\n   ⏱️ Total working hours:")
-    print(f"      - Min: {features_df['tong_gio_lam'].min():.2f} hours")
-    print(f"      - Mean: {features_df['tong_gio_lam'].mean():.2f} hours")
-    print(f"      - Max: {features_df['tong_gio_lam'].max():.2f} hours")
-    
-    print(f"\n   📊 Shifts per building:")
-    print(f"      - Min: {features_df['so_ca_cua_toa'].min()} shifts")
-    print(f"      - Mean: {features_df['so_ca_cua_toa'].mean():.2f} shifts")
-    print(f"      - Max: {features_df['so_ca_cua_toa'].max()} shifts")
-    
-    print(f"\n   👥 Staff count (TARGET):")
-    print(f"      - Min: {features_df['so_luong'].min()} people")
-    print(f"      - Mean: {features_df['so_luong'].mean():.2f} people")
-    print(f"      - Median: {features_df['so_luong'].median():.0f} people")
-    print(f"      - Max: {features_df['so_luong'].max()} people")
-    
-    # Top 5 shifts with the most staff
-    print(f"\n   🏆 TOP 5 SHIFTS WITH THE MOST STAFF:")
-    top5 = features_df.nlargest(5, 'so_luong')[['ma_dia_diem', 'loai_ca', 'tong_gio_lam', 'so_luong']]
-    for idx, row in top5.iterrows():
-        print(f"      - {row['ma_dia_diem']}: {row['loai_ca']} ({row['tong_gio_lam']:.1f}h) → {row['so_luong']} people")
-    
-    # 7. Export to Excel
-    print(f"\n💾 Writing Excel file...")
-    features_df.to_excel(output_excel_path, index=False, engine='openpyxl')
-    
-    print(f"✅ Created file: {output_excel_path}")
-    print(f"   - Rows: {len(features_df)}")
-    print(f"   - Columns: {len(features_df.columns)}")
-    
-    # 8. Also write a CSV backup
-    csv_path = output_excel_path.replace('.xlsx', '.csv')
-    features_df.to_csv(csv_path, index=False, encoding='utf-8-sig')
-    print(f"✅ Created CSV backup: {csv_path}")
-    
-    # 9. Summary statistics
-    print(f"\n📊 SUMMARY STATISTICS:")
-    print(features_df.describe())
-    
-    print(f"\n" + "=" * 80)
-    print("✅ DONE!")
-    print("=" * 80)
-    
-    return features_df
-
-
-if __name__ == "__main__":
-    # Paths
-    json_file = "Link LLV 2025.json"
-    output_excel = "shift_features_for_prediction.xlsx"
-    
-    # Run extraction
-    df = extract_shift_features(json_file, output_excel)
-    
-    print(f"\n📋 SAMPLE DATA (first 10 rows):")
-    print(df.head(10).to_string())
-    
-    print(f"\n🎯 FILES CREATED:")
-    print(f"   1. {output_excel} - main Excel file (for opening and viewing)")
-    print(f"   2. shift_features_for_prediction.csv - CSV backup file")
-    
-    print(f"\n🎯 FILE STRUCTURE:")
-    print(f"   - ma_dia_diem: building code")
-    print(f"   - loai_ca: shift type (Hành chính, Ca sáng, Ca chiều, ...)")
-    print(f"   - bat_dau: shift start time")
-    print(f"   - ket_thuc: shift end time")
-    print(f"   - tong_gio_lam: total working hours")
-    print(f"   - so_ca_cua_toa: number of shifts for the building (feature)")
-    print(f"   - so_luong: staff count (TARGET VARIABLE)")
-    
-    print(f"\n✨ READY FOR MODELING!")
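Review note: the removed script parses `Tổng giờ làm` values such as `"8:00:00"` and `"7 days, 12:00:00"` by splitting strings inside a broad `try` block, and it silently maps any failure to `0.0`. The sketch below is one alternative, not the author's method: a single regular expression covering both formats, returning `None` for unparseable input so that bad rows can be told apart from genuine zero-hour shifts. The name `duration_to_hours` is hypothetical, and the accepted formats are assumed from the comments in the diff.

```python
import re
from typing import Optional

# Optional "N day(s)," prefix followed by H:MM and optional :SS.
_DURATION_RE = re.compile(
    r"^\s*(?:(?P<days>\d+)\s+days?,?\s+)?"
    r"(?P<h>\d+):(?P<m>\d{1,2})(?::(?P<s>\d{1,2}(?:\.\d+)?))?\s*$"
)


def duration_to_hours(value: object) -> Optional[float]:
    """Convert 'H:MM[:SS]' or 'N day(s), H:MM[:SS]' to hours; None if unparseable."""
    match = _DURATION_RE.match(str(value))
    if match is None:
        return None
    days = int(match.group("days") or 0)
    hours = int(match.group("h"))
    minutes = int(match.group("m"))
    seconds = float(match.group("s") or 0.0)
    return days * 24 + hours + minutes / 60.0 + seconds / 3600.0


plain = duration_to_hours("8:00:00")
with_days = duration_to_hours("7 days, 12:00:00")
bad = duration_to_hours("n/a")
```

A caller that wants the original behavior can write `duration_to_hours(x) or 0.0`, while a validation step can count the `None` results first.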
--- a/extract_text_features.py
+++ b/extract_text_features.py
@ -1,500 +0,0 @@
-"""
-Text Feature Extraction Pipeline for Staff Prediction Model
-=============================================================
-
-This script extracts TF-IDF and SVD features from Vietnamese task descriptions.
-Can be used for both training (fit_transform) and inference (transform).
-
-Usage:
------
-Training mode:
-    python extract_text_features.py --mode train --input data.xlsx --output features.csv
-
-Inference mode:
-    python extract_text_features.py --mode predict --input new_data.xlsx --output predictions.csv
-
-As a module:
-    from extract_text_features import TextFeatureExtractor
-    extractor = TextFeatureExtractor()
-    features = extractor.fit_transform(texts)
-
-Author: ML Team
-Date: 2026-01-06
-"""
-
-import pandas as pd
-import numpy as np
-import re
-import pickle
-import argparse
-import os
-from typing import List, Tuple, Optional, Union
-from sklearn.feature_extraction.text import TfidfVectorizer
-from sklearn.decomposition import TruncatedSVD
-
-
-class TextFeatureExtractor:
-    """
-    Extract TF-IDF and SVD features from Vietnamese text.
-    
-    Attributes:
-        max_features (int): Maximum number of TF-IDF features
-        n_components (int): Number of SVD components
-        ngram_range (tuple): N-gram range for TF-IDF
-        min_df (int): Minimum document frequency
-        max_df (float): Maximum document frequency
-        tfidf (TfidfVectorizer): Fitted TF-IDF vectorizer
-        svd (TruncatedSVD): Fitted SVD model
-        is_fitted (bool): Whether the extractor has been fitted
-    """
-    
-    def __init__(
-        self,
-        max_features: int = 200,
-        n_components: int = 50,
-        ngram_range: Tuple[int, int] = (1, 2),
-        min_df: int = 2,
-        max_df: float = 0.95,
-        random_state: int = 42
-    ):
-        """
-        Initialize the TextFeatureExtractor.
-        
-        Args:
-            max_features: Maximum number of TF-IDF features (default: 200)
-            n_components: Number of SVD components (default: 50)
-            ngram_range: N-gram range for TF-IDF (default: (1, 2))
-            min_df: Minimum document frequency (default: 2)
-            max_df: Maximum document frequency (default: 0.95)
-            random_state: Random seed for reproducibility (default: 42)
-        """
-        self.max_features = max_features
-        self.n_components = n_components
-        self.ngram_range = ngram_range
-        self.min_df = min_df
-        self.max_df = max_df
-        self.random_state = random_state
-        
-        # Initialize models
-        self.tfidf = TfidfVectorizer(
-            max_features=max_features,
-            ngram_range=ngram_range,
-            min_df=min_df,
-            max_df=max_df,
-            sublinear_tf=True,
-            strip_accents=None
-        )
-        
-        self.svd = TruncatedSVD(
-            n_components=n_components,
-            random_state=random_state
-        )
-        
-        self.is_fitted = False
-        
-    @staticmethod
-    def preprocess_text(text: Union[str, float, None]) -> str:
-        """
-        Preprocess Vietnamese text.
-        
-        Args:
-            text: Input text (can be str, float, or None)
-            
-        Returns:
-            Cleaned text string
-        """
-        if pd.isna(text) or text is None or str(text).strip() == '':
-            return ''
-        
-        text = str(text).lower()
-        
-        # Keep Vietnamese characters, numbers, spaces
-        text = re.sub(
-            r'[^a-zàáạảãâầấậẩẫăằắặẳẵèéẹẻẽêềếệểễìíịỉĩòóọỏõôồốộổỗơờớợởỡùúụủũưừứựửữỳýỵỷỹđ0-9\s]',
-            ' ',
-            text
-        )
-        
-        # Remove multiple spaces
-        text = re.sub(r'\s+', ' ', text).strip()
-        
-        return text
-    
-    def preprocess_texts(self, texts: List[str]) -> List[str]:
-        """
-        Preprocess a list of texts.
-        
-        Args:
-            texts: List of text strings
-            
-        Returns:
-            List of cleaned text strings
-        """
-        return [self.preprocess_text(text) for text in texts]
-    
-    def combine_task_columns(
-        self,
-        task_normal: pd.Series,
-        task_dinhky: pd.Series
-    ) -> List[str]:
-        """
-        Combine two task columns into one.
-        
-        Args:
-            task_normal: Series of normal task descriptions
-            task_dinhky: Series of scheduled task descriptions
-            
-        Returns:
-            List of combined and cleaned texts
-        """
-        # Preprocess both columns
-        normal_clean = task_normal.apply(self.preprocess_text)
-        dinhky_clean = task_dinhky.apply(self.preprocess_text)
-        
-        # Combine
-        combined = (normal_clean + ' ' + dinhky_clean).str.strip()
-        
-        return combined.tolist()
-    
-    def fit(self, texts: List[str]) -> 'TextFeatureExtractor':
-        """
-        Fit the TF-IDF and SVD models on training texts.
-        
-        Args:
-            texts: List of text strings
-            
-        Returns:
-            self
-        """
-        print(f"Fitting TF-IDF on {len(texts)} documents...")
-        
-        # Preprocess
-        texts_clean = self.preprocess_texts(texts)
-        
-        # Fit TF-IDF
-        tfidf_matrix = self.tfidf.fit_transform(texts_clean)
-        
-        print(f"  TF-IDF shape: {tfidf_matrix.shape}")
-        print(f"  Sparsity: {(1.0 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")
-        
-        # Fit SVD
-        print(f"\nFitting SVD ({self.n_components} components)...")
-        self.svd.fit(tfidf_matrix)
-        
-        print(f"  Explained variance: {self.svd.explained_variance_ratio_.sum()*100:.2f}%")
-        
-        self.is_fitted = True
-        print("\n✅ Fitting complete!")
-        
-        return self
-    
-    def transform(self, texts: List[str]) -> np.ndarray:
-        """
-        Transform texts to SVD features.
-        
-        Args:
-            texts: List of text strings
-            
-        Returns:
-            Array of SVD features (n_samples, n_components)
-        """
-        if not self.is_fitted:
-            raise ValueError("Extractor must be fitted before transform. Call fit() first.")
-        
-        # Preprocess
-        texts_clean = self.preprocess_texts(texts)
-        
-        # TF-IDF transform
-        tfidf_matrix = self.tfidf.transform(texts_clean)
-        
-        # SVD transform
-        svd_features = self.svd.transform(tfidf_matrix)
-        
-        return svd_features
-    
-    def fit_transform(self, texts: List[str]) -> np.ndarray:
-        """
-        Fit and transform in one step.
-        
-        Args:
-            texts: List of text strings
-            
-        Returns:
-            Array of SVD features (n_samples, n_components)
-        """
-        self.fit(texts)
-        return self.transform(texts)
-    
-    def get_feature_names(self) -> List[str]:
-        """
-        Get feature names for the SVD components.
-        
-        Returns:
-            List of feature names
-        """
-        return [f'text_svd_{i+1}' for i in range(self.n_components)]
-    
-    def get_top_tfidf_features(self, top_n: int = 20) -> pd.DataFrame:
-        """
-        Get top TF-IDF features by document frequency.
-        
-        Args:
-            top_n: Number of top features to return
-            
-        Returns:
-            DataFrame with feature names and document frequencies
-        """
-        if not self.is_fitted:
-            raise ValueError("Extractor must be fitted first.")
-        
-        feature_names = self.tfidf.get_feature_names_out()
-        # Recover the smoothed document-frequency ratio (1 + df) / (1 + n_docs)
-        # from idf_ = ln((1 + n_docs) / (1 + df)) + 1 (sklearn default smooth_idf=True).
-        doc_freq = np.exp(1.0 - self.tfidf.idf_)
-        
-        top_features = pd.DataFrame({
-            'feature': feature_names,
-            'doc_frequency': doc_freq
-        }).sort_values('doc_frequency', ascending=False).head(top_n)
-        
-        return top_features
-    
-    def save(self, filepath: str):
-        """
-        Save the fitted extractor to disk.
-        
-        Args:
-            filepath: Path to save the extractor (should end with .pkl)
-        """
-        if not self.is_fitted:
-            raise ValueError("Extractor must be fitted before saving.")
-        
-        with open(filepath, 'wb') as f:
-            pickle.dump(self, f)
-        
-        print(f"✅ Saved extractor to: {filepath}")
-    
-    @staticmethod
-    def load(filepath: str) -> 'TextFeatureExtractor':
-        """
-        Load a fitted extractor from disk.
-        
-        Args:
-            filepath: Path to the saved extractor
-            
-        Returns:
-            Loaded TextFeatureExtractor
-        """
-        with open(filepath, 'rb') as f:
-            extractor = pickle.load(f)
-        
-        print(f"✅ Loaded extractor from: {filepath}")
-        return extractor
-    
-    def get_summary(self) -> dict:
-        """
-        Get summary statistics of the extractor.
-        
-        Returns:
-            Dictionary with summary information
-        """
-        if not self.is_fitted:
-            return {'status': 'not_fitted'}
-        
-        return {
-            'status': 'fitted',
-            'max_features': self.max_features,
-            'n_components': self.n_components,
-            'ngram_range': self.ngram_range,
-            'min_df': self.min_df,
-            'max_df': self.max_df,
-            'actual_tfidf_features': len(self.tfidf.get_feature_names_out()),
-            'explained_variance': float(self.svd.explained_variance_ratio_.sum()),
-            'random_state': self.random_state
-        }
-
-
-def extract_features_from_dataframe(
-    df: pd.DataFrame,
-    text_columns: List[str] = ['all_task_normal', 'all_task_dinhky'],
-    extractor: Optional[TextFeatureExtractor] = None,
-    fit: bool = True
-) -> Tuple[pd.DataFrame, TextFeatureExtractor]:
-    """
-    Extract text features from a DataFrame.
-    
-    Args:
-        df: Input DataFrame
-        text_columns: List of text column names to combine
-        extractor: Pre-fitted extractor (optional, for inference)
-        fit: Whether to fit the extractor (True for training, False for inference)
-        
-    Returns:
-        Tuple of (features_df, extractor)
-    """
-    print("=" * 80)
-    print("TEXT FEATURE EXTRACTION")
-    print("=" * 80)
-    
-    # Combine text columns
-    if len(text_columns) == 1:
-        texts = df[text_columns[0]].tolist()
-    else:
-        print(f"\nCombining {len(text_columns)} text columns...")
-        texts = []
-        for _, row in df.iterrows():
-            combined = ' '.join([str(row[col]) if pd.notna(row[col]) else '' for col in text_columns])
-            texts.append(combined)
-    
-    print(f"Total documents: {len(texts)}")
-    
-    # Initialize or use existing extractor
-    if extractor is None:
-        print("\nInitializing new TextFeatureExtractor...")
-        extractor = TextFeatureExtractor()
-    
-    # Extract features
-    if fit:
-        print("\nMode: TRAINING (fit_transform)")
-        features = extractor.fit_transform(texts)
-    else:
-        print("\nMode: INFERENCE (transform)")
-        features = extractor.transform(texts)
-    
-    # Create DataFrame
-    feature_names = extractor.get_feature_names()
-    features_df = pd.DataFrame(features, columns=feature_names)
-    
-    print(f"\n✅ Extraction complete!")
-    print(f"   Output shape: {features_df.shape}")
-    print(f"   Feature names: {feature_names[:5]}... (showing first 5)")
-    
-    # Summary
-    summary = extractor.get_summary()
-    print(f"\n📊 Extractor Summary:")
-    for key, value in summary.items():
-        print(f"   {key}: {value}")
-    
-    return features_df, extractor
-
-
-def main():
-    """Command-line interface for text feature extraction."""
-    parser = argparse.ArgumentParser(
-        description='Extract TF-IDF and SVD features from Vietnamese task descriptions'
-    )
-    
-    parser.add_argument(
-        '--mode',
-        type=str,
-        choices=['train', 'predict'],
-        required=True,
-        help='Mode: train (fit and save) or predict (load and transform)'
-    )
-    
-    parser.add_argument(
-        '--input',
-        type=str,
-        required=True,
-        help='Input file path (Excel or CSV)'
-    )
-    
-    parser.add_argument(
-        '--output',
-        type=str,
-        required=True,
-        help='Output file path for features (CSV)'
-    )
-    
-    parser.add_argument(
-        '--text-columns',
-        type=str,
-        nargs='+',
-        default=['all_task_normal', 'all_task_dinhky'],
-        help='Text column names to combine (default: all_task_normal all_task_dinhky)'
-    )
-    
-    parser.add_argument(
-        '--extractor-path',
-        type=str,
-        default='text_feature_extractor.pkl',
-        help='Path to save/load the extractor (default: text_feature_extractor.pkl)'
-    )
-    
-    parser.add_argument(
-        '--max-features',
-        type=int,
-        default=200,
-        help='Maximum TF-IDF features (default: 200)'
-    )
-    
-    parser.add_argument(
-        '--n-components',
-        type=int,
-        default=50,
-        help='Number of SVD components (default: 50)'
-    )
-    
-    args = parser.parse_args()
-    
-    # Load data
-    print(f"\n📂 Loading data from: {args.input}")
-    if args.input.endswith('.xlsx'):
-        df = pd.read_excel(args.input)
-    elif args.input.endswith('.csv'):
-        df = pd.read_csv(args.input)
-    else:
-        raise ValueError("Input file must be .xlsx or .csv")
-    
-    print(f"   Shape: {df.shape}")
-    
-    # Check columns
-    missing_cols = [col for col in args.text_columns if col not in df.columns]
-    if missing_cols:
-        raise ValueError(f"Missing columns in input data: {missing_cols}")
-    
-    # Extract features
-    if args.mode == 'train':
-        # Training mode: fit and save
-        features_df, extractor = extract_features_from_dataframe(
-            df,
-            text_columns=args.text_columns,
-            extractor=TextFeatureExtractor(
-                max_features=args.max_features,
-                n_components=args.n_components
-            ),
-            fit=True
-        )
-        
-        # Save extractor
-        extractor.save(args.extractor_path)
-        
-    else:  # predict mode
-        # Inference mode: load and transform
-        if not os.path.exists(args.extractor_path):
-            raise FileNotFoundError(f"Extractor not found: {args.extractor_path}")
-        
-        extractor = TextFeatureExtractor.load(args.extractor_path)
-        
-        features_df, _ = extract_features_from_dataframe(
-            df,
-            text_columns=args.text_columns,
-            extractor=extractor,
-            fit=False
-        )
-    
-    # Save features
-    features_df.to_csv(args.output, index=False)
-    print(f"\n✅ Saved features to: {args.output}")
-    
-    # Show top TF-IDF features (training mode only)
-    if args.mode == 'train':
-        print("\n📋 Top 20 TF-IDF features:")
-        top_features = extractor.get_top_tfidf_features(top_n=20)
-        print(top_features.to_string(index=False))
-    
-    print("\n" + "=" * 80)
-    print("✅ COMPLETE!")
-    print("=" * 80)
-
-
-if __name__ == '__main__':
-    main()
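Review note: `preprocess_text` in the removed file keeps only characters from a whitelist of precomposed Vietnamese letters. If the input arrives in decomposed Unicode form (NFD, base letter plus combining marks), the combining marks fall outside that whitelist and are replaced by spaces, which splits words. The sketch below shows the same whitelist idea with an NFC normalization step added first. This is an assumption about how the issue could be handled, not something the commit does, and `normalize_vietnamese` is a hypothetical name.

```python
import re
import unicodedata

# Lowercase Vietnamese letters, normalized to NFC so the character class is
# made of single precomposed code points regardless of how this file is saved.
_VI_LETTERS = unicodedata.normalize(
    "NFC",
    "àáạảãâầấậẩẫăằắặẳẵèéẹẻẽêềếệểễìíịỉĩòóọỏõôồốộổỗơờớợởỡùúụủũưừứựửữỳýỵỷỹđ",
)
_DROP = re.compile(f"[^a-z{_VI_LETTERS}0-9\\s]")


def normalize_vietnamese(text: str) -> str:
    """Compose to NFC, lowercase, drop non-whitelisted characters, collapse spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = _DROP.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


composed = unicodedata.normalize("NFC", "Lau sàn hành lang; Thu gom rác WC!")
decomposed = unicodedata.normalize("NFD", composed)
clean_composed = normalize_vietnamese(composed)
clean_decomposed = normalize_vietnamese(decomposed)
```

With the normalization step both spellings of the same text produce identical tokens, so the TF-IDF vocabulary is not fragmented by encoding differences between data sources.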
--- a/final.ipynb
+++ b/final.ipynb
--- a/input.json
+++ b/input.json
@ -0,0 +1,31 @@
+{
+  "ma_dia_diem": "TD-001",
+  "loai_ca": "Hành chính",
+
+  "bat_dau": "07:00:00", 
+  "ket_thuc": "15:00:00",
+  "tong_gio_lam": 8.0,
+
+  "so_ca_cua_toa": 3,
+
+  "all_task_normal": "Lau sàn hành lang; Thu gom rác WC; Vệ sinh sảnh chính; Lau kính thang máy",
+  "all_task_dinhky": "Cọ bồn cầu WC tầng 2; Đánh sàn lobby; Trực phát sinh",
+
+  "so_tang": 12,
+  "so_cua_thang_may": 4,
+
+  "dien_tich_ngoai_canh": 350.0,
+  "dien_tich_sanh": 220.0,
+  "dien_tich_hanh_lang": 1800.0,
+  "dien_tich_wc": 420.0,
+  "dien_tich_phong": 2600.0,
+  "dien_tich_tham": 800.0,
+  "dien_tich_kinh": 560.0,
+
+  "doc_ham": 1,
+  "vien_phan_quang": 0,
+  "op_tuong": 1,
+  "op_chan_tuong": 1,
+  "ranh_thoat_nuoc": 1
+
+}
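Review note: in the sample payload above, `tong_gio_lam` (8.0) equals the span between `bat_dau` (07:00:00) and `ket_thuc` (15:00:00). A minimal consistency check is sketched below. It assumes times are always `HH:MM:SS`, that a shift crossing midnight wraps at most once, and that breaks are not subtracted from `tong_gio_lam`; the commit confirms none of these. `shift_hours` is a hypothetical helper name.

```python
from datetime import datetime


def shift_hours(start: str, end: str) -> float:
    """Hours between two HH:MM:SS clock times, wrapping once past midnight."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours = delta.total_seconds() / 3600.0
    # A negative span means the shift ends on the following day.
    return hours + 24.0 if hours < 0 else hours


day_shift = shift_hours("07:00:00", "15:00:00")
night_shift = shift_hours("22:00:00", "06:00:00")
matches_payload = abs(day_shift - 8.0) < 1e-9
```

Identical start and end times yield 0.0 here; whether that should instead mean a 24-hour shift depends on the data and would need to be decided by the pipeline owner.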
--- a/input_explained.jsonc
+++ b/input_explained.jsonc
@ -0,0 +1,72 @@
+{
+  // ========================================
+  // BUILDING & SHIFT INFORMATION
+  // ========================================
+  
+  "ma_dia_diem": "TD-001",              // Building identifier
+  "loai_ca": "Hành chính",              // Shift type: Hành chính / Ca sáng / Ca chiều / Ca tối / Ca đêm (office hours / morning / afternoon / evening / night)
+
+  // Shift times
+  "bat_dau": "07:00:00",                // Shift start time (HH:MM:SS)
+  "ket_thuc": "15:00:00",               // Shift end time (HH:MM:SS)
+  "tong_gio_lam": 8.0,                  // Total working hours in the shift
+
+  "so_ca_cua_toa": 3,                   // Total number of shifts per day for the building
+
+  // ========================================
+  // TASKS (TEXT FEATURES INPUT)
+  // ========================================
+  
+  // ⭐ all_task_normal: DAILY tasks
+  // - Tasks performed EVERY DAY during this shift
+  // - Examples: mopping floors, collecting trash, cleaning WCs (done every day)
+  // - Format: tasks separated by semicolons (;)
+  "all_task_normal": "Lau sàn hành lang; Thu gom rác WC; Vệ sinh sảnh chính; Lau kính thang máy",
+  
+  // ⭐ all_task_dinhky: PERIODIC tasks (WEEKLY / MONTHLY)
+  // - Tasks performed on a WEEKLY or MONTHLY schedule
+  // - Examples: scrubbing toilets (once a week), buffing floors (once a month), on-call duty for ad hoc requests
+  // - Format: tasks separated by semicolons (;)
+  "all_task_dinhky": "Cọ bồn cầu WC tầng 2; Đánh sàn lobby; Trực phát sinh",
+
+  // ========================================
+  // PHYSICAL CHARACTERISTICS OF THE BUILDING
+  // ========================================
+  
+  "so_tang": 12,                        // Total number of floors in the building
+  "so_cua_thang_may": 4,                // Number of elevator doors (total across all floors)
+
+  // ========================================
+  // AREAS (m²): TOTAL ACROSS ALL FLOORS
+  // ========================================
+  // ⚠️⚠️⚠️ IMPORTANT ⚠️⚠️⚠️
+  // Every area below is the TOTAL across ALL floors, NOT the area of a single floor!
+  // 
+  // Example: 
+  //   - The building has 12 floors
+  //   - Each floor has 150 m² of hallway
+  //   - → dien_tich_hanh_lang = 12 × 150 = 1800 m² (NOT 150 m²!)
+  // 
+  // Formula (when every floor has the same area): dien_tich_X = so_tang × dien_tich_X_moi_tang (area of X per floor)
+  
+  "dien_tich_ngoai_canh": 350.0,       // Total outdoor area (yard, sidewalk, grounds, parking lot)
+  "dien_tich_sanh": 220.0,              // Total lobby area (all floors)
+  "dien_tich_hanh_lang": 1800.0,       // Total hallway and walkway area (all floors)
+  "dien_tich_wc": 420.0,                // Total WC / toilet / restroom area (all floors)
+  "dien_tich_phong": 2600.0,            // Total room area (offices, meeting rooms, patient rooms, etc.)
+  "dien_tich_tham": 800.0,              // Total carpeted floor area (vacuumed instead of mopped)
+  "dien_tich_kinh": 560.0,              // Total glass area to clean (glass doors, glass partitions, facade)
+
+  // ========================================
+  // SURFACE & STRUCTURAL CHARACTERISTICS
+  // ========================================
+  // Values: 1 = yes, 0 = no
+  // These characteristics affect workload and cleaning time
+  
+  "doc_ham": 1,                         // Whether there is a basement ramp (basement level); adds cleaning work
+  "vien_phan_quang": 0,                 // Whether there is reflective edging; needs more careful cleaning
+  "op_tuong": 1,                        // Whether walls are clad (tile, panels); easier to clean than painted walls
+  "op_chan_tuong": 1,                   // Whether there is skirting (baseboard cladding); the base of the wall needs extra wiping
+  "ranh_thoat_nuoc": 1                  // Whether there are drainage channels; the channels need cleaning
+
+}
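Review note: `input_explained.jsonc` uses `//` comments, which Python's `json` module rejects. If the annotated file is ever meant to be loaded directly, the comments have to be stripped first. The sketch below is a minimal approach under stated assumptions: only `//` line comments are handled (no `/* */` blocks, no trailing commas), and `strip_line_comments` is a hypothetical helper, not part of this commit. The `url_like` key in the sample is invented to show that `//` inside a string is preserved.

```python
import json


def strip_line_comments(jsonc_text: str) -> str:
    """Remove // comments that occur outside JSON string literals."""
    out = []
    for line in jsonc_text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i                  # comment starts here; drop the rest
                break
        out.append(line[:cut])
    return "\n".join(out)


sample = '''{
  // building id
  "ma_dia_diem": "TD-001",   // comment after a value
  "url_like": "a//b",
  "so_tang": 12
}'''
record = json.loads(strip_line_comments(sample))
```

Tracking string state per line is sufficient here because JSON strings cannot contain raw newlines.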
--- a/merge_all_features.py
+++ b/merge_all_features.py
@ -1,270 +0,0 @@
-"""
-MERGE 3 FILES: SHIFT + TASK + BUILDING FEATURES
-Merges 3 Excel files into a single file used to predict staff headcount
-Created: January 5, 2026
-"""
-
-import pandas as pd
-import numpy as np
-
-def merge_three_datasets(
-    shift_file: str,
-    task_file: str, 
-    building_file: str,
-    output_excel: str,
-    output_csv: str
-):
-    """
-    Merge 3 datasets:
-    1. shift_features_for_prediction.xlsx (942 rows × 7 cols)
-    2. ket_qua_cong_viec_full_WITH_FEATURES.xlsx (302 rows × 28 cols)
-    3. Du_Lieu_Toa_Nha_Aggregate.xlsx (233 rows × 18 cols)
-    
-    Join key: ma_dia_diem (Mã địa điểm)
-    Join type: LEFT JOIN (keep all shifts)
-    
-    Expected output: 942 rows × (7 + 25 + 17) = 49 cols
-    """
-    
-    print("=" * 80)
-    print("🚀 MERGE 3 DATASETS: SHIFT + TASK + BUILDING FEATURES")
-    print("=" * 80)
-    
-    # =====================================================================
-    # 1. ĐỌC FILE 1: SHIFT FEATURES (base dataset)
-    # =====================================================================
-    print(f"\n📂 [1/3] Đọc file SHIFT features: {shift_file}")
-    df_shift = pd.read_excel(shift_file)
-    
-    print(f"   ✅ Shape: {df_shift.shape}")
-    print(f"   ✅ Columns: {list(df_shift.columns)}")
-    print(f"   ✅ Unique buildings: {df_shift['ma_dia_diem'].nunique()}")
-    
-    # =====================================================================
-    # 2. ĐỌC FILE 2: TASK FEATURES
-    # =====================================================================
-    print(f"\n📂 [2/3] Đọc file TASK features: {task_file}")
-    df_task = pd.read_excel(task_file)
-    
-    print(f"   ✅ Shape: {df_task.shape}")
-    print(f"   ✅ Unique buildings: {df_task['ma_dia_diem'].nunique()}")
-    
-    # Drop the raw text columns (not needed for modeling)
-    cols_to_drop = ['all_task_normal', 'all_task_dinhky']
-    df_task = df_task.drop(columns=[c for c in cols_to_drop if c in df_task.columns])
-    
-    print(f"   ✅ Dropped text columns, remaining: {df_task.shape[1]} columns")
-    print(f"   ✅ Task feature columns: {list(df_task.columns)}")
-    
-    # =====================================================================
-    # 3. READ FILE 3: BUILDING FEATURES
-    # =====================================================================
-    print(f"\n📂 [3/3] Reading BUILDING features file: {building_file}")
-    df_building = pd.read_excel(building_file)
-    
-    print(f"   ✅ Shape: {df_building.shape}")
-    print(f"   ✅ Unique buildings: {df_building['Mã địa điểm'].nunique()}")
-    
-    # Rename the key column to match the other files
-    df_building = df_building.rename(columns={'Mã địa điểm': 'ma_dia_diem'})
-    
-    # Rename columns for easier use (no diacritics, no spaces)
-    column_mapping = {
-        'Loại hình': 'loai_hinh',
-        'Tên Tòa Tháp': 'ten_toa_thap',
-        'Mức độ Lưu lượng KH': 'muc_do_luu_luong',
-        'Số tầng': 'so_tang',
-        'Tổng số cửa thang máy': 'so_cua_thang_may',
-        'Diện tích ngoại cảnh Tòa tháp (m2)': 'dien_tich_ngoai_canh',
-        'Sàn Sảnh (m2)': 'dien_tich_sanh',
-        'Sàn Hành lang (m2)': 'dien_tich_hanh_lang',
-        'Sàn WC (m2)': 'dien_tich_wc',
-        'Sàn Phòng (m2)': 'dien_tich_phong',
-        'Thảm (m2)': 'dien_tich_tham',
-        'Dốc hầm (m)': 'doc_ham',
-        'Viền phản quang (m)': 'vien_phan_quang',
-        'Ốp tường (m2)': 'op_tuong',
-        'Ốp chân tường (m2)': 'op_chan_tuong',
-        'Rãnh thoát nước (m)': 'ranh_thoat_nuoc',
-        'Kính (m2)': 'dien_tich_kinh'
-    }
-    df_building = df_building.rename(columns=column_mapping)
-    
-    print(f"   ✅ Renamed columns: {list(df_building.columns)}")
-    
-    # =====================================================================
-    # 4. MERGE DATASET 1 (SHIFT) + DATASET 2 (TASK)
-    # =====================================================================
-    print(f"\n🔗 [MERGE 1/2] Merge SHIFT + TASK features...")
-    print(f"   Join key: ma_dia_diem")
-    print(f"   Join type: LEFT (keep all shifts)")
-    
-    df_merged = df_shift.merge(
-        df_task,
-        on='ma_dia_diem',
-        how='left',
-        suffixes=('', '_task')
-    )
-    
-    print(f"   ✅ Result shape: {df_merged.shape}")
-    print(f"   ✅ Shift row count preserved: {len(df_merged)} (expected: {len(df_shift)})")
-    
-    # Check missing values after the merge
-    task_features = [col for col in df_task.columns if col != 'ma_dia_diem']
-    missing_task_features = df_merged[task_features].isna().sum().sum()
-    print(f"   ⚠️  Missing values in task features: {missing_task_features}")
-    
-    # =====================================================================
-    # 5. MERGE RESULT + DATASET 3 (BUILDING)
-    # =====================================================================
-    print(f"\n🔗 [MERGE 2/2] Merge (SHIFT+TASK) + BUILDING features...")
-    print(f"   Join key: ma_dia_diem")
-    print(f"   Join type: LEFT (keep all shifts)")
-    
-    df_final = df_merged.merge(
-        df_building,
-        on='ma_dia_diem',
-        how='left',
-        suffixes=('', '_building')
-    )
-    
-    print(f"   ✅ Result shape: {df_final.shape}")
-    print(f"   ✅ Shift row count preserved: {len(df_final)} (expected: {len(df_shift)})")
-    
-    # Check missing values after the merge
-    building_features = [col for col in df_building.columns if col != 'ma_dia_diem']
-    missing_building_features = df_final[building_features].isna().sum().sum()
-    print(f"   ⚠️  Missing values in building features: {missing_building_features}")
-    
-    # =====================================================================
-    # 6. FINAL STATISTICS
-    # =====================================================================
-    print(f"\n📊 FINAL DATASET STATISTICS:")
-    print(f"   📐 Shape: {df_final.shape}")
-    print(f"   🏢 Unique buildings: {df_final['ma_dia_diem'].nunique()}")
-    print(f"   📋 Total columns: {len(df_final.columns)}")
-    
-    print(f"\n   📋 COLUMN BREAKDOWN:")
-    print(f"      - Shift features: 7 cols")
-    print(f"      - Task features: {len(task_features)} cols")
-    print(f"      - Building features: {len(building_features)} cols")
-    print(f"      - Total: {7 + len(task_features) + len(building_features)} cols")
-    
-    # Check missing values across the whole dataset
-    print(f"\n   ⚠️  MISSING VALUES BY COLUMN:")
-    missing_summary = df_final.isna().sum()
-    missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)
-    
-    if len(missing_summary) > 0:
-        print(f"      Found {len(missing_summary)} columns with missing values:")
-        for col, count in missing_summary.head(20).items():
-            pct = count / len(df_final) * 100
-            print(f"      - {col}: {count} ({pct:.1f}%)")
-    else:
-        print(f"      ✅ No missing values!")
-    
-    # =====================================================================
-    # 7. DATA VALIDATION
-    # =====================================================================
-    print(f"\n✅ DATA VALIDATION:")
-    
-    # Check target variable
-    print(f"   🎯 Target variable (so_luong):")
-    print(f"      - Count: {df_final['so_luong'].notna().sum()}")
-    print(f"      - Missing: {df_final['so_luong'].isna().sum()}")
-    print(f"      - Min: {df_final['so_luong'].min()}")
-    print(f"      - Mean: {df_final['so_luong'].mean():.2f}")
-    print(f"      - Median: {df_final['so_luong'].median():.0f}")
-    print(f"      - Max: {df_final['so_luong'].max()}")
-    
-    # Check feature coverage
-    print(f"\n   📊 Feature coverage:")
-    shift_buildings = set(df_shift['ma_dia_diem'].unique())
-    task_buildings = set(df_task['ma_dia_diem'].unique())
-    building_buildings = set(df_building['ma_dia_diem'].unique())
-    
-    print(f"      - Shifts: {len(shift_buildings)} buildings")
-    print(f"      - Tasks: {len(task_buildings)} buildings")
-    print(f"      - Building info: {len(building_buildings)} buildings")
-    
-    # Overlap analysis
-    shift_with_task = shift_buildings.intersection(task_buildings)
-    shift_with_building = shift_buildings.intersection(building_buildings)
-    all_three = shift_buildings.intersection(task_buildings).intersection(building_buildings)
-    
-    print(f"\n   🔗 Overlap analysis:")
-    print(f"      - Shifts ∩ Tasks: {len(shift_with_task)} buildings")
-    print(f"      - Shifts ∩ Building: {len(shift_with_building)} buildings")
-    print(f"      - All three: {len(all_three)} buildings")
-    
-    shift_only = shift_buildings - task_buildings - building_buildings
-    if len(shift_only) > 0:
-        print(f"\n   ⚠️  Buildings with shift only (no task/building data): {len(shift_only)}")
-        print(f"      Examples: {list(shift_only)[:10]}")
-    
-    # =====================================================================
-    # 8. EXPORT FILES
-    # =====================================================================
-    print(f"\n💾 EXPORTING FILES...")
-    
-    # Excel
-    print(f"   [1/2] Exporting Excel: {output_excel}")
-    df_final.to_excel(output_excel, index=False, engine='openpyxl')
-    print(f"   ✅ Done!")
-    
-    # CSV
-    print(f"   [2/2] Exporting CSV: {output_csv}")
-    df_final.to_csv(output_csv, index=False, encoding='utf-8-sig')
-    print(f"   ✅ Done!")
-    
-    # =====================================================================
-    # 9. SUMMARY
-    # =====================================================================
-    print(f"\n" + "=" * 80)
-    print("✅ MERGE COMPLETED!")
-    print("=" * 80)
-    
-    print(f"\n📁 FILES CREATED:")
-    print(f"   1. {output_excel} ({df_final.shape[0]} rows × {df_final.shape[1]} columns)")
-    print(f"   2. {output_csv} (CSV backup)")
-    
-    print(f"\n📋 COLUMN STRUCTURE:")
-    print(f"   - ma_dia_diem (identifier)")
-    print(f"   - Shift features (5): loai_ca, bat_dau, ket_thuc, tong_gio_lam, so_ca_cua_toa")
-    print(f"   - Task features ({len(task_features)}): num_tasks, cleaning_ratio, ...")
-    print(f"   - Building features ({len(building_features)}): so_tang, dien_tich_*, ...")
-    print(f"   - so_luong (TARGET)")
-    
-    print(f"\n🎯 READY FOR MACHINE LEARNING!")
-    print(f"   - Total samples: {len(df_final)}")
-    print(f"   - Total features: {df_final.shape[1] - 2}  (excluding ma_dia_diem & target)")
-    print(f"   - Target variable: so_luong")
-    
-    return df_final
-
-
-if __name__ == "__main__":
-    # File paths
-    shift_file = "shift_features_for_prediction.xlsx"
-    task_file = "ket_qua_cong_viec_full_WITH_FEATURES.xlsx"
-    building_file = "Du_Lieu_Toa_Nha_Aggregate.xlsx"
-    
-    output_excel = "COMPLETE_DATASET_FOR_PREDICTION.xlsx"
-    output_csv = "COMPLETE_DATASET_FOR_PREDICTION.csv"
-    
-    # Run merge
-    df_final = merge_three_datasets(
-        shift_file,
-        task_file,
-        building_file,
-        output_excel,
-        output_csv
-    )
-    
-    # Display sample
-    print(f"\n📊 SAMPLE DATA (first 5 rows, first 15 columns):")
-    print(df_final.iloc[:5, :15].to_string())
-    
-    print(f"\n📊 COLUMN LIST:")
-    for i, col in enumerate(df_final.columns, 1):
-        print(f"   {i:2d}. {col}")
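The removed script above checks the row count after each left join only by printing it. A minimal sketch of letting pandas enforce the same expectation, assuming the task and building tables are meant to hold exactly one row per `ma_dia_diem`. The `safe_left_merge` helper and the toy frames are hypothetical and not part of the repository; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd


def safe_left_merge(left: pd.DataFrame, right: pd.DataFrame,
                    key: str = "ma_dia_diem") -> pd.DataFrame:
    """Left join that raises if `right` has duplicate keys (hypothetical helper)."""
    merged = left.merge(
        right,
        on=key,
        how="left",
        validate="many_to_one",  # raises pandas.errors.MergeError on duplicate right keys
        indicator=True,          # adds a "_merge" column: "both" or "left_only"
    )
    # Report shift rows whose building has no match on the right side.
    unmatched = merged.loc[merged["_merge"] == "left_only", key].unique()
    if len(unmatched) > 0:
        print(f"{len(unmatched)} key(s) without a match: {list(unmatched)}")
    return merged.drop(columns="_merge")


# Toy data (assumed values, for illustration only).
shifts = pd.DataFrame({"ma_dia_diem": ["55-1", "55-1", "236-1"],
                       "so_luong": [14, 12, 3]})
tasks = pd.DataFrame({"ma_dia_diem": ["55-1"], "num_tasks": [120]})

merged = safe_left_merge(shifts, tasks)
```

With `validate="many_to_one"`, a duplicated building code in the right-hand table fails immediately instead of silently multiplying shift rows.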
--- a/outputs/val_predictions_extratrees.csv
+++ b/outputs/val_predictions_extratrees.csv
@ -0,0 +1,69 @@
+ma_dia_diem,so_luong_thuc_te,so_luong_du_doan_raw,so_luong_du_doan_round,abs_error
+579-1,32.0,16.876830318931813,17,15.123169681068187
+114-1,29.0,16.242619100544022,16,12.757380899455978
+121-3,13.0,5.239388045466643,5,7.760611954533357
+227-1,1.0,7.99405810481789,8,6.99405810481789
+55-1,14.0,7.210027162489984,7,6.789972837510016
+55-1,12.0,5.706020245135266,6,6.293979754864734
+236-1,1.0,6.108303905636944,6,5.108303905636944
+236-1,1.0,5.948220706497975,6,4.948220706497975
+236-1,2.0,6.426549524381582,6,4.426549524381582
+121-4,11.0,6.758043664066231,7,4.241956335933769
+236-1,10.0,6.354209987632813,6,3.645790012367187
+106-2,6.0,2.3560557164421536,2,3.6439442835578464
+610-1,7.0,3.4421999613438974,3,3.5578000386561026
+144-1,2.0,5.504393567800781,6,3.504393567800781
+236-1,3.0,6.087933666699748,6,3.0879336666997483
+144-1,2.0,5.0271512782704155,5,3.0271512782704155
+236-1,3.0,5.94830383227512,6,2.94830383227512
+594-1,8.0,5.088674399049482,5,2.9113256009505184
+579-1,1.0,3.8002955137396297,4,2.8002955137396297
+144-1,1.0,3.506469883052797,4,2.506469883052797
+594-1,2.0,4.211254870473325,4,2.211254870473325
+274-1,1.0,3.178047802784146,3,2.178047802784146
+6-1,6.0,7.916426773002424,8,1.9164267730024243
+303-1,5.0,3.133410467406157,3,1.866589532593843
+578-1,6.0,4.135041783845124,4,1.864958216154876
+227-1,3.0,4.722704305237233,5,1.722704305237233
+121-4,2.0,3.7041646248921216,4,1.7041646248921216
+274-1,5.0,3.366022094040395,3,1.633977905959605
+594-1,2.0,3.4405676977301685,3,1.4405676977301685
+236-1,7.0,8.17514047691159,8,1.17514047691159
+244-1,2.0,0.9767275453908661,1,1.0232724546091339
+240-1,3.0,2.1486097889650977,2,0.8513902110349023
+578-1,2.0,2.837972818548389,3,0.8379728185483888
+128-1,2.0,1.2056361580083368,1,0.7943638419916632
+106-2,1.0,1.7707183497574754,2,0.7707183497574754
+114-1,4.0,3.3824321468654253,3,0.6175678531345747
+235-1,4.0,4.605771520245636,5,0.6057715202456357
+219-1,1.0,1.5811487297373823,2,0.5811487297373823
+270-1,1.0,1.4955783384921095,1,0.4955783384921095
+219-1,1.0,1.4710628007140238,1,0.47106280071402384
+490-1,1.0,1.4430468075886558,1,0.4430468075886558
+114-1,3.0,2.650398131344691,3,0.34960186865530885
+121-3,3.0,3.334741490991454,3,0.3347414909914539
+127-1,1.0,1.3111413061694792,1,0.3111413061694792
+227-1,6.0,6.2811157143942795,6,0.2811157143942795
+165-1,1.0,1.2779063848249663,1,0.2779063848249663
+165-1,1.0,1.2387545986836015,1,0.2387545986836015
+114-1,4.0,4.188440266318704,4,0.18844026631870392
+121-3,3.0,3.179400674143066,3,0.17940067414306604
+485-1,1.0,1.1352331314324848,1,0.13523313143248483
+121-4,3.0,3.0768945795365923,3,0.07689457953659229
+490-1,2.0,2.0719671535658875,2,0.07196715356588745
+127-1,2.0,1.9424849846525025,2,0.05751501534749748
+610-1,2.0,2.0511733786528867,2,0.051173378652886736
+43-078,1.0,1.0068620029819555,1,0.006862002981955495
+43-056,1.0,1.0033817317094949,1,0.0033817317094948507
+43-060,1.0,1.0033817317094949,1,0.0033817317094948507
+43-079,1.0,1.0013520071353357,1,0.0013520071353356755
+43-080,1.0,1.0013520071353357,1,0.0013520071353356755
+328-1,1.0,0.9994224607534798,1,0.0005775392465201534
+425-5,1.0,0.9999999999999947,1,5.329070518200751e-15
+43-002,1.0,0.9999999999999947,1,5.329070518200751e-15
+43-009,1.0,0.9999999999999947,1,5.329070518200751e-15
+2-014,1.0,0.9999999999999947,1,5.329070518200751e-15
+43-001,1.0,0.9999999999999947,1,5.329070518200751e-15
+43-005,1.0,0.9999999999999947,1,5.329070518200751e-15
+43-004,1.0,0.9999999999999947,1,5.329070518200751e-15
+443-3,1.0,0.9999999999999947,1,5.329070518200751e-15
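The rows above are listed in descending order of `abs_error`. A minimal sketch of summarizing a file with this header, using three rows copied from the listing as an inline sample so the snippet runs without the real `outputs/val_predictions_extratrees.csv`. The variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Three rows copied verbatim from the listing; the full file has 68 data rows.
SAMPLE = """ma_dia_diem,so_luong_thuc_te,so_luong_du_doan_raw,so_luong_du_doan_round,abs_error
579-1,32.0,16.876830318931813,17,15.123169681068187
106-2,6.0,2.3560557164421536,2,3.6439442835578464
490-1,2.0,2.0719671535658875,2,0.07196715356588745
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# abs_error is expected to equal |actual - raw prediction|; recompute it as a sanity check.
recomputed = (df["so_luong_thuc_te"] - df["so_luong_du_doan_raw"]).abs()
max_diff = (recomputed - df["abs_error"]).abs().max()

# Mean absolute error on the raw and on the rounded predictions.
mae_raw = recomputed.mean()
mae_round = (df["so_luong_thuc_te"] - df["so_luong_du_doan_round"]).abs().mean()

print(f"max |recomputed - stored| = {max_diff:.2e}")
print(f"MAE raw = {mae_raw:.3f}, MAE rounded = {mae_round:.3f}")
```

Replacing `io.StringIO(SAMPLE)` with the CSV path would give the figures for the whole validation set.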
--- a/predict.py
+++ b/predict.py
@ -0,0 +1,207 @@
+import re
+from typing import Dict, List, Optional
+
+import pandas as pd
+
+
+# =========================================================
+# 1) HELPERS
+# =========================================================
+_TASK_SPLIT_RE = re.compile(r"[;|\n]+")
+
+def _split_tasks(tasks_text: str) -> List[str]:
+    """Split tasks on the delimiters ; | or newline"""
+    tasks_text = str(tasks_text).lower()
+    tasks = _TASK_SPLIT_RE.split(tasks_text)
+    return [t.strip() for t in tasks if t.strip()]
+
+def _count_tasks_with_keywords(tasks: List[str], keywords: List[str]) -> int:
+    """Count tasks that contain at least one keyword"""
+    count = 0
+    for task in tasks:
+        if any(k in task for k in keywords):
+            count += 1
+    return count
+
+def _count_tasks_without_keywords(tasks: List[str], all_keywords: List[str]) -> int:
+    """Count tasks that contain NONE of the keywords"""
+    count = 0
+    for task in tasks:
+        if not any(k in task for k in all_keywords):
+            count += 1
+    return count
+
+def _get_empty_features() -> Dict[str, float]:
+    """Return a dict with every feature set to 0 (for missing data)"""
+    return {
+        # TASK COUNTS (7)
+        "num_tasks": 0,
+        "num_cleaning_tasks": 0,
+        "num_trash_collection_tasks": 0,
+        "num_monitoring_tasks": 0,
+        "num_deep_cleaning_tasks": 0,
+        "num_support_tasks": 0,
+        "num_other_tasks": 0,
+
+        # AREA (reduced + aggregated) (7)
+        "num_wc_tasks": 0,
+        "num_hallway_tasks": 0,
+        "num_lobby_tasks": 0,
+        "num_outdoor_tasks": 0,
+        "num_elevator_tasks": 0,
+        "num_medical_tasks_total": 0,
+        "num_indoor_room_tasks": 0,
+
+        # RATIOS & DIVERSITY (4)
+        "cleaning_ratio": 0.0,
+        "trash_collection_ratio": 0.0,
+        "monitoring_ratio": 0.0,
+        "area_diversity": 0,
+    }
+
+
+# =========================================================
+# 2) MAIN: 2 TEXT INPUTS -> FEATURES
+# =========================================================
+def extract_keyword_features_reduced_from_two_texts(
+    task_normal: Optional[str],
+    task_dinhky: Optional[str],
+) -> Dict[str, float]:
+    """
+    Input:
+        task_normal: text of the regular (day-to-day) tasks
+        task_dinhky: text of the periodic tasks
+    Output:
+        Dict of reduced keyword features (same schema as you defined)
+
+    The two texts are combined the same way as in the original version:
+        combined = task_normal + " ; " + task_dinhky
+    """
+
+    tn = "" if task_normal is None or (isinstance(task_normal, float) and pd.isna(task_normal)) else str(task_normal)
+    td = "" if task_dinhky is None or (isinstance(task_dinhky, float) and pd.isna(task_dinhky)) else str(task_dinhky)
+
+    combined = (tn.strip() + " ; " + td.strip()).strip()
+    if not tn.strip() and not td.strip():  # `combined` is never "" because of the " ; " separator
+        return _get_empty_features()
+
+    tasks = _split_tasks(combined)
+    num_tasks = len(tasks)
+    if num_tasks == 0:
+        return _get_empty_features()
+
+    # -----------------------------
+    # GROUP 1: TASK COUNTS (reduced)
+    # -----------------------------
+    cleaning_keywords = [
+        "vệ sinh", "lau", "chùi", "quét", "hút",
+        "đẩy khô", "lau ẩm", "làm sạch", "lau bụi", "lau kính", "lau sàn", "hút bụi"
+    ]
+    trash_keywords = [
+        "thu gom rác", "thay rác", "vận chuyển rác", "tua rác", "đổ rác",
+        "thu rác", "gom rác", "quét rác nổi", "trực rác", "rác nổi"
+    ]
+    monitoring_keywords = [
+        "trực", "trực phát sinh", "trực ps", "kiểm tra", "check",
+        "giám sát", "theo dõi", "tuần tra"
+    ]
+    deep_cleaning_keywords = [
+        "cọ rửa", "cọ bồn cầu", "cọ", "gạt kính", "đánh sàn",
+        "đánh chân tường", "chà tường", "cọ gương", "cọ lavabo"
+    ]
+    support_keywords = [
+        "giao ca", "bàn giao", "bàn giao ca", "chụp ảnh", "nhận ca",
+        "vsdc", "vệ sinh dụng cụ", "chuẩn bị dụng cụ", "chuẩn bị nước", "chuẩn bị", "giao ban"
+    ]
+
+    num_cleaning_tasks = _count_tasks_with_keywords(tasks, cleaning_keywords)
+    num_trash_collection_tasks = _count_tasks_with_keywords(tasks, trash_keywords)
+    num_monitoring_tasks = _count_tasks_with_keywords(tasks, monitoring_keywords)
+    num_deep_cleaning_tasks = _count_tasks_with_keywords(tasks, deep_cleaning_keywords)
+    num_support_tasks = _count_tasks_with_keywords(tasks, support_keywords)
+
+    all_keywords_for_other = (
+        cleaning_keywords + trash_keywords + monitoring_keywords + deep_cleaning_keywords + support_keywords
+    )
+    num_other_tasks = _count_tasks_without_keywords(tasks, all_keywords_for_other)
+
+    # -----------------------------
+    # GROUP 2: AREA COVERAGE (reduced + aggregated)
+    # -----------------------------
+    wc_keywords = [
+        "wc", "toilet", "nhà vệ sinh", "restroom", "phòng vệ sinh",
+        "bồn cầu", "lavabo", "tiểu nam", "bồn tiểu"
+    ]
+    hallway_keywords = ["hành lang", "corridor", "lối đi", "hall", "hl", "hanh lang"]
+    lobby_keywords = ["sảnh", "lobby", "tiền sảnh", "sảnh chính", "sanh"]
+    outdoor_keywords = ["ngoại cảnh", "sân", "vỉa hè", "khuôn viên", "cổng", "bãi xe", "tầng hầm"]
+    elevator_keywords = ["thang máy", "elevator", "lift", "cầu thang", "thang bộ", "tay vịn", "tam cấp"]
+
+    patient_room_keywords = ["phòng bệnh", "giường bệnh", "phòng vip", "phòng bệnh nhân", "pb", "phòng bv"]
+    clinic_room_keywords = ["phòng khám", "khoa khám", "phòng khám bệnh", "khu khám", "pk"]
+    surgery_room_keywords = ["phòng mổ", "hậu phẫu", "phòng phẫu thuật", "khu mổ", "phòng pt"]
+    technical_room_keywords = [
+        "phòng xét nghiệm", "phòng chụp", "xq", "siêu âm", "kho dược",
+        "phòng xn", "labo", "phòng thí nghiệm", "nội soi", "cấp cứu", "hồi sức"
+    ]
+
+    office_keywords = [
+        "phòng nhân viên", "phòng giám đốc", "phòng họp", "phòng hành chính",
+        "văn phòng", "phòng ban", "phòng giao ban", "hội trường", "phòng kế toán"
+    ]
+
+    num_wc_tasks = _count_tasks_with_keywords(tasks, wc_keywords)
+    num_hallway_tasks = _count_tasks_with_keywords(tasks, hallway_keywords)
+    num_lobby_tasks = _count_tasks_with_keywords(tasks, lobby_keywords)
+    num_outdoor_tasks = _count_tasks_with_keywords(tasks, outdoor_keywords)
+    num_elevator_tasks = _count_tasks_with_keywords(tasks, elevator_keywords)
+
+    num_patient_room_tasks = _count_tasks_with_keywords(tasks, patient_room_keywords)
+    num_clinic_room_tasks = _count_tasks_with_keywords(tasks, clinic_room_keywords)
+    num_surgery_room_tasks = _count_tasks_with_keywords(tasks, surgery_room_keywords)
+    num_technical_room_tasks = _count_tasks_with_keywords(tasks, technical_room_keywords)
+
+    num_medical_tasks_total = (
+        num_patient_room_tasks + num_clinic_room_tasks + num_surgery_room_tasks + num_technical_room_tasks
+    )
+
+    num_indoor_room_tasks = _count_tasks_with_keywords(tasks, office_keywords)
+
+    # -----------------------------
+    # GROUP 3: RATIOS & DIVERSITY (reduced)
+    # -----------------------------
+    cleaning_ratio = num_cleaning_tasks / num_tasks if num_tasks > 0 else 0.0
+    trash_collection_ratio = num_trash_collection_tasks / num_tasks if num_tasks > 0 else 0.0
+    monitoring_ratio = num_monitoring_tasks / num_tasks if num_tasks > 0 else 0.0
+
+    area_counts = [
+        num_wc_tasks, num_hallway_tasks, num_lobby_tasks, num_outdoor_tasks, num_elevator_tasks,
+        num_medical_tasks_total, num_indoor_room_tasks
+    ]
+    area_diversity = sum(1 for c in area_counts if c > 0)
+
+    return {
+        # TASK COUNTS (7)
+        "num_tasks": num_tasks,
+        "num_cleaning_tasks": num_cleaning_tasks,
+        "num_trash_collection_tasks": num_trash_collection_tasks,
+        "num_monitoring_tasks": num_monitoring_tasks,
+        "num_deep_cleaning_tasks": num_deep_cleaning_tasks,
+        "num_support_tasks": num_support_tasks,
+        "num_other_tasks": num_other_tasks,
+
+        # AREA (reduced + aggregated) (7)
+        "num_wc_tasks": num_wc_tasks,
+        "num_hallway_tasks": num_hallway_tasks,
+        "num_lobby_tasks": num_lobby_tasks,
+        "num_outdoor_tasks": num_outdoor_tasks,
+        "num_elevator_tasks": num_elevator_tasks,
+        "num_medical_tasks_total": num_medical_tasks_total,
+        "num_indoor_room_tasks": num_indoor_room_tasks,
+
+        # RATIOS & DIVERSITY (4)
+        "cleaning_ratio": round(cleaning_ratio, 4),
+        "trash_collection_ratio": round(trash_collection_ratio, 4),
+        "monitoring_ratio": round(monitoring_ratio, 4),
+        "area_diversity": area_diversity,
+    }
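`_count_tasks_with_keywords` uses plain substring tests. Two consequences follow from the keyword lists above: one task can be counted in several groups (a task "trực rác" contains the monitoring keyword "trực" and is itself a trash keyword), and short keywords such as "hl" or "pk" can match inside longer words. A minimal sketch contrasting the substring rule with a whole-word variant. `count_whole_word` is a hypothetical alternative, not something `predict.py` does, and the sample tasks are invented for illustration.

```python
import re
from typing import List


def count_substring(tasks: List[str], keywords: List[str]) -> int:
    """Same rule as predict.py: a task counts if any keyword is a substring of it."""
    return sum(1 for t in tasks if any(k in t for k in keywords))


def count_whole_word(tasks: List[str], keywords: List[str]) -> int:
    """Hypothetical variant: a keyword must not touch other word characters."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # (?<!\w) and (?!\w) reject matches glued to letters, digits or underscore;
    # for str patterns in Python 3, \w also covers precomposed Vietnamese letters.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return sum(1 for t in tasks if pattern.search(t))


# Invented sample tasks, already lowercased as _split_tasks would produce.
tasks = ["quét hl tầng 3", "dán nhãn highlight"]

substring_hits = count_substring(tasks, ["hl"])
whole_word_hits = count_whole_word(tasks, ["hl"])
print(substring_hits, whole_word_hits)
```

Whether whole-word matching is preferable depends on how the task texts are written; abbreviations glued to digits (for example a keyword directly followed by a floor number) would no longer match, so this is a trade-off rather than a drop-in fix.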
--- a/test3.ipynb
+++ b/test3.ipynb
--- a/train.ipynb
+++ b/train.ipynb