# Phase 2 Model Package - 20260106_011633

## Model Information

- **Model Type**: Decision Tree
- **Dataset**: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
- **Total Samples**: 454
- **Total Features**: 92 (Numeric: 46, Text SVD: 50)

## Performance Metrics

### Validation Set

- **R² Score**: 0.3339
- **MAE**: 2.5525
- **RMSE**: 4.2579

### Test Set

- **R² Score**: 0.2893
- **MAE**: 3.5730
- **RMSE**: 5.6899

## Files Included

1. **best_model_Decision_Tree_20260106_011633.pkl** - Trained model
2. **scaler_20260106_011633.pkl** - StandardScaler for numeric features
3. **tfidf_vectorizer_20260106_011633.pkl** - TF-IDF vectorizer for text
4. **svd_model_20260106_011633.pkl** - SVD dimensionality reduction
5. **feature_names_20260106_011633.pkl** - List of all feature names
6. **model_metadata_20260106_011633.pkl** - Complete metadata dictionary
7. **phase2_complete_package_20260106_011633.joblib** - All-in-one package (recommended for deployment)

## How to Load and Use

### Option 1: Load Complete Package (Recommended)

```python
import joblib
import pandas as pd

# Load package
package = joblib.load('phase2_complete_package_20260106_011633.joblib')
model = package['model']
scaler = package['scaler']
tfidf = package['tfidf']
svd = package['svd']
feature_names = package['feature_names']

# Make prediction
# 1. Process text
text_combined = "your text here"  # Combined task text
tfidf_features = tfidf.transform([text_combined])
text_svd = svd.transform(tfidf_features)

# 2. Prepare numeric features
numeric_features = [...]  # Your numeric features array (one value per numeric feature)

# 3. Combine and scale
n_numeric = len(numeric_features)
all_features = pd.concat([
    # Wrap in a list so the values form a single row, not a single column
    pd.DataFrame([numeric_features], columns=feature_names[:n_numeric]),
    pd.DataFrame(text_svd, columns=feature_names[n_numeric:])
], axis=1)
all_features_scaled = scaler.transform(all_features)

# 4. Predict
prediction = model.predict(all_features_scaled)
print(f"Predicted staff count: {prediction[0]:.0f}")
```

### Option 2: Load Individual Files

```python
import pickle

with open('best_model_Decision_Tree_20260106_011633.pkl', 'rb') as f:
    model = pickle.load(f)

with open('scaler_20260106_011633.pkl', 'rb') as f:
    scaler = pickle.load(f)

# ... (same prediction process as above)
```

## Model Configuration

### TF-IDF Parameters

- max_features: 200
- ngram_range: (1, 2)
- min_df: 2
- max_df: 0.95

### SVD Parameters

- n_components: 50
- explained_variance: 89.66%

### Training Parameters

- random_state: 42
- train_size: 289 (63.7%)
- val_size: 62 (13.7%)
- test_size: 62 (13.7%)

## Phase 1 vs Phase 2 Comparison

- Phase 1 (Numeric only): R² = 0.4136
- Phase 2 (With text): R² = 0.3339
- Improvement: -0.0797 (adding text features lowered R²)

## Notes

- This model includes text features extracted from task descriptions
- Text preprocessing: lowercase, remove special chars, combine task columns
- Feature engineering: TF-IDF → SVD → StandardScaler
- Use the same preprocessing pipeline for new predictions

Generated: 2026-01-06 01:16:33
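
## Example: Text Preprocessing Sketch

The Notes describe the text preprocessing only as "lowercase, remove special chars, combine task columns", and the package does not appear to include that code. The sketch below is one plausible way to build `text_combined` before calling `tfidf.transform`. It is an illustration under stated assumptions, not the original training code: the function names `clean_text` and `combine_task_text`, the column names `task_title` and `task_description`, and the exact definition of "special chars" (here, anything other than ASCII letters, digits, and whitespace) are all hypothetical. If the training pipeline used different rules, predictions will be affected, so the original preprocessing script should take precedence where available.

```python
import math
import re


def clean_text(value):
    """Lowercase a value and replace special characters with spaces.

    Assumption: "special chars" means anything that is not an ASCII
    letter, digit, or whitespace. Missing values become an empty string.
    """
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""  # pandas represents missing cells as NaN
    text = str(value).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()


def combine_task_text(record, task_columns):
    """Clean each task column and join the non-empty parts with spaces."""
    parts = [clean_text(record.get(col)) for col in task_columns]
    return " ".join(part for part in parts if part)


# Hypothetical record and column names, for illustration only
record = {
    "task_title": "Install HVAC Units (x3)",
    "task_description": "Roof-top; crane needed!",
}
text_combined = combine_task_text(record, ["task_title", "task_description"])
print(text_combined)  # install hvac units x3 roof top crane needed
```

The resulting `text_combined` string can be passed to `tfidf.transform([text_combined])` as in Option 1. Note that the ASCII-only pattern also strips accented and non-Latin characters, which may or may not match how the training text was cleaned.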