Phase 2 Model Package - 20260106_011633
Model Information
- Model Type: Decision Tree
- Dataset: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
- Total Samples: 454
- Total Features: 92 (Numeric: 46, Text SVD: 50)
Performance Metrics
Validation Set
- R² Score: 0.3339
- MAE: 2.5525
- RMSE: 4.2579
Test Set
- R² Score: 0.2893
- MAE: 3.5730
- RMSE: 5.6899
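For reference, the three metrics reported above follow their standard definitions. The sketch below computes them in plain Python; the `regression_metrics` name and the sample values are illustrative and are not part of the package.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean absolute error: average size of the misses
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: penalizes large misses more heavily
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R²: 1 minus the ratio of residual to total sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Illustrative values only, not taken from the dataset
r2, mae, rmse = regression_metrics([2, 4, 6, 8], [3, 4, 5, 10])
print(f"R2={r2:.4f} MAE={mae:.4f} RMSE={rmse:.4f}")
```

Because RMSE weights large errors more heavily than MAE, the gap between the two on the test set (5.6899 vs 3.5730) suggests that a few predictions miss by a wide margin.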
Files Included
- best_model_Decision_Tree_20260106_011633.pkl - Trained model
- scaler_20260106_011633.pkl - StandardScaler for numeric features
- tfidf_vectorizer_20260106_011633.pkl - TF-IDF vectorizer for text
- svd_model_20260106_011633.pkl - SVD dimensionality reduction
- feature_names_20260106_011633.pkl - List of all feature names
- model_metadata_20260106_011633.pkl - Complete metadata dictionary
- phase2_complete_package_20260106_011633.joblib - All-in-one package (recommended for deployment)
How to Load and Use
Option 1: Load Complete Package (Recommended)
import joblib
import pandas as pd
# Load package
package = joblib.load('phase2_complete_package_20260106_011633.joblib')
model = package['model']
scaler = package['scaler']
tfidf = package['tfidf']
svd = package['svd']
# Make prediction
# 1. Process text
text_combined = "your text here" # Combined task text
tfidf_features = tfidf.transform([text_combined])
text_svd = svd.transform(tfidf_features)
# 2. Prepare numeric features
numeric_features = [...] # Your numeric feature values, in the training column order
n_numeric = len(numeric_features)
# 3. Combine and scale
all_features = pd.concat([
    # Wrap the values in a list so they form one row, matching the single row in text_svd
    pd.DataFrame([numeric_features], columns=package['feature_names'][:n_numeric]),
    pd.DataFrame(text_svd, columns=package['feature_names'][n_numeric:])
], axis=1)
all_features_scaled = scaler.transform(all_features)
# 4. Predict
prediction = model.predict(all_features_scaled)
print(f"Predicted staff count: {prediction[0]:.0f}")
Option 2: Load Individual Files
import pickle
with open('best_model_Decision_Tree_20260106_011633.pkl', 'rb') as f:
    model = pickle.load(f)
with open('scaler_20260106_011633.pkl', 'rb') as f:
    scaler = pickle.load(f)
# ... (same prediction process as above)
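Option 2 above loads only the model and the scaler, while the prediction steps also need the TF-IDF vectorizer, the SVD model, and the feature names. A minimal sketch that loads every individual file listed under Files Included, assuming they all sit in one directory and were written with `pickle`; the `load_artifacts` helper and its dictionary keys are hypothetical.

```python
import pickle
from pathlib import Path

TIMESTAMP = "20260106_011633"

# File names come from the "Files Included" list; the dictionary keys are illustrative.
ARTIFACT_FILES = {
    "model": f"best_model_Decision_Tree_{TIMESTAMP}.pkl",
    "scaler": f"scaler_{TIMESTAMP}.pkl",
    "tfidf": f"tfidf_vectorizer_{TIMESTAMP}.pkl",
    "svd": f"svd_model_{TIMESTAMP}.pkl",
    "feature_names": f"feature_names_{TIMESTAMP}.pkl",
    "metadata": f"model_metadata_{TIMESTAMP}.pkl",
}

def load_artifacts(directory="."):
    """Unpickle each artifact and return them in a dict keyed as above."""
    directory = Path(directory)
    artifacts = {}
    for key, filename in ARTIFACT_FILES.items():
        path = directory / filename
        if not path.exists():
            raise FileNotFoundError(f"Missing artifact: {path}")
        with path.open("rb") as f:
            artifacts[key] = pickle.load(f)
    return artifacts
```

Calling `load_artifacts()` in the package directory returns a dictionary that can stand in for `package` in Option 1. Unpickling the scikit-learn objects requires scikit-learn to be installed, and pickle files should only be loaded from a trusted source.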
Model Configuration
TF-IDF Parameters
- max_features: 200
- ngram_range: (1, 2)
- min_df: 2
- max_df: 0.95
SVD Parameters
- n_components: 50
- explained_variance: 89.66%
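The text configuration above maps onto scikit-learn roughly as follows. This is a sketch, not the training script: it assumes the SVD step is `TruncatedSVD` (the source only says "SVD") and that the reported explained variance is the sum of `explained_variance_ratio_`. The toy corpus and the reduced `n_components=2` are illustrative, since 50 components need at least 50 TF-IDF features.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF settings as listed above
tfidf = TfidfVectorizer(max_features=200, ngram_range=(1, 2), min_df=2, max_df=0.95)

# Toy corpus (illustrative). Only terms found in at least 2 documents survive min_df=2.
corpus = [
    "install network cable",
    "install server rack",
    "test network switch",
    "test server backup",
]
tfidf_matrix = tfidf.fit_transform(corpus)

# Assumption: TruncatedSVD. The package lists n_components=50; 2 is used here
# because the toy vocabulary has only a handful of terms.
svd = TruncatedSVD(n_components=2, random_state=42)
text_svd = svd.fit_transform(tfidf_matrix)

# Fraction of the TF-IDF variance retained by the kept components
explained = svd.explained_variance_ratio_.sum()
print(f"TF-IDF shape: {tfidf_matrix.shape}, SVD shape: {text_svd.shape}")
print(f"Explained variance: {explained:.2%}")
```

At prediction time the fitted objects from the package should be used with `transform` only; refitting on new text would change the vocabulary and the SVD components.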
Training Parameters
- random_state: 42
- train_size: 289 (63.7% of 454)
- val_size: 62 (13.7% of 454)
- test_size: 62 (13.7% of 454)
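The three split sizes above sum to 413 rows, and a 70/15/15 split of 413 rows yields exactly 289/62/62. The source does not say how the split was produced; the sketch below shows one common way to arrive at those counts with `train_test_split`, using placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 413 rows (the sum of the three split sizes above)
X = np.arange(413).reshape(-1, 1)
y = np.arange(413)

# Stage 1: hold out 30% of the rows
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# Stage 2: split the held-out rows evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` in both stages keeps the split reproducible across runs.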
Phase 1 vs Phase 2 Comparison
- Phase 1 (Numeric only): R² = 0.4136
- Phase 2 (With text): R² = 0.3339
- Improvement: -0.0797 (Phase 2 scores lower than Phase 1)
Notes
- This model includes text features extracted from task descriptions
- Text preprocessing: lowercase, remove special chars, combine task columns
- Feature engineering: TF-IDF → SVD → StandardScaler
- Use the same preprocessing pipeline for new predictions
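The text preprocessing listed above (lowercase, remove special characters, combine task columns) could look like the sketch below. The exact rules used in training are not given here, so the regular expression, the function names, and the example column values are assumptions; for real predictions, match whatever the training code did.

```python
import re

def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def combine_task_text(task_values):
    """Clean each task field and join the non-empty results into one string."""
    cleaned = [clean_text(v) for v in task_values if v is not None]
    return " ".join(c for c in cleaned if c)

# Hypothetical task columns for one record
text_combined = combine_task_text(["Install Wi-Fi (Floor 2)", None, "Test & document!"])
print(text_combined)
```

The resulting string plays the role of `text_combined` in the loading example, where it is passed to the fitted TF-IDF vectorizer.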
Generated: 2026-01-06 01:16:33