# Phase 2 Model Package - 20260106_011633

## Model Information
- **Model Type**: Decision Tree
- **Dataset**: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
- **Total Samples**: 454
- **Total Features**: 92 (Numeric: 46, Text SVD: 50)

## Performance Metrics

### Validation Set
- **R² Score**: 0.3339
- **MAE**: 2.5525
- **RMSE**: 4.2579

### Test Set
- **R² Score**: 0.2893
- **MAE**: 3.5730
- **RMSE**: 5.6899
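
For reference, the three metrics above follow their standard definitions. The sketch below computes them in plain Python on made-up values; `regression_metrics` and the sample numbers are illustrative assumptions and are not part of the package.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    return r2, mae, rmse

# Illustrative values only, not taken from the dataset.
y_true = [2, 4, 6, 8]
y_pred = [3, 4, 5, 10]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R2={r2:.4f}  MAE={mae:.4f}  RMSE={rmse:.4f}")
```

Because RMSE squares each error before averaging, an RMSE well above the MAE, as in both sets here, usually points to a small number of large misses rather than uniformly sized errors.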

## Files Included

1. **best_model_Decision_Tree_20260106_011633.pkl** - Trained model
2. **scaler_20260106_011633.pkl** - StandardScaler for numeric features
3. **tfidf_vectorizer_20260106_011633.pkl** - TF-IDF vectorizer for text
4. **svd_model_20260106_011633.pkl** - SVD dimensionality reduction
5. **feature_names_20260106_011633.pkl** - List of all feature names
6. **model_metadata_20260106_011633.pkl** - Complete metadata dictionary
7. **phase2_complete_package_20260106_011633.joblib** - All-in-one package (recommended for deployment)

## How to Load and Use

### Option 1: Load Complete Package (Recommended)
```python
import joblib
import numpy as np
import pandas as pd

# Load the all-in-one package and unpack its components
package = joblib.load('phase2_complete_package_20260106_011633.joblib')
model = package['model']
scaler = package['scaler']
tfidf = package['tfidf']
svd = package['svd']
feature_names = list(package['feature_names'])

# Make prediction
# 1. Process text: TF-IDF, then SVD
text_combined = "your text here"  # Combined, preprocessed task text
tfidf_features = tfidf.transform([text_combined])
text_svd = svd.transform(tfidf_features)  # shape: (1, n_svd_components)

# 2. Prepare numeric features as a single row, in training order
numeric_features = [...]  # Your numeric feature values
numeric_row = np.asarray(numeric_features, dtype=float).reshape(1, -1)
n_numeric = numeric_row.shape[1]

# 3. Combine and scale (assumes numeric features come first in feature_names)
all_features = pd.concat([
    pd.DataFrame(numeric_row, columns=feature_names[:n_numeric]),
    pd.DataFrame(text_svd, columns=feature_names[n_numeric:])
], axis=1)
all_features_scaled = scaler.transform(all_features)

# 4. Predict
prediction = model.predict(all_features_scaled)
print(f"Predicted staff count: {prediction[0]:.0f}")
```
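
Before scaling, it can help to confirm that the assembled row has exactly as many values as the package's `feature_names` list. The feature counts under Model Information (92 total, but 46 numeric plus 50 text SVD) do not add up, so the length of `feature_names` is the safer reference. The helper below is a hypothetical sketch; `check_feature_row` and the toy names are not part of the package.

```python
def check_feature_row(numeric_values, svd_values, feature_names):
    """Return one {feature_name: value} row, or raise if the lengths disagree."""
    row = list(numeric_values) + list(svd_values)
    if len(row) != len(feature_names):
        raise ValueError(
            f"expected {len(feature_names)} features, got {len(row)} "
            f"({len(numeric_values)} numeric + {len(svd_values)} text)"
        )
    return dict(zip(feature_names, row))

# Toy example with made-up names and values.
names = ["num_a", "num_b", "svd_0"]
row = check_feature_row([1.0, 2.0], [0.5], names)
print(row)
```

With the real package, `feature_names` would be `package['feature_names']` and `svd_values` would be the single row in `text_svd`.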

### Option 2: Load Individual Files
```python
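import pickle

def load_pickle(path):
    """Load one pickled artifact from disk."""
    with open(path, 'rb') as f:
        return pickle.load(f)

model = load_pickle('best_model_Decision_Tree_20260106_011633.pkl')
scaler = load_pickle('scaler_20260106_011633.pkl')
tfidf = load_pickle('tfidf_vectorizer_20260106_011633.pkl')
svd = load_pickle('svd_model_20260106_011633.pkl')
feature_names = load_pickle('feature_names_20260106_011633.pkl')

# Then follow the same prediction steps as in Option 1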
```

## Model Configuration

### TF-IDF Parameters
- max_features: 200
- ngram_range: (1, 2)
- min_df: 2
- max_df: 0.95

### SVD Parameters
- n_components: 50
- explained_variance: 89.66%
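
To rebuild an equivalent text pipeline from scratch, the TF-IDF and SVD parameters above map onto scikit-learn constructors roughly as follows. This is a sketch under assumptions: the README only says "SVD", so the use of `TruncatedSVD` and the reuse of `random_state=42` for it are guesses, not confirmed settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Parameters copied from the TF-IDF list above.
tfidf = TfidfVectorizer(
    max_features=200,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95,
)

# Assumption: TruncatedSVD is the SVD class, seeded like the rest of training.
svd = TruncatedSVD(n_components=50, random_state=42)

# After fitting on the TF-IDF matrix, the retained variance can be read with:
#   svd.explained_variance_ratio_.sum()
```

For predictions, prefer the fitted `tfidf` and `svd` objects shipped in the package; refitting on different text would produce components that the trained model does not recognize.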

### Training Parameters
- random_state: 42
- train_size: 289 (63.7%)
- val_size: 62 (13.7%)
- test_size: 62 (13.7%)
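
The percentages above are fractions of the 454 total samples. The short check below reproduces them and shows that the three splits cover 413 samples, leaving 41 unaccounted for. This README does not say what happened to those 41 (for example, whether they were dropped during cleaning), so treat that gap as an open question.

```python
total_samples = 454
splits = {"train": 289, "val": 62, "test": 62}

# Each split as a share of the full dataset.
for name, count in splits.items():
    print(f"{name}: {count} ({count / total_samples:.1%})")

assigned = sum(splits.values())
unassigned = total_samples - assigned
print(f"assigned: {assigned}, not in any split: {unassigned}")
```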

## Phase 1 vs Phase 2 Comparison

- Phase 1 (Numeric only): R² = 0.4136
- Phase 2 (With text): R² = 0.3339
- Improvement: -0.0797 (Phase 2 scores lower than Phase 1)

## Notes
- This model includes text features extracted from task descriptions
- Text preprocessing: lowercase, remove special chars, combine task columns
- Feature engineering: TF-IDF → SVD → StandardScaler
- Use the same preprocessing pipeline for new predictions
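
The text preprocessing step listed above could look roughly like the following. The exact character filter and the task column names are not given in this README, so `combine_task_text`, the regular expression, and the example inputs are all assumptions. Check them against the training script before relying on them.

```python
import re

def combine_task_text(task_values):
    """Join task text fields, lowercase them, and strip special characters."""
    # Skip missing or empty fields before joining.
    parts = [str(v) for v in task_values if v is not None and str(v).strip()]
    text = " ".join(parts).lower()
    # Assumption: "special chars" means anything other than letters, digits, whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the repeated whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

# Made-up task descriptions, for illustration only.
text_combined = combine_task_text(
    ["Install HVAC units (x3)", None, "Final inspection & sign-off"]
)
print(text_combined)
```

The resulting string is what would be passed to `tfidf.transform([text_combined])` in the loading examples.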

Generated: 2026-01-06 01:16:33