Phase 2 Model Package - 20260106_011633

Model Information

  • Model Type: Decision Tree
  • Dataset: FINAL_DATASET_WITH_TEXT_BACKUP_20260105_213507.xlsx
  • Total Samples: 454
  • Total Features: 92 (Numeric: 46, Text SVD: 50; this breakdown sums to 96, so check the saved feature_names list for the exact count)

Performance Metrics

Validation Set

  • R² Score: 0.3339
  • MAE: 2.5525
  • RMSE: 4.2579

Test Set

  • R² Score: 0.2893
  • MAE: 3.5730
  • RMSE: 5.6899
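
For readers who want to recompute these numbers on their own predictions, the three metrics follow the standard definitions sketched below. This is a minimal standard-library version with made-up values. The function name `regression_metrics` is hypothetical, and the README does not say which library produced the reported scores.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # MAE: mean absolute error
    mae = sum(abs(e) for e in errors) / n
    # RMSE: square root of the mean squared error
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    # R²: 1 minus residual sum of squares over total sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Made-up staff counts, for illustration only
y_true = [3, 5, 8, 10]
y_pred = [4, 5, 6, 11]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(f"R2={r2:.4f} MAE={mae:.4f} RMSE={rmse:.4f}")
```

Because RMSE squares each error before averaging, it is always at least as large as MAE; a wide gap between the two, as on the test set here, usually points to a few large misses.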

Files Included

  1. best_model_Decision_Tree_20260106_011633.pkl - Trained model
  2. scaler_20260106_011633.pkl - StandardScaler for numeric features
  3. tfidf_vectorizer_20260106_011633.pkl - TF-IDF vectorizer for text
  4. svd_model_20260106_011633.pkl - SVD dimensionality reduction
  5. feature_names_20260106_011633.pkl - List of all feature names
  6. model_metadata_20260106_011633.pkl - Complete metadata dictionary
  7. phase2_complete_package_20260106_011633.joblib - All-in-one package (recommended for deployment)
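
Before relying on the all-in-one package, it may help to confirm that it exposes the keys the loading example reads. The sketch below assumes the package is a plain dict; the key names are the ones used in that example, and `missing_package_keys` is a hypothetical helper, not part of the package.

```python
# Keys read by the loading example in this README
REQUIRED_KEYS = ("model", "scaler", "tfidf", "svd", "feature_names")

def missing_package_keys(package):
    """Return the required keys that are absent from a loaded package dict."""
    return [key for key in REQUIRED_KEYS if key not in package]

# Stand-in dict for illustration; a real package would come from joblib.load(...)
example_package = {"model": object(), "scaler": object(), "tfidf": object()}
missing = missing_package_keys(example_package)
print(missing)  # prints ['svd', 'feature_names']
```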

How to Load and Use

Option 1: Load Complete Package (Recommended)

import joblib
import pandas as pd

# Load package
package = joblib.load('phase2_complete_package_20260106_011633.joblib')
model = package['model']
scaler = package['scaler']
tfidf = package['tfidf']
svd = package['svd']

# Make prediction
# 1. Process text
text_combined = "your text here"  # Combined task text
tfidf_features = tfidf.transform([text_combined])
text_svd = svd.transform(tfidf_features)

# 2. Prepare numeric features as a single row: shape (1, n_numeric)
numeric_features = [[...]]  # Your numeric feature values, in the same order as in training
n_numeric = len(numeric_features[0])

# 3. Combine and scale
feature_names = list(package['feature_names'])
all_features = pd.concat([
    pd.DataFrame(numeric_features, columns=feature_names[:n_numeric]),
    pd.DataFrame(text_svd, columns=feature_names[n_numeric:])
], axis=1)
all_features_scaled = scaler.transform(all_features)

# 4. Predict
prediction = model.predict(all_features_scaled)
print(f"Predicted staff count: {prediction[0]:.0f}")

Option 2: Load Individual Files

import pickle

with open('best_model_Decision_Tree_20260106_011633.pkl', 'rb') as f:
    model = pickle.load(f)

with open('scaler_20260106_011633.pkl', 'rb') as f:
    scaler = pickle.load(f)

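with open('tfidf_vectorizer_20260106_011633.pkl', 'rb') as f:
    tfidf = pickle.load(f)

with open('svd_model_20260106_011633.pkl', 'rb') as f:
    svd = pickle.load(f)

# Then follow the same prediction steps as in the package example above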
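
The prediction steps can also be wrapped in one function with a feature-count check, which catches rows whose length does not match the saved feature list. This is a sketch under assumptions: `predict_staff_count` is a hypothetical helper, the package is assumed to be a dict with the keys used in the loading example, and the row is passed as a plain list rather than a DataFrame (a scaler fitted on named columns may emit a feature-name warning in that case).

```python
def predict_staff_count(package, text_combined, numeric_row):
    """Run text -> TF-IDF -> SVD, append to numeric values, scale, predict."""
    tfidf_features = package["tfidf"].transform([text_combined])
    text_svd = package["svd"].transform(tfidf_features)
    # One flat row: numeric values first, then the SVD text components
    row = [float(v) for v in numeric_row] + [float(v) for v in text_svd[0]]
    # Guard against a mismatch with the feature list saved at training time
    expected = len(package["feature_names"])
    if len(row) != expected:
        raise ValueError(f"expected {expected} features, got {len(row)}")
    scaled = package["scaler"].transform([row])
    return float(package["model"].predict(scaled)[0])
```

With a loaded package, a call would look like `predict_staff_count(package, "your text here", numeric_values)`, where `numeric_values` holds the numeric features in training order.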

Model Configuration

TF-IDF Parameters

  • max_features: 200
  • ngram_range: (1, 2)
  • min_df: 2
  • max_df: 0.95

SVD Parameters

  • n_components: 50
  • explained_variance: 89.66%
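
The TF-IDF and SVD settings above could be reproduced roughly as follows. This is a sketch, assuming the SVD step is scikit-learn's `TruncatedSVD` (the README only says "SVD") and that `random_state=42` was also passed to it; `build_text_features` is a hypothetical helper and the corpus is synthetic.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def build_text_features(texts, n_components=50, random_state=42):
    """Fit TF-IDF, then TruncatedSVD, using the parameters listed above."""
    tfidf = TfidfVectorizer(
        max_features=200, ngram_range=(1, 2), min_df=2, max_df=0.95
    )
    tfidf_matrix = tfidf.fit_transform(texts)
    svd = TruncatedSVD(n_components=n_components, random_state=random_state)
    reduced = svd.fit_transform(tfidf_matrix)
    # Share of TF-IDF variance kept by the retained components
    explained = float(svd.explained_variance_ratio_.sum())
    return tfidf, svd, reduced, explained

# Synthetic corpus (made up): 60 documents of 5 overlapping tokens each
corpus = [" ".join(f"term{(i + k) % 60}" for k in range(5)) for i in range(60)]
tfidf, svd, reduced, explained = build_text_features(corpus)
print(reduced.shape, f"explained variance: {explained:.2%}")
```

The `explained` value corresponds to the explained_variance figure reported above; on real task text it depends on the corpus, so it will not match 89.66% here.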

Training Parameters

  • random_state: 42
  • train_size: 289 (63.7% of the 454 total samples)
  • val_size: 62 (13.7%)
  • test_size: 62 (13.7%)
  • The three splits account for 413 of the 454 samples (about 91%)
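
The split sizes are consistent with a 70/15/15 split of 413 rows, although the README does not state the ratios or the splitting routine. The standard-library sketch below only illustrates how such sizes arise; `split_indices` is a hypothetical helper and will not reproduce the original row assignment.

```python
import random

def split_indices(n_samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains
    return train, val, test

# 413 is assumed to be the number of rows actually split (289 + 62 + 62)
train, val, test = split_indices(413)
print(len(train), len(val), len(test))  # prints 289 62 62
```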

Phase 1 vs Phase 2 Comparison

  • Phase 1 (Numeric only): R² = 0.4136
  • Phase 2 (With text): R² = 0.3339
  • Improvement: -0.0797 (Phase 2 scores lower than Phase 1)

Notes

  • This model includes text features extracted from task descriptions
  • Text preprocessing: lowercase, remove special characters, combine task columns
  • Feature engineering: TF-IDF → SVD → StandardScaler
  • Use the same preprocessing pipeline for new predictions
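
The text preprocessing described in the notes might look like the sketch below. The exact cleaning rules used in training are not given in this README, so the regular expressions, the function name `combine_task_text`, and the sample strings are all assumptions; new predictions should reuse the original training code where it is available.

```python
import re

def combine_task_text(*task_fields):
    """Lowercase, strip special characters, and join task fields into one string."""
    cleaned = []
    for field in task_fields:
        if field is None:
            continue  # skip empty cells
        text = str(field).lower()
        # Replace anything that is not a word character or whitespace;
        # \w matches Unicode letters in Python 3
        text = re.sub(r"[^\w\s]", " ", text)
        # Collapse runs of whitespace
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            cleaned.append(text)
    return " ".join(cleaned)

text_combined = combine_task_text("Check Machine #3!", None, "Clean (Area A)")
print(text_combined)  # prints: check machine 3 clean area a
```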

Generated: 2026-01-06 01:16:33