Exercise 6a - AI Centar Lipik

Examine the dataset german_credit_data.csv and define the target variable. If you cannot find one, choose a variable whose column contains only two distinct values. Remove all non-numeric features from the dataset, and wherever a value is missing, insert the median. Hold out 20% of the dataset for testing, and when splitting the dataset with train_test_split use random_state=42 and stratify=y. Train a k-NN classifier with 5 neighbours and print all model metrics, including the confusion matrix.

Then lower the threshold by 50%, and recompute and print all model metrics, including the confusion matrix.
Then raise the threshold by 50% relative to the initial value, and recompute and print all model metrics, including the confusion matrix.

Submit the .ipynb file containing your solution.
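The fallback rule in the task (pick a column that holds only two values) can be checked programmatically. The following is a minimal sketch on a hypothetical miniature frame; the column names and values are illustrative and are not taken from the real file.

```python
import pandas as pd

# Hypothetical miniature frame; the real columns in german_credit_data.csv may differ.
toy = pd.DataFrame({
    "Age": [22, 45, 31, 58],
    "Sex": ["male", "female", "male", "male"],
    "Housing": ["own", "rent", "free", "own"],
})

# nunique() counts the distinct non-missing values in each column.
distinct_counts = toy.nunique()

# Columns with exactly two distinct values are candidates for a binary target.
binary_columns = distinct_counts[distinct_counts == 2].index.tolist()
print(binary_columns)
```

On the real data the same two lines would be applied to df instead of toy.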

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn import metrics

df = pd.read_csv("german_credit_data.csv", low_memory=False)
df.head()

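# The two-valued "Sex" column serves as the target here (male = 1, otherwise 0).
target = (df["Sex"] == "male").astype(int)
# Keep only the numeric features.
x_num = df.select_dtypes(include=[np.number])

# Replace missing values with the column median.
imputer = SimpleImputer(strategy="median")
x_num_imputed = pd.DataFrame(imputer.fit_transform(x_num), columns=x_num.columns)

# Attach the target, then drop it again so that `features` holds only the imputed numeric columns.
df_clean = pd.concat([x_num_imputed, target], axis=1)
features = df_clean.drop(columns=["Sex"]).select_dtypes(include=[np.number])
features.head()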
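To make the median imputation step concrete, here is a small sketch on a hypothetical two-column frame (the values are invented for illustration). Each missing entry is replaced by the median of the observed values in its own column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values; one entry is missing in each column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Duration": [12.0, 24.0, np.nan, 48.0],
})

imputer_demo = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer_demo.fit_transform(toy), columns=toy.columns)

# Age: median of 20, 40, 30 is 30. Duration: median of 12, 24, 48 is 24.
print(filled)
```

Note that the notebook fits the imputer on the full dataset before splitting. Fitting it on the training portion only would keep test-set statistics out of preprocessing, at the cost of a slightly longer pipeline.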

# Hold out 20% for testing; stratify keeps the class ratio the same in both parts.
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
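The effect of stratify can be seen on synthetic labels. This sketch assumes a 70/30 class imbalance chosen purely for illustration, and checks that both parts of the split keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 imbalance (an assumption for illustration only).
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 1 in the training and test parts.
print(y_tr.mean(), y_te.mean())
```

Without stratify, the share of class 1 in a 20-row test set could drift noticeably from 70% by chance.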

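knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_features, train_target)
# Hard class predictions by majority vote; the threshold experiments below use predict_proba instead.
predictions = knn.predict(test_features)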
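For a two-class problem with an odd number of neighbours, the majority vote used by predict gives the same labels as thresholding the class-1 probability at 0.50, which is why 0.50 can serve as the baseline threshold. A minimal sketch on synthetic stand-in data (make_classification is an assumption here; the notebook itself uses german_credit_data.csv):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the credit dataset.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_syn[:150], y_syn[:150])

hard = knn_demo.predict(X_syn[150:])
soft = (knn_demo.predict_proba(X_syn[150:])[:, 1] >= 0.5).astype(int)

# With 5 neighbours and two classes a tied vote is impossible, so both routes agree.
print(np.array_equal(hard, soft))
```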

def evaluate_threshold(threshold, probs, y_true, title):
    # Predict class 1 when its estimated probability reaches the threshold, then print all metrics.
    y_pred = (probs >= threshold).astype(int)
    acc = metrics.accuracy_score(y_true, y_pred)
    prec = metrics.precision_score(y_true, y_pred, zero_division=0)
    rec = metrics.recall_score(y_true, y_pred, zero_division=0)
    f1 = metrics.f1_score(y_true, y_pred, zero_division=0)
    cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
    print(f"\n=== {title} ===")
    print(f"Threshold: {threshold:.2f}")
    print(f"Accuracy:  {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall:    {rec:.4f}")
    print(f"F1 score:  {f1:.4f}")
    print("Confusion matrix (rows=true [0,1], cols=pred [0,1]):")
    print(cm)
    print("\nC. report:\n", metrics.classification_report(y_true, y_pred, digits=4, zero_division=0))

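# Estimated probability of class 1 (second column of predict_proba) for every test row.
probs_test = knn.predict_proba(test_features)[:, 1]
# In the titles, "Niži" means lower and "Viši" means higher.
evaluate_threshold(0.50, probs_test, test_target, "Baseline")
evaluate_threshold(0.25, probs_test, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75, probs_test, test_target, "Viši threshold (+50% from 0.50)")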


=== Baseline ===
Threshold: 0.50
Accuracy:  0.6200
Precision: 0.6937
Recall:    0.8043
F1 score:  0.7450
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[ 13  49]
 [ 27 111]]

C. report:
               precision    recall  f1-score   support

           0     0.3250    0.2097    0.2549        62
           1     0.6937    0.8043    0.7450       138

    accuracy                         0.6200       200
   macro avg     0.5094    0.5070    0.4999       200
weighted avg     0.5794    0.6200    0.5930       200


=== Niži threshold (-50% from 0.50) ===
Threshold: 0.25
Accuracy:  0.6900
Precision: 0.6959
Recall:    0.9783
F1 score:  0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]

C. report:
               precision    recall  f1-score   support

           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138

    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200


=== Viši threshold (+50% from 0.50) ===
Threshold: 0.75
Accuracy:  0.4650
Precision: 0.6667
Recall:    0.4493
F1 score:  0.5368
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[31 31]
 [76 62]]

C. report:
               precision    recall  f1-score   support

           0     0.2897    0.5000    0.3669        62
           1     0.6667    0.4493    0.5368       138

    accuracy                         0.4650       200
   macro avg     0.4782    0.4746    0.4518       200
weighted avg     0.5498    0.4650    0.4841       200
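Every headline figure printed above follows from the four cells of the confusion matrix. As a check, this sketch recomputes the Baseline metrics with NumPy only, using the matrix copied from the Baseline output.

```python
import numpy as np

# Baseline confusion matrix from the output above (rows = true, cols = predicted).
cm_baseline = np.array([[13, 49],
                        [27, 111]])
tn, fp, fn, tp = cm_baseline.ravel()

accuracy = (tp + tn) / cm_baseline.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)                 # how many predicted 1s are truly 1
recall = tp / (tp + fn)                    # how many true 1s were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same four formulas applied to the other two matrices reproduce their printed metrics as well.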

Lowering the threshold by 50%

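# Halve each threshold from the previous cell: 0.50 -> 0.25, 0.25 -> 0.125, 0.75 -> 0.375.
# The titles are reused unchanged, so read the printed "Threshold:" line rather than the title.
# The two-decimal format prints 0.125 as 0.12 and 0.375 as 0.38.
probs_test_m50 = knn.predict_proba(test_features)[:, 1]  # same values as probs_test; not used below
evaluate_threshold(0.5*0.5, probs_test, test_target, "Baseline")
evaluate_threshold(0.25*0.5, probs_test, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75*0.5, probs_test, test_target, "Viši threshold (+50% from 0.50)")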


=== Baseline ===
Threshold: 0.25
Accuracy:  0.6900
Precision: 0.6959
Recall:    0.9783
F1 score:  0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]


C. report:
               precision    recall  f1-score   support

           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138

    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200


=== Niži threshold (-50% from 0.50) ===
Threshold: 0.12
Accuracy:  0.6850
Precision: 0.6884
Recall:    0.9928
F1 score:  0.8131
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  0  62]
 [  1 137]]

C. report:
               precision    recall  f1-score   support

           0     0.0000    0.0000    0.0000        62
           1     0.6884    0.9928    0.8131       138

    accuracy                         0.6850       200
   macro avg     0.3442    0.4964    0.4065       200
weighted avg     0.4750    0.6850    0.5610       200


=== Viši threshold (+50% from 0.50) ===
Threshold: 0.38
Accuracy:  0.6900
Precision: 0.6959
Recall:    0.9783
F1 score:  0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]


C. report:
               precision    recall  f1-score   support

           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138

    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200
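The thresholds 0.25 and 0.375 give identical results above because a k-NN classifier with 5 uniformly weighted neighbours (the scikit-learn default) can only output the probabilities 0, 0.2, 0.4, 0.6, 0.8 and 1. Any two thresholds that fall between the same pair of levels select exactly the same predictions. The sketch below, which uses no data at all, lists how many class-1 neighbours each threshold effectively requires.

```python
import numpy as np

# Possible numbers of class-1 neighbours and the resulting probabilities for k = 5.
votes = np.arange(6)
levels = votes / 5          # 0.0, 0.2, 0.4, 0.6, 0.8, 1.0

def min_votes(threshold):
    """Smallest number of class-1 neighbours that satisfies prob >= threshold."""
    passing = votes[levels >= threshold]
    return int(passing[0]) if passing.size else None

for threshold in (0.125, 0.25, 0.375, 0.50, 0.75, 1.125):
    print(f"threshold {threshold:<6} needs {min_votes(threshold)} of 5 votes")
```

With only six attainable probability levels, a threshold sweep on this model has at most seven distinct outcomes, so finer steps add no information.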

Raising the threshold by 50%

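# Multiply each threshold from the first cell by 1.5: 0.50 -> 0.75, 0.25 -> 0.375, 0.75 -> 1.125.
# The titles are reused unchanged, so read the printed "Threshold:" line rather than the title.
# A threshold of 1.125 is above the maximum probability of 1, so no row is predicted as class 1.
probs_test_p50 = knn.predict_proba(test_features)[:, 1]  # same values as probs_test; not used below
evaluate_threshold(0.5*1.5, probs_test, test_target, "Baseline")
evaluate_threshold(0.25*1.5, probs_test, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75*1.5, probs_test, test_target, "Viši threshold (+50% from 0.50)")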


=== Baseline ===
Threshold: 0.75
Accuracy:  0.4650
Precision: 0.6667
Recall:    0.4493
F1 score:  0.5368
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[31 31]
 [76 62]]


C. report:
               precision    recall  f1-score   support

           0     0.2897    0.5000    0.3669        62
           1     0.6667    0.4493    0.5368       138

    accuracy                         0.4650       200
   macro avg     0.4782    0.4746    0.4518       200
weighted avg     0.5498    0.4650    0.4841       200


=== Niži threshold (-50% from 0.50) ===
Threshold: 0.38
Accuracy:  0.6900
Precision: 0.6959
Recall:    0.9783
F1 score:  0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]

C. report:
               precision    recall  f1-score   support

           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138

    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200


=== Viši threshold (+50% from 0.50) ===
Threshold: 1.12
Accuracy:  0.3100
Precision: 0.0000
Recall:    0.0000
F1 score:  0.0000
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[ 62   0]
 [138   0]]


C. report:
               precision    recall  f1-score   support

           0     0.3100    1.0000    0.4733        62
           1     0.0000    0.0000    0.0000       138

    accuracy                         0.3100       200
   macro avg     0.1550    0.5000    0.2366       200
weighted avg     0.0961    0.3100    0.1467       200
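The last result is a degenerate case: 0.75 * 1.5 = 1.125 is larger than any probability, so every row is predicted as class 0 and accuracy drops to the share of class 0 in the test set. A short sketch of both facts, using hypothetical probabilities and the class supports from the report above:

```python
import numpy as np

# Hypothetical probabilities; any values in [0, 1] behave the same way.
probs_demo = np.array([0.0, 0.2, 0.6, 1.0])
y_pred_demo = (probs_demo >= 0.75 * 1.5).astype(int)   # threshold 1.125
print(y_pred_demo)

# Class supports from the report above: 62 rows of class 0, 138 rows of class 1.
n_neg, n_pos = 62, 138
accuracy_all_negative = n_neg / (n_neg + n_pos)
print(accuracy_all_negative)
```

A relative change of +50% is therefore only meaningful while the resulting threshold stays at or below 1.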
