Examine the dataset german_credit_data.csv and define the target variable. If you cannot find one, pick a variable whose column contains only two distinct values. Drop all non-numeric features from the dataset, and fill any missing values with the median. Hold out 20% of the dataset for testing, and when splitting the dataset with train_test_split use random_state=42 and stratify=y. Train a k-NN classifier with 5 neighbors and print all model metrics, including the confusion matrix.
- Then lower the threshold by 50%, and recompute and print all model metrics, including the confusion matrix.
- Then raise the threshold by 50% relative to the initial value, and recompute and print all model metrics, including the confusion matrix.
Submit the .ipynb file containing the solution.
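The fallback rule in the task (pick a column with only two values) can be checked mechanically. The following is a minimal sketch on a hypothetical toy frame; the column names and values are illustrative assumptions, not the contents of german_credit_data.csv.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the selection rule.
toy = pd.DataFrame({
    "Age": [22, 35, 47, 51],
    "Sex": ["male", "female", "male", "male"],
    "Housing": ["own", "rent", "free", "own"],
})

# Candidate targets: columns whose non-missing values take exactly two distinct values.
binary_columns = [col for col in toy.columns if toy[col].nunique(dropna=True) == 2]
print(binary_columns)
```

On the real dataset the same list comprehension could be run against df; which of the resulting columns makes a sensible target remains a judgment call.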
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn import metrics
df = pd.read_csv("german_credit_data.csv", low_memory=False)
df.head()
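# No explicit target column is used, so the binary "Sex" column serves as the target (1 = male, 0 otherwise).
target = (df["Sex"] == "male").astype(int)
# Keep only the numeric features and fill missing values with the column median.
x_num = df.select_dtypes(include=[np.number])
imputer = SimpleImputer(strategy="median")
x_num_imputed = pd.DataFrame(imputer.fit_transform(x_num), columns=x_num.columns)
df_clean = pd.concat([x_num_imputed, target], axis=1)
# Drop the target again so that only the numeric features remain.
features = df_clean.drop(columns=["Sex"]).select_dtypes(include=[np.number])
features.head()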
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.2, random_state=42, stratify=target)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_features, train_target)
predictions = knn.predict(test_features)
def evaluate_threshold(threshold, probs, y_true, title):
    y_pred = (probs >= threshold).astype(int)
    acc = metrics.accuracy_score(y_true, y_pred)
    prec = metrics.precision_score(y_true, y_pred, zero_division=0)
    rec = metrics.recall_score(y_true, y_pred, zero_division=0)
    f1 = metrics.f1_score(y_true, y_pred, zero_division=0)
    cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
    print(f"\n=== {title} ===")
    print(f"Threshold: {threshold:.2f}")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 score: {f1:.4f}")
    print("Confusion matrix (rows=true [0,1], cols=pred [0,1]):")
    print(cm)
    print("\nC. report:\n", metrics.classification_report(y_true, y_pred, digits=4, zero_division=0))
probs_test = knn.predict_proba(test_features)[:, 1]
evaluate_threshold(0.50, probs_test, test_target, "Baseline")
evaluate_threshold(0.25, probs_test, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75, probs_test, test_target, "Viši threshold (+50% from 0.50)")
=== Baseline ===
Threshold: 0.50
Accuracy: 0.6200
Precision: 0.6937
Recall: 0.8043
F1 score: 0.7450
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[ 13  49]
 [ 27 111]]
C. report:
               precision    recall  f1-score   support
           0     0.3250    0.2097    0.2549        62
           1     0.6937    0.8043    0.7450       138
    accuracy                         0.6200       200
   macro avg     0.5094    0.5070    0.4999       200
weighted avg     0.5794    0.6200    0.5930       200
=== Niži threshold (-50% from 0.50) ===
Threshold: 0.25
Accuracy: 0.6900
Precision: 0.6959
Recall: 0.9783
F1 score: 0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]
C. report:
               precision    recall  f1-score   support
           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138
    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200
=== Viši threshold (+50% from 0.50) ===
Threshold: 0.75
Accuracy: 0.4650
Precision: 0.6667
Recall: 0.4493
F1 score: 0.5368
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[31 31]
 [76 62]]
C. report:
               precision    recall  f1-score   support
           0     0.2897    0.5000    0.3669        62
           1     0.6667    0.4493    0.5368       138
    accuracy                         0.4650       200
   macro avg     0.4782    0.4746    0.4518       200
weighted avg     0.5498    0.4650    0.4841       200
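The headline metrics follow directly from the four cells of a confusion matrix. As a sanity check, the sketch below recomputes the Baseline figures by hand from the matrix printed above, with class 1 treated as the positive class; the variable names are illustrative.

```python
# Baseline confusion matrix from the output above (rows = true, cols = predicted).
tn, fp = 13, 49
fn, tp = 27, 111

total = tn + fp + fn + tp
accuracy = (tp + tn) / total            # share of all samples classified correctly
precision = tp / (tp + fp)              # share of predicted positives that are truly positive
recall = tp / (tp + fn)                 # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 score: {f1:.4f}")
```

The same four lines of arithmetic apply to the other two matrices; only the cell values change.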
Lowering the threshold by 50%
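# Every threshold from the first run is scaled by 0.5, giving 0.25, 0.125 and 0.375.
# The titles are carried over unchanged from that run, so they no longer describe the values.
probs_test_m50 = knn.predict_proba(test_features)[:, 1]
evaluate_threshold(0.5*0.5, probs_test_m50, test_target, "Baseline")
evaluate_threshold(0.25*0.5, probs_test_m50, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75*0.5, probs_test_m50, test_target, "Viši threshold (+50% from 0.50)")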
=== Baseline ===
Threshold: 0.25
Accuracy: 0.6900
Precision: 0.6959
Recall: 0.9783
F1 score: 0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]
C. report:
               precision    recall  f1-score   support
           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138
    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200
=== Niži threshold (-50% from 0.50) ===
Threshold: 0.12
Accuracy: 0.6850
Precision: 0.6884
Recall: 0.9928
F1 score: 0.8131
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  0  62]
 [  1 137]]
C. report:
               precision    recall  f1-score   support
           0     0.0000    0.0000    0.0000        62
           1     0.6884    0.9928    0.8131       138
    accuracy                         0.6850       200
   macro avg     0.3442    0.4964    0.4065       200
weighted avg     0.4750    0.6850    0.5610       200
=== Viši threshold (+50% from 0.50) ===
Threshold: 0.38
Accuracy: 0.6900
Precision: 0.6959
Recall: 0.9783
F1 score: 0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]
C. report:
               precision    recall  f1-score   support
           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138
    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200
Raising the threshold by 50%
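# Every threshold from the first run is scaled by 1.5, giving 0.75, 0.375 and 1.125.
# The titles are carried over unchanged from that run. A probability can never reach
# a threshold above 1, so the last call predicts class 0 for every sample.
probs_test_p50 = knn.predict_proba(test_features)[:, 1]
evaluate_threshold(0.5*1.5, probs_test_p50, test_target, "Baseline")
evaluate_threshold(0.25*1.5, probs_test_p50, test_target, "Niži threshold (-50% from 0.50)")
evaluate_threshold(0.75*1.5, probs_test_p50, test_target, "Viši threshold (+50% from 0.50)")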
=== Baseline ===
Threshold: 0.75
Accuracy: 0.4650
Precision: 0.6667
Recall: 0.4493
F1 score: 0.5368
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[31 31]
 [76 62]]
C. report:
               precision    recall  f1-score   support
           0     0.2897    0.5000    0.3669        62
           1     0.6667    0.4493    0.5368       138
    accuracy                         0.4650       200
   macro avg     0.4782    0.4746    0.4518       200
weighted avg     0.5498    0.4650    0.4841       200
=== Niži threshold (-50% from 0.50) ===
Threshold: 0.38
Accuracy: 0.6900
Precision: 0.6959
Recall: 0.9783
F1 score: 0.8133
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[  3  59]
 [  3 135]]
C. report:
               precision    recall  f1-score   support
           0     0.5000    0.0484    0.0882        62
           1     0.6959    0.9783    0.8133       138
    accuracy                         0.6900       200
   macro avg     0.5979    0.5133    0.4507       200
weighted avg     0.6352    0.6900    0.5885       200
=== Viši threshold (+50% from 0.50) ===
Threshold: 1.12
Accuracy: 0.3100
Precision: 0.0000
Recall: 0.0000
F1 score: 0.0000
Confusion matrix (rows=true [0,1], cols=pred [0,1]):
[[ 62   0]
 [138   0]]
C. report:
               precision    recall  f1-score   support
           0     0.3100    1.0000    0.4733        62
           1     0.0000    0.0000    0.0000       138
    accuracy                         0.3100       200
   macro avg     0.1550    0.5000    0.2366       200
weighted avg     0.0961    0.3100    0.1467       200
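Several runs above produce identical results, for example thresholds 0.25 and 0.38. A likely explanation, assuming KNeighborsClassifier is left at its default uniform weights, is that predict_proba with 5 neighbors can only return the values 0, 0.2, 0.4, 0.6, 0.8 and 1.0, so a threshold matters only through the number of positive neighbors it requires. The helper min_votes_needed below is a hypothetical illustration and not part of the notebook.

```python
def min_votes_needed(threshold, k=5):
    """Smallest number of positive neighbours v (0..k) with v / k >= threshold.

    Returns None when even k positive neighbours cannot reach the threshold.
    """
    for votes in range(k + 1):
        if votes / k >= threshold:
            return votes
    return None

# Thresholds that appear in the runs above (0.125 and 0.375 are printed as 0.12 and 0.38).
for t in (0.125, 0.25, 0.375, 0.5, 0.75, 1.125):
    print(f"threshold {t}: needs {min_votes_needed(t)} of 5 neighbours")
```

Under this assumption, 0.25 and 0.375 both require 2 positive neighbors, which matches their identical confusion matrices, while 1.125 cannot be reached at all.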