
Exercise 01a

Study the dataset gender_classification_v7.csv and define what the target variable is. If you cannot find one, pick a variable whose column contains only two distinct values. Drop all non-numeric features from the dataset, and wherever a value is missing, insert the median. Train the model with the Naive Bayes method. Prepare the dataset with the train_test_split function using random_state=42 and stratify=y, keeping 20% for testing. Then, on the remaining 80%, use cross_val_score for k-fold cross-validation with cv=5 and scoring='accuracy'. Finally, compute the model's accuracy on the test data as well. Print the mean of the scores array, as well as the error on the test data.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn import model_selection
df = pd.read_csv("bank.csv", low_memory=False)
df.head()
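The task suggests falling back to any column with exactly two values when no obvious target exists. A minimal sketch of how such columns could be found is shown here; the `toy` frame and its values are hypothetical stand-ins, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumption)
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "job": ["admin", "technician", "services", "admin"],
    "deposit": ["yes", "no", "no", "yes"],
})

# Columns with exactly two distinct non-missing values are candidates for a binary target
n_unique = toy.nunique(dropna=True)
binary_columns = n_unique[n_unique == 2].index.tolist()
print(binary_columns)
```

Running the same two lines on `df` instead of `toy` would list the candidate columns of the loaded file.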
# Binary target: 1 where "deposit" is "yes", 0 otherwise
target = (df["deposit"] == "yes").astype(int)

# Keep only the numeric features
features_num = df.select_dtypes(include=[np.number])

# Replace missing values with the column median
imputer = SimpleImputer(strategy="median")
features = pd.DataFrame(imputer.fit_transform(features_num), columns=features_num.columns, index=df.index)
# Hold out 20% for testing; stratify to preserve the class ratio in both splits
train_features, test_features, train_target, test_target = train_test_split(features, target, test_size=0.2, random_state=42, stratify=target)

# 5-fold cross-validation on the 80% training split
algorithm = GaussianNB()
scores = model_selection.cross_val_score(algorithm, train_features, train_target, cv=5, scoring="accuracy")
print(scores, scores.mean())

# Fit on the whole training split and evaluate on the held-out test set
model = algorithm.fit(train_features, train_target)
predictions = model.predict(test_features)
accuracy = accuracy_score(test_target, predictions)
[0.70940649 0.69708847 0.71780515 0.71388578 0.71876751] 0.711390679452072
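The cell above fits the imputer on the whole dataset before splitting, so the medians also see the test rows. One possible alternative, sketched here under assumptions, is to put the imputer and the classifier into a single pipeline so that the medians are recomputed on the training part of every fold. The data is synthetic (`make_classification` with artificially removed values), and the names `pipeline`, `cv_scores`, and `test_accuracy` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption): 500 rows, 6 numeric features, binary target
X, y = make_classification(n_samples=500, n_features=6, random_state=42)

# Blank out roughly 5% of the values to simulate missing data
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The imputer is refit on the training part of each fold, then GaussianNB is trained
pipeline = make_pipeline(SimpleImputer(strategy="median"), GaussianNB())
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy")

# Final fit on the full training split, scored on the held-out test split
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

On a dataset with few missing values the scores would likely differ very little from the simpler approach; the pipeline mainly keeps the evaluation free of information from the test rows.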
print("Scores", scores)
print("Scores mean", scores.mean())
print("Accuracy", accuracy)
Scores [0.70940649 0.69708847 0.71780515 0.71388578 0.71876751]
Scores mean 0.711390679452072
Accuracy 0.7218987908643081
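The task also asks for the error on the test data, while the cell above prints accuracy. Assuming "error" means the misclassification rate, it is simply the complement of accuracy; the sketch below copies the accuracy value from the output above rather than recomputing it.

```python
# Test accuracy copied from the printed output above
accuracy = 0.7218987908643081

# Misclassification rate: the share of test rows that were predicted incorrectly
test_error = 1.0 - accuracy
print("Test error", test_error)
```

In the notebook itself, the same expression can be applied directly to the `accuracy` variable.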