
Taking Uncertainty into Account

2018. 4. 20. 13:07

#!/usr/bin/env python3


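The confusion matrix and the classification report provide a detailed analysis of a set of predictions.

However, the predictions themselves have already thrown away much of the information contained in the model.

Most classifiers provide a decision_function or predict_proba method to assess how certain a prediction is.

Making a prediction amounts to thresholding the output of decision_function or predict_proba at a fixed point.

In binary classification, the threshold is 0 for decision_function and 0.5 for predict_proba.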
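As a minimal sketch of what these cutoffs of 0 and 0.5 mean in practice, the snippet below checks that predict agrees with decision_function cut at 0 and with predict_proba cut at 0.5. The synthetic data from make_classification and the choice of LogisticRegression are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary dataset (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

pred = clf.predict(X)

# decision_function > 0 selects the second entry of clf.classes_
pred_from_decision = clf.classes_[(clf.decision_function(X) > 0).astype(int)]

# a positive-class probability above 0.5 selects the same class
pred_from_proba = clf.classes_[(clf.predict_proba(X)[:, 1] > 0.5).astype(int)]

print('agreement with decision_function > 0:', np.mean(pred == pred_from_decision))
print('agreement with predict_proba > 0.5:', np.mean(pred == pred_from_proba))
```

Both agreement rates should be at or very near 1.0, which is why moving either cutoff changes the predictions without retraining the model.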

The following example is an imbalanced binary classification task with 400 data points in the negative class and 50 in the positive class.

from mglearn.plots import plot_decision_threshold

import matplotlib.pyplot as plt

plot_decision_threshold()

plt.show()

Heatmap of the decision function and the effect of changing the decision threshold

Let's use the classification_report function to evaluate precision and recall for both classes.

from mglearn.datasets import make_blobs

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import classification_report

x, y = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7, 2], random_state=22)

x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, random_state=22)

svc = SVC(gamma=0.5)  # other parameters keep their defaults: kernel='rbf', C=1, degree=3

svc.fit(x_train, y_train)

rpt_result = classification_report(y_test, svc.predict(x_test))

print('{}'.format(rpt_result))

#              precision    recall  f1-score   support
#           0       0.92      0.91      0.91       100
#           1       0.36      0.38      0.37        13
# avg / total       0.85      0.85      0.85       113

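For class 1, precision is 36% and recall is 38%. Because class 0 has far more samples, the classifier focuses on getting class 0 right.

Suppose that raising recall for class 1 is what matters most.

That means we are willing to accept more false positives (FP, points wrongly assigned to class 1) in exchange for more true positives (TP).

The predictions from svc.predict do not meet this requirement, but we can adjust them to raise recall for class 1 by moving the decision threshold.

By default, points with a decision_function value greater than 0 are classified as class 1. To have more points classified as class 1, we need to lower the threshold.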

y_pred_lower_threshold = svc.decision_function(x_test) > -0.8

rpt_result_adj = classification_report(y_test, y_pred_lower_threshold)

print('{}'.format(rpt_result_adj))

#              precision    recall  f1-score   support
#           0       0.94      0.87      0.90       100
#           1       0.35      0.54      0.42        13
# avg / total       0.87      0.83      0.85       113

Precision for class 1 dropped slightly (0.36 to 0.35), while recall rose from 0.38 to 0.54.
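The trade-off can be sketched by sweeping a few cutoffs on decision_function. This is only an illustration: scikit-learn's own make_blobs is used as a stand-in for the mglearn helper, so the data (and therefore the numbers) will not match the reports above, and the intermediate cutoff of -0.4 is an arbitrary choice.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 400 negative and 50 positive points (assumption, not the mglearn data)
X, y = make_blobs(n_samples=[400, 50], centers=None,
                  cluster_std=[7.0, 2.0], random_state=22)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=22)

svc = SVC(gamma=0.5).fit(X_train, y_train)
scores = svc.decision_function(X_test)

results = {}
for threshold in [0.0, -0.4, -0.8]:
    # Everything scoring above the cutoff is predicted as class 1
    y_pred = (scores > threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred, zero_division=0),
    )
    print('threshold {:+.1f}: precision {:.2f}, recall {:.2f}'.format(
        threshold, *results[threshold]))
```

Lowering the cutoff can only add points to the predicted positive set, so recall never decreases; precision may move in either direction.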

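To keep the example simple, the threshold here was chosen based on the test set results.

In practice the test set must not be used this way: as with any other parameter, choosing the decision threshold on the test set gives overly optimistic results. Use a validation set or cross-validation instead.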
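A minimal sketch of choosing the cutoff on a validation set instead might look like the following. The helper pick_threshold, the recall target of 0.5, the candidate grid, and the stand-in data from scikit-learn's make_blobs are all assumptions for illustration, not something taken from the original example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def pick_threshold(scores, y_true, min_recall, candidates):
    """Return the highest candidate cutoff whose recall reaches min_recall, or None."""
    for t in sorted(candidates, reverse=True):
        y_pred = (scores > t).astype(int)
        if recall_score(y_true, y_pred, zero_division=0) >= min_recall:
            return float(t)
    return None


# Stand-in data: 400 negative and 50 positive points (assumption)
X, y = make_blobs(n_samples=[400, 50], centers=None,
                  cluster_std=[7.0, 2.0], random_state=22)

# Hold out the test set first, then carve a validation set out of the remainder
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, stratify=y, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, stratify=y_trainval, random_state=22)

svc = SVC(gamma=0.5).fit(X_train, y_train)
candidates = np.linspace(-1.5, 1.5, 31)
threshold = pick_threshold(svc.decision_function(X_val), y_val,
                           min_recall=0.5, candidates=candidates)

if threshold is not None:
    # The test set is touched only once, after the cutoff is fixed
    y_pred_test = (svc.decision_function(X_test) > threshold).astype(int)
    print('threshold chosen on validation data:', threshold)
    print(classification_report(y_test, y_pred_test, zero_division=0))
```

Taking the highest cutoff that still meets the recall target keeps the number of false positives as small as the target allows; other selection rules are equally possible.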

References:

[1] Introduction to Machine Learning with Python, Andreas C. Müller and Sarah Guido


Binary Classifier Evaluation

2018. 4. 5. 18:19

#!/usr/bin/env python3


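Binary classification is arguably the most widely used and conceptually simplest machine learning task, yet even evaluating this simple task comes with many caveats.

Let's look at the ways in which accuracy can be misleading.

In binary classification there is a positive class and a negative class, and the positive class is the one we are looking for.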

1. Error Type

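Consider a cancer screening test: a negative result means the patient is healthy, and a positive result means suspected cancer and further testing.

No model works perfectly all the time, so it will sometimes misclassify.

Classifying a healthy person as positive sends them for additional testing; this kind of mistake is called a false positive.

Conversely, a person with cancer may be classified as negative and miss out on proper testing or treatment. This is far more serious than the previous error. This kind of wrong prediction is called a false negative.

In statistics, a false positive is known as a type I error and a false negative as a type II error.

In cancer screening, avoiding false negatives matters more than avoiding false positives.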
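The two error types can be counted directly from labels and predictions. The arrays below are made-up toy values (1 = cancer, 0 = healthy), used only to illustrate the definitions.

```python
import numpy as np

# Toy labels: 1 = cancer (positive), 0 = healthy (negative); made up for illustration
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# Healthy but flagged as cancer: false positive (type I error)
false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))

# Cancer but reported as healthy: false negative (type II error)
false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))

print('false positives (type I):', false_positives)
print('false negatives (type II):', false_negatives)
```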

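2. Imbalanced datasets

These two types of error matter most when one of the two classes is much more frequent than the other.

A typical example is click-through prediction, where each data point is an impression, an item that was shown to a user.

The item might be an ad, a related article, or something else.

The goal is to predict whether a user will click on (that is, be interested in) a particular item when it is shown.

If a user has to be shown 100 ads or articles before clicking on one of interest, the resulting dataset has 99 'no click' points and 1 'click' point.

In other words, 99% of the samples belong to the 'no click' (negative) class and 1% to the 'click' (positive) class.

Datasets in which one class is much more frequent than the other are called imbalanced datasets.

Suppose we build a classifier that predicts clicks with 99% accuracy. That sounds impressive, but it ignores the class imbalance.

Without building any machine learning model at all, always predicting 'no click' already gives 99% accuracy.

So a model with 99% accuracy could be a genuinely good model or one that always predicts 'no click'.

Accuracy alone cannot tell the two apart.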
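The arithmetic is easy to check. The sketch below builds a hypothetical set of 10,000 impressions with a 1% click rate (numbers chosen to mirror the text) and scores a "model" that always predicts 'no click'.

```python
import numpy as np

# 10,000 impressions with a 1% click rate (hypothetical numbers mirroring the text)
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A "model" that always predicts 'no click'
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
clicks_found = int(np.sum((y_pred == 1) & (y_true == 1)))

print('accuracy: {:.2f}'.format(accuracy))
print('clicks correctly identified:', clicks_found)
```

The accuracy is 99% even though not a single click is identified.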

To illustrate, we'll create a 9:1 imbalanced dataset from the digits dataset by classifying the digit 9 against the nine other digits.

from sklearn.datasets import load_digits

from sklearn.model_selection import train_test_split

digits = load_digits()

y = digits.target == 9

x_train, x_test, y_train, y_test = train_test_split(digits.data, y, random_state=0)

# Use a DummyClassifier that always predicts the majority class (here 'not nine') and compute its accuracy

from sklearn.dummy import DummyClassifier

import numpy as np

dummy_majority = DummyClassifier(strategy='most_frequent')

dummy_majority.fit(x_train, y_train)

pred_most_frequent = dummy_majority.predict(x_test)

print('unique predicted labels ==> {}'.format(np.unique(pred_most_frequent)))

print('test score ==> {:.3f}'.format(dummy_majority.score(x_test, y_test)))

# unique predicted labels ==> [False]

# test score ==> 0.896

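We obtained close to 90% accuracy without learning anything.

That may seem surprisingly high, but depending on the problem, always predicting one class can be enough to reach it.

Let's compare this with an actual classifier, DecisionTreeClassifier.

For more on DecisionTreeClassifier, see these earlier posts:

[1] Decision Tree -- intro, [2] Decision Tree, [3] Decision Tree Regressor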

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=2)

tree.fit(x_train, y_train)

pred_tree = tree.predict(x_test)

print('test score ==> {:.3f}'.format(tree.score(x_test, y_test)))

# test score ==> 0.918

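Judged by accuracy alone, the DecisionTreeClassifier is only slightly better than the dummy classifier.

For comparison, let's look at two more classifiers: LogisticRegression and a DummyClassifier with the stratified strategy.

The stratified DummyClassifier picks labels at random, producing predictions in the same proportions as the classes in the training set.

stratified was the default value of the strategy parameter in the scikit-learn version used for this post (newer releases default to 'prior'). Its predictions follow the class proportions of the training labels, but because they are unrelated to the true targets in y_test, accuracy is lower.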

from sklearn.linear_model import LogisticRegression

dummy = DummyClassifier(strategy='stratified')  # set explicitly; newer scikit-learn releases default to 'prior'

dummy.fit(x_train, y_train)

pred_dummy = dummy.predict(x_test)

print('unique predicted labels ==> {}'.format(np.unique(pred_dummy)))

print('dummy score ==> {:.3f}'.format(dummy.score(x_test, y_test)))

# unique predicted labels ==> [False True]

# dummy score ==> 0.820

logreg = LogisticRegression(C=0.1)

logreg.fit(x_train, y_train)

pred_logreg = logreg.predict(x_test)

print('unique predicted labels ==> {}'.format(np.unique(pred_logreg)))

print('logreg score ==> {:.3f}'.format(logreg.score(x_test, y_test)))

# unique predicted labels ==> [False True]

# logreg score ==> 0.978

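Judged by accuracy, the DummyClassifier that predicts at random does worst, while LogisticRegression does very well. But even the random classifier gets over 80% right, which makes it very hard to judge how useful these results really are.

The problem is that accuracy is an inadequate measure for quantifying predictive performance on imbalanced datasets.

In particular, we need metrics that tell us how much better a model is than frequency-based or random predictions such as pred_most_frequent and pred_dummy.

A metric used to evaluate models should be able to weed out these nonsense predictions.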
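To see why even random guessing scores above 80% here, note that a guess drawn with the same positive rate p as the labels is correct with probability p^2 + (1 - p)^2. The sketch below evaluates this for p = 0.1 (an assumed round figure for the 9:1 split, not the exact digits ratio) and checks it with a small simulation.

```python
import numpy as np

p = 0.1  # assumed share of the positive class, approximating the 9:1 split

# A stratified random guess is right when guess and label are both positive (p * p)
# or both negative ((1 - p) * (1 - p))
expected_accuracy = p ** 2 + (1 - p) ** 2

# Monte Carlo check with independent random labels and guesses
rng = np.random.default_rng(0)
n = 200_000
labels = rng.random(n) < p
guesses = rng.random(n) < p
simulated_accuracy = np.mean(labels == guesses)

print('expected accuracy: {:.3f}'.format(expected_accuracy))
print('simulated accuracy: {:.3f}'.format(simulated_accuracy))
```

By the same reasoning, always predicting the majority class is right with probability 1 - p, about 0.9, which is in line with the most_frequent score shown earlier.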

3. Confusion matrix

The confusion matrix is one of the most widely used ways to represent the result of evaluating a binary classifier.

Let's inspect the predictions of LogisticRegression with the confusion_matrix function.

from sklearn.metrics import confusion_matrix

pred_logreg = logreg.predict(x_test)

confusion = confusion_matrix(y_test, pred_logreg)

print('confusion matrix \n{}'.format(confusion))

# confusion matrix

# [[401 2]

# [ 8 39]]

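The output of confusion_matrix is a 2x2 array. Rows correspond to the true classes and columns to the predicted classes.

Each entry counts how often a sample belonging to the row's class (here 'not nine' and 'nine') was classified as the column's class.

The following code illustrates the confusion matrix visually.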

import mglearn

import matplotlib.pyplot as plt

import matplotlib

matplotlib.rc('font', family='AppleGothic')

plt.rcParams['axes.unicode_minus'] = False

mglearn.plots.plot_confusion_matrix_illustration()

plt.show()

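Confusion matrix of the 'nine' vs. 'not nine' classification problem

Entries on the main diagonal of the confusion matrix are correct classifications, while the other entries show how many samples of one class were mistakenly classified as the other.

If we declare the digit 9 to be the positive class, the entries of the confusion matrix map onto the false positives and false negatives introduced earlier.

The four entries are the true positives, false positives, true negatives, and false negatives, abbreviated TP, FP, TN, and FN: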

mglearn.plots.plot_binary_confusion_matrix()

plt.show()

Confusion matrix for binary classification
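As a small sketch, the four cells can be unpacked by name from the LogisticRegression matrix printed earlier. The matrix is typed in by hand here so the snippet runs on its own.

```python
import numpy as np

# Confusion matrix printed earlier for LogisticRegression
# (rows: true class, columns: predicted class)
confusion = np.array([[401, 2],
                      [8, 39]])

# With classes ordered [negative, positive], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion.ravel()

# Accuracy is the share of samples on the main diagonal
accuracy = (tp + tn) / confusion.sum()

print('TN={}, FP={}, FN={}, TP={}'.format(tn, fp, fn, tp))
print('accuracy: {:.3f}'.format(accuracy))
```

The accuracy recovered from the diagonal matches the logreg score reported earlier (0.978).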

Now let's use the confusion matrix to compare the models built earlier: the two DummyClassifier models, the DecisionTreeClassifier, and LogisticRegression.

pred_most_frequent = dummy_majority.predict(x_test)

pred_dummy = dummy.predict(x_test)

pred_tree = tree.predict(x_test)

pred_logreg = logreg.predict(x_test)

print('dummy model based on frequency')

print(confusion_matrix(y_test, pred_most_frequent))

print('\nrandom dummy model')

print(confusion_matrix(y_test, pred_dummy))

print('\ndecision tree')

print(confusion_matrix(y_test, pred_tree))

print('\nlogistic regression')

print(confusion_matrix(y_test, pred_logreg))

# dummy model based on frequency

# [[403 0]

# [ 47 0]]

# random dummy model

# [[365 38]

# [ 40 7]]

# decision tree

# [[390 13]

# [ 24 23]]

# logistic regression

# [[401 2]

# [ 8 39]]

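The confusion matrices show that something is clearly wrong with pred_most_frequent: it always predicts the same class.

pred_dummy, on the other hand, has very few true positives (7) compared with its false negatives (40) and false positives (38); there are far more false positives than true positives.

pred_logreg is better than pred_tree on every entry: it has more true positives and true negatives and fewer false positives and false negatives.

Inspecting every entry of these matrices reveals a lot, but the process is very manual and qualitative.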
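One way to make the comparison less manual is to reduce each matrix to a few numbers. The sketch below copies the four matrices from the output above and computes accuracy, precision, and recall for the positive class; the helper summarize is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above (rows: true, columns: predicted)
matrices = {
    'most frequent dummy': np.array([[403, 0], [47, 0]]),
    'random dummy': np.array([[365, 38], [40, 7]]),
    'decision tree': np.array([[390, 13], [24, 23]]),
    'logistic regression': np.array([[401, 2], [8, 39]]),
}


def summarize(confusion):
    """Return (accuracy, precision, recall) for the positive class."""
    tn, fp, fn, tp = confusion.ravel()
    accuracy = (tp + tn) / confusion.sum()
    # Report 0.0 when the model never predicts the positive class
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(accuracy), float(precision), float(recall)


summary = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, prec, rec) in summary.items():
    print('{:<20} accuracy {:.3f}  precision {:.3f}  recall {:.3f}'.format(
        name, acc, prec, rec))
```

The always-majority dummy keeps its high accuracy but has zero recall, which is exactly the kind of failure that accuracy hides.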


게으른 우루루