preprocessing

2018. 3. 19. 11:27

#!/usr/bin/env python3

preprocessing

0. Overview

SVMs and neural networks are very sensitive to the scale of the input features.

A common remedy is to rescale each feature separately before training.

# library imports

import mglearn

import matplotlib.pyplot as plt

mglearn.plots.plot_scaling()  # draw the example scaling plots

plt.show()  # display the figure

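The figure shows several ways to rescale or preprocess a dataset.

The first plot shows a synthetic binary classification dataset with two features.

The four plots on the right show different ways of transforming the data into a reference range.

StandardScaler sets the mean of each feature to 0 and its variance to 1, so that all features have the same magnitude.

This method does not bound the minimum and maximum values of a feature.

RobustScaler also brings the features onto the same scale, but uses the median (and quartiles) instead of the mean (and variance) ==> it is much less affected by extreme values (outliers).

MinMaxScaler shifts and rescales features with very different ranges so that every feature lies between 0 and 1.

Normalizer rescales each data point so that its feature vector has Euclidean length 1 ==> used when only the direction (angle) of the data matters, not its length.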
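To make the four descriptions above concrete, here is a minimal sketch that runs each scaler on a tiny array. The array `X` is made up for illustration (it is not the data behind `plot_scaling`), and the last row deliberately contains an outlier in the second feature.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler, Normalizer, RobustScaler, StandardScaler)

# toy data (hypothetical): two features on very different scales,
# with an outlier in the second feature of the last row
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 10000.0]])

X_std = StandardScaler().fit_transform(X)    # per feature: mean 0, variance 1
X_robust = RobustScaler().fit_transform(X)   # per feature: median 0, scaled by IQR
X_minmax = MinMaxScaler().fit_transform(X)   # per feature: min 0, max 1
X_norm = Normalizer().fit_transform(X)       # per row: Euclidean length 1

print('StandardScaler mean/std per feature:', X_std.mean(axis=0), X_std.std(axis=0))
print('RobustScaler median per feature:', np.median(X_robust, axis=0))
print('MinMaxScaler min/max per feature:', X_minmax.min(axis=0), X_minmax.max(axis=0))
print('Normalizer row lengths:', np.linalg.norm(X_norm, axis=1))
```

Note that the first three scalers work column by column (per feature), while Normalizer works row by row (per data point).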

1. Analyzing the breast cancer data with preprocessing (SVM, Support Vector Machine)

# library imports

from sklearn.datasets import load_breast_cancer

from sklearn.model_selection import train_test_split

# load the data

cancer = load_breast_cancer()

# split the data into a train set and a test set

x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target,  # data to split
    test_size=0.25, random_state=1)  # test fraction, random seed

# shapes of the split data

print('x_train.shape \n{}'.format(x_train.shape))

print('x_test.shape \n{}'.format(x_test.shape))

# import the preprocessing class

from sklearn.preprocessing import MinMaxScaler

# create the scaler object

scaler = MinMaxScaler()

### The fit method of MinMaxScaler computes the minimum and maximum of each feature in the train set ###

### Only x_train is passed to fit ###

scaler.fit(x_train)  # learn the per-feature min and max

### fit only learns the parameters; to actually rescale the training data, call the scaler's transform method ###

### transform: the method used to produce a new representation of the data ###

# transform the data

x_train_scaled = scaler.transform(x_train)

x_test_scaled = scaler.transform(x_test)
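As a sanity check on what fit and transform do, the sketch below redoes the min-max scaling by hand with the statistics of the train set and compares it with the scaler's output. The variable names `train_min`, `train_range`, and `manual_test_scaled` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=1)

scaler = MinMaxScaler().fit(x_train)

# what fit learned, recomputed by hand from the train set only
train_min = x_train.min(axis=0)
train_range = x_train.max(axis=0) - train_min

# what transform does: subtract the train minimum, divide by the train range
manual_test_scaled = (x_test - train_min) / train_range
same = np.allclose(manual_test_scaled, scaler.transform(x_test))
print('manual scaling matches transform:', same)

# fit_transform is a shortcut for fit followed by transform on the same data
x_train_scaled = MinMaxScaler().fit_transform(x_train)
```

The shortcut fit_transform is only appropriate for the train set; the test set should go through transform with the already fitted scaler.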

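# properties of the train set after scaling

print('size: \n{}'.format(x_train_scaled.shape))

print('min before scaling: \n{}'.format(x_train.min(axis=0)))  # axis=0: one value per column (feature); axis=1 would give one value per row

print('max before scaling: \n{}'.format(x_train.max(axis=0)))

print('min after scaling: \n{}'.format(x_train_scaled.min(axis=0)))  # minimum: 0

print('max after scaling: \n{}'.format(x_train_scaled.max(axis=0)))  # maximum: 1

# properties of the test set after scaling

print('per-feature min after scaling: \n{}'.format(x_test_scaled.min(axis=0)))  # the minimum is not exactly 0

print('per-feature max after scaling: \n{}'.format(x_test_scaled.max(axis=0)))  # the maximum is not exactly 1

### The test set is scaled with the min and max of the train set, so its values can fall outside [0, 1] ###

### Whichever scaler is used, exactly the same transformation must be applied to the train set and the test set ###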
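A small sketch of why the same transformation matters, using made-up one-feature arrays (`train` and `test` are hypothetical, not the breast cancer data). Fitting a second scaler on the test set moves the test points relative to the training points.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical one-feature data
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[4.0], [6.0]])

# correct: fit on the train set, reuse the same scaler for the test set
scaler = MinMaxScaler().fit(train)
test_ok = scaler.transform(test)

# incorrect: fitting a separate scaler on the test set
test_bad = MinMaxScaler().fit_transform(test)

print('scaled with the train statistics:', test_ok.ravel())
print('scaled with its own statistics:  ', test_bad.ravel())
```

With the train statistics, 4 and 6 stay close to the middle of the training range; with their own statistics they are stretched to the two ends of [0, 1], which no longer matches what the model saw during training.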

### Train on the scaled data ###

# library import

from sklearn.svm import SVC

# create and train the models

svm = SVC(C=100)

svm.fit(x_train, y_train)  # trained on the unscaled data

svm_scaled = SVC(C=100).fit(x_train_scaled, y_train)  # trained on the MinMaxScaler output

# evaluate on the test set

print('test accuracy before scaling \n{:.3f}'.format(svm.score(x_test, y_test)))  # 0.615

print('test accuracy after scaling \n{:.3f}'.format(svm_scaled.score(x_test_scaled, y_test)))  # 0.965
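A note for anyone re-running this on a recent scikit-learn: the default value of SVC's `gamma` parameter changed from 'auto' to 'scale' in version 0.22, so the unscaled accuracy may not match the 0.615 quoted above. The sketch below passes `gamma='auto'` explicitly, on the assumption that this approximates the default in effect when the post was written; no particular accuracy values are claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=1)

scaler = MinMaxScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

# gamma='auto' (1 / n_features) is an assumption about the older default
svm_unscaled = SVC(C=100, gamma='auto').fit(x_train, y_train)
svm_scaled = SVC(C=100, gamma='auto').fit(x_train_scaled, y_train)

acc_unscaled = svm_unscaled.score(x_test, y_test)
acc_scaled = svm_scaled.score(x_test_scaled, y_test)

print('test accuracy, unscaled: {:.3f}'.format(acc_unscaled))
print('test accuracy, scaled:   {:.3f}'.format(acc_scaled))
```

With 'auto', gamma ignores the magnitude of the features, which is why the unscaled data (some features are in the thousands) hurts the RBF kernel so much.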

### Train on the scaled data ###

### Rescale with StandardScaler so that each feature has mean 0 and variance 1 ###

# library import

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(x_train)

x_train_scaled = scaler.transform(x_train)

x_test_scaled = scaler.transform(x_test)

# train the SVM on the rescaled data

svm.fit(x_train_scaled, y_train)

# accuracy on the rescaled test set

print('test accuracy after scaling \n{:.3f}'.format(svm.score(x_test_scaled, y_test)))  # 0.965

Because SVM is sensitive to feature scaling, the test accuracy differs greatly between the unscaled data (0.615) and the scaled data (0.965).
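The fit / transform / fit steps above can also be bundled so that the scaler is always fitted on the train set only. The sketch below uses scikit-learn's make_pipeline for this; it is an alternative way of writing the same steps, not something the original code used, and the names `pipe`, `pipe_acc`, and `manual_acc` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=1)

# pipeline: fit() scales x_train and trains the SVM;
# score() applies the same fitted scaler to x_test before predicting
pipe = make_pipeline(StandardScaler(), SVC(C=100))
pipe.fit(x_train, y_train)
pipe_acc = pipe.score(x_test, y_test)

# the same steps written out by hand, for comparison
scaler = StandardScaler().fit(x_train)
svm = SVC(C=100).fit(scaler.transform(x_train), y_train)
manual_acc = svm.score(scaler.transform(x_test), y_test)

print('pipeline accuracy: {:.3f}'.format(pipe_acc))
print('manual accuracy:   {:.3f}'.format(manual_acc))
```

Both routes apply identical transformations, so the two accuracies should agree; the pipeline simply removes the chance of accidentally fitting the scaler on the test set.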

