k-Means as Vector Quantization

2018. 3. 22. 10:02

#!/usr/bin/env python3

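0. Overview

PCA: the directions of greatest variance in the data

NMF: additive components that correspond to the extremes or parts of the data

k-means: represents each data point by a cluster center ==> each data point can be seen as being represented by a single component, its cluster center

Viewing k-means this way, as a decomposition in which each point is expressed by a single component, is called vector quantization.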
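As a minimal sketch of this idea, the snippet below encodes a few 2-D points against a small codebook and then rebuilds each point from its code. The `quantize` helper, the codebook values, and the sample points are all hypothetical and chosen only for illustration; in k-means the codebook would be the learned cluster centers.

```python
import numpy as np

def quantize(points, codebook):
    """Return, for each point, the index of its nearest codebook vector."""
    # Squared Euclidean distance between every point and every code vector
    dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Hypothetical codebook (in k-means these would be the cluster centers)
codebook = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
points = np.array([[0.5, -0.2], [4.0, 6.0], [9.0, 1.0], [5.5, 4.5]])

codes = quantize(points, codebook)   # one integer per point
reconstructed = codebook[codes]      # each point replaced by its code vector
print(codes)
```

Each point is stored as a single integer, and the reconstruction is simply the code vector that integer points to.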

1. Visualizing the components extracted by PCA, NMF, and k-means

# library import
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from sklearn.decomposition import NMF
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# matplotlib settings
matplotlib.rc('font', family='AppleGothic')
plt.rcParams['axes.unicode_minus'] = False

# create the people object
people = fetch_lfw_people(min_faces_per_person=20, resize=0.7, color=False) # minimum images per person, resize ratio, grayscale
image_shape = people.images[0].shape

# keep at most 50 images per person (mask entries set to True)
idx = np.zeros(people.target.shape, dtype=bool) # boolean mask of 3023 False values
for target in np.unique(people.target):
    idx[np.where(people.target == target)[0][:50]] = 1

# extract the data
x_people = people.data[idx]
y_people = people.target[idx]

# split the data
x_train, x_test, y_train, y_test = train_test_split(
    x_people, y_people, # data to split
    stratify=y_people, random_state=0) # stratify by class, random seed

# create and fit the models
nmf = NMF(n_components=100, random_state=0).fit(x_train)
pca = PCA(n_components=100, random_state=0).fit(x_train)
kmeans = KMeans(n_clusters=100, random_state=0).fit(x_train)

### pca.transform(x_test): the data reduced to n_components dimensions
### pca.inverse_transform(pca.transform(x_test)): maps the reduced representation back to the original feature space
### kmeans.cluster_centers_[kmeans.predict(x_test)]: reconstructs each point by using its predicted label as a row index into the cluster centers
x_reconst_nmf = np.dot(nmf.transform(x_test), nmf.components_) # same as nmf.inverse_transform(nmf.transform(x_test))
x_reconst_pca = pca.inverse_transform(pca.transform(x_test))
x_reconst_kmeans = kmeans.cluster_centers_[kmeans.predict(x_test)]

# visualization
fig, axes = plt.subplots(3, 5,
                         subplot_kw={'xticks':(), 'yticks':()})
for ax, comp_nmf, comp_pca, comp_kmeans in zip(
        axes.T, nmf.components_, pca.components_, kmeans.cluster_centers_): # axes is indexed row-first, so transpose it to fill one column per iteration
    ax[0].imshow(comp_nmf.reshape(image_shape))
    ax[1].imshow(comp_pca.reshape(image_shape))
    ax[2].imshow(comp_kmeans.reshape(image_shape))

axes[0, 0].set_ylabel('nmf')
axes[1, 0].set_ylabel('pca')
axes[2, 0].set_ylabel('kmeans')
plt.gray()
plt.show()

Comparison of the components found by NMF, PCA, and the cluster centers of k-means

# visualize the reconstructions
fig, axes = plt.subplots(4, 5,
                         subplot_kw={'xticks':(), 'yticks':()}) # hide the subplot ticks
for ax, orig, rec_nmf, rec_pca, rec_kmeans in zip(
        axes.T, x_test, x_reconst_nmf, x_reconst_pca, x_reconst_kmeans):
    ax[0].imshow(orig.reshape(image_shape))
    ax[1].imshow(rec_nmf.reshape(image_shape))
    ax[2].imshow(rec_pca.reshape(image_shape))
    ax[3].imshow(rec_kmeans.reshape(image_shape))

# row labels must follow the plotting order above: original, nmf, pca, kmeans
for i, name in zip(range(4), ['original', 'nmf', 'pca', 'kmeans']):
    axes[i, 0].set_ylabel(name)
plt.gray()
plt.show()

Image reconstruction by NMF, PCA, and k-means using 100 components (or cluster centers)
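The three reconstructions can also be compared numerically. The sketch below is not the experiment above: it assumes scikit-learn's bundled `load_digits` data as a stand-in for the LFW faces so that it runs without a download, and it uses 10 components instead of 100. The variable names and the NMF settings (`init='nndsvda'`, `max_iter=500`) are likewise choices made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.model_selection import train_test_split

# Stand-in dataset (assumption): 8x8 digit images with non-negative pixels
digits = load_digits()
x_tr, x_te = train_test_split(digits.data, random_state=0)

n = 10  # number of components / cluster centers, chosen for illustration
pca_d = PCA(n_components=n, random_state=0).fit(x_tr)
nmf_d = NMF(n_components=n, init='nndsvda', max_iter=500, random_state=0).fit(x_tr)
km_d = KMeans(n_clusters=n, n_init=10, random_state=0).fit(x_tr)

# Same three reconstruction recipes as in the face example
recon = {
    'pca': pca_d.inverse_transform(pca_d.transform(x_te)),
    'nmf': np.dot(nmf_d.transform(x_te), nmf_d.components_),
    'kmeans': km_d.cluster_centers_[km_d.predict(x_te)],
}

# Mean squared error between each reconstruction and the original test data
errors = {name: float(np.mean((x_te - r) ** 2)) for name, r in recon.items()}
for name, err in errors.items():
    print(f'{name}: mean squared reconstruction error = {err:.3f}')
```

PCA and NMF rebuild each image from a weighted combination of all components, while k-means replaces it with a single cluster center.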

Vector quantization with k-means can encode the data using more clusters than the data has dimensions.

2. k-means with more cluster centers than features

# library import
from sklearn.datasets import make_moons

# create the dataset
x, y = make_moons(n_samples=200, noise=0.05, random_state=0) # noise: standard deviation of the Gaussian noise added to the data

# create and fit the model, then predict
kmeans = KMeans(n_clusters=10, random_state=0).fit(x)
y_pred = kmeans.predict(x)

plt.scatter(x[:, 0], x[:, 1], c=y_pred, s=60, cmap='Paired', edgecolors='black', alpha=0.7) # cmap: color palette
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c=np.unique(kmeans.labels_), # c: one color per cluster
            s=100, marker='s', linewidths=2, cmap='Paired', edgecolors='black')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()

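k-means with many clusters, used to handle a dataset with a complex shape

Because 10 clusters were used, the points are divided into 10 groups.

Each point can then be described by 10 new features: 1 for the cluster it belongs to and 0 for all the others.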
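A minimal sketch of this 10-feature encoding, rebuilding the same two-moons data so that the snippet runs on its own. The names `x_m`, `km`, `one_hot`, and `dist_features` are introduced here for illustration, and `n_init=10` is set explicitly only to keep behavior stable across scikit-learn versions. `KMeans.transform`, which returns the distance from each point to every cluster center, is shown as a related alternative that the text above does not cover.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

x_m, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(x_m)

labels = km.predict(x_m)

# One-hot encoding: row i has a 1 in the column of its cluster and 0 elsewhere
one_hot = np.eye(km.n_clusters)[labels]

# Alternative representation: distance from each point to each cluster center
dist_features = km.transform(x_m)

print(one_hot.shape, dist_features.shape)
```

Either array turns the 2 original features into 10, which a linear model could then use to separate the two half-moons.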

scikit-learn also provides MiniBatchKMeans, which can handle large datasets.
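As a rough sketch, `MiniBatchKMeans` can be used in place of `KMeans` with the same fit/predict pattern. The dataset here is the small two-moons set again, used only so that the snippet runs quickly, and `batch_size=50` and `n_init=3` are arbitrary values picked for illustration.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_moons

x_m, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Centers are updated from small random batches instead of the full dataset
mbk = MiniBatchKMeans(n_clusters=10, batch_size=50, n_init=3, random_state=0).fit(x_m)
mbk_labels = mbk.predict(x_m)

print(mbk.cluster_centers_.shape)
```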

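A drawback of k-means is that it assumes a particular cluster shape (roughly circular), so its range of application is relatively limited.

The number of clusters to find must also be specified in advance.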

게으른 우루루
