Classifying Handwritten Digits

Dataset: MNIST (http://yann.lecun.com/exdb/mnist/)


Problem Definition

Take a 28*28-pixel handwritten digit image as input and recognize the digit it represents.

Hypothesis

With 784 feature values per image, a machine learning model can predict which digit it shows.

Goal

Obtain the label value from a 28*28 image.

Data Description

784 input features (28*28 pixels), one output column (label), and 10,000 rows in total.
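As a quick illustration of how each 28*28 image maps to the 784 features, a minimal NumPy sketch (the placeholder array below is not project data):

import numpy as np

# Illustrative sketch: a 28*28 image flattened row by row becomes a
# 784-length feature vector, and reshape restores the original layout.
image = np.zeros((28, 28))            # placeholder image
features = image.reshape(-1)          # shape (784,)
restored = features.reshape(28, 28)   # shape (28, 28)
print(features.shape, restored.shape)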

Analysis Steps

  1. Loading the data
In [15]:
import pandas as pd
df=pd.read_csv('data/digit.csv')
df.head(10)
Out[15]:
pixel 1,1 pixel 1,2 pixel 1,3 pixel 1,4 pixel 1,5 pixel 1,6 pixel 1,7 pixel 1,8 pixel 1,9 pixel 1,10 ... pixel 28,20 pixel 28,21 pixel 28,22 pixel 28,23 pixel 28,24 pixel 28,25 pixel 28,26 pixel 28,27 pixel 28,28 label
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 7

10 rows × 785 columns

In [2]:
df.describe()
Out[2]:
pixel 1,1 pixel 1,2 pixel 1,3 pixel 1,4 pixel 1,5 pixel 1,6 pixel 1,7 pixel 1,8 pixel 1,9 pixel 1,10 ... pixel 28,20 pixel 28,21 pixel 28,22 pixel 28,23 pixel 28,24 pixel 28,25 pixel 28,26 pixel 28,27 pixel 28,28 label
count 10000.0 10000.0 10000.0 10000.0 10000.0 10000.0 10000.0 10000.0 10000.0 10000.0 ... 10000.000000 10000.000000 10000.000000 10000.000000 10000.0 10000.0 10000.0 10000.0 10000.0 10000.000000
mean 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000480 0.000239 0.000050 0.000025 0.0 0.0 0.0 0.0 0.0 4.453400
std 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.017804 0.013588 0.003535 0.002500 0.0 0.0 0.0 0.0 0.0 2.884451
min 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000
25% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 2.000000
50% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 4.000000
75% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 7.000000
max 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.988281 0.988281 0.250000 0.250000 0.0 0.0 0.0 0.0 0.0 9.000000

8 rows × 785 columns

In [3]:
df['label'].value_counts()
Out[3]:
1    1134
7    1101
2    1016
3    1006
9     986
6     974
0     971
4     961
8     942
5     909
Name: label, dtype: int64
  2. EDA & Feature Engineering
In [4]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
In [5]:
sns.countplot(data=df,x='label')
plt.show()
In [6]:
sns.catplot(data=df,x='label',kind='count')
plt.show()
In [7]:
import numpy as np
numbers=df.drop(['label'],axis=1)
nth=5
img=np.reshape(numbers.iloc[nth].values,[28,28])
plt.imshow(img)
plt.show()
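To inspect more than one sample at a time, a small sketch that reuses the numbers and df objects above to plot the first eight digits with their labels:

# Sketch: show the first 8 digits with their labels in one figure.
fig, axes = plt.subplots(2, 4, figsize=(8, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(np.reshape(numbers.iloc[i].values, [28, 28]), cmap='gray')
    ax.set_title('label: {}'.format(df['label'].iloc[i]))
    ax.axis('off')
plt.tight_layout()
plt.show()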
  3. Constructing the data set
In [8]:
#df.drop([],axis=1)
from sklearn.model_selection import train_test_split
train_data=df.drop(['label'],axis=1)
target_data=df['label']
x_train,x_test,y_train,y_test=train_test_split(train_data,target_data,test_size=0.2)
x_train,x_valid,y_train,y_valid=train_test_split(x_train,y_train,test_size=0.2)
In [9]:
print(train_data.shape)
print(x_train.shape,y_train.shape)
print(x_valid.shape,y_valid.shape)
print(x_test.shape,y_test.shape)
(10000, 784)
(6400, 784) (6400,)
(1600, 784) (1600,)
(2000, 784) (2000,)
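The split above changes on every run. If reproducibility and balanced classes matter, train_test_split also accepts random_state and stratify; a sketch with the same variables and an arbitrary seed of 42:

# Sketch: a reproducible, label-balanced split (seed value is arbitrary).
x_train, x_test, y_train, y_test = train_test_split(
    train_data, target_data, test_size=0.2, random_state=42, stratify=target_data)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train)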
  4. Modeling & training
  5. Model evaluation and validation
    1. DecisionTreeClassifier
In [10]:
from sklearn.tree import DecisionTreeClassifier
tree=DecisionTreeClassifier().fit(x_train,y_train)
print('train set score:',tree.score(x_train,y_train))
print('valid set score:',tree.score(x_valid,y_valid))
train set score: 1.0
valid set score: 0.78
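A train score of 1.0 against a validation score of 0.78 suggests the unpruned tree overfits; one common check is to cap max_depth (a sketch; the candidate depths are arbitrary):

# Sketch: limit tree depth to reduce overfitting and compare scores.
for depth in [5, 10, 15, 20]:
    pruned = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_train, y_train)
    print(depth, pruned.score(x_train, y_train), pruned.score(x_valid, y_valid))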
    2. RandomForestClassifier
In [11]:
from sklearn.ensemble import RandomForestClassifier
forest=RandomForestClassifier().fit(x_train,y_train)
print('train set score:',forest.score(x_train,y_train))
print('valid set score:',forest.score(x_valid,y_valid))
train set score: 1.0
valid set score: 0.941875
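Before the final test-set evaluation, it can help to see which digits the forest confuses on the validation set; a sketch using scikit-learn's confusion_matrix and classification_report:

# Sketch: per-digit error analysis on the validation set.
from sklearn.metrics import confusion_matrix, classification_report

valid_pred = forest.predict(x_valid)
print(confusion_matrix(y_valid, valid_pred))       # rows: true labels, columns: predictions
print(classification_report(y_valid, valid_pred))  # precision/recall/F1 per digit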
  6. Drawing the final conclusion
In [12]:
print('test set score:', forest.score(x_test,y_test))
prediction=forest.predict(x_test)
prediction
test set score: 0.948
Out[12]:
array([7, 0, 6, ..., 4, 7, 0], dtype=int64)

Result: with the RandomForestClassifier model, handwritten digits can be recognized (OCR) with 94.8% accuracy on the test set.

In [13]:
import random  # for picking random test samples
for i in range(4):
    n=random.randrange(0,len(x_test))
    img=np.reshape(x_test.iloc[n].values,[28,28])
    plt.imshow(img)
    plt.show()
    result=forest.predict([x_test.iloc[n].values])[0]  # wrap the row in [] so predict receives a 2-D input
    print("The recognized digit is", result)
The recognized digit is 8
The recognized digit is 8
The recognized digit is 1
The recognized digit is 6
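To confirm those predictions against the ground truth, the same loop can also print the true label from y_test (a sketch reusing the variables above):

# Sketch: print the prediction next to the true label for a few random test rows.
for i in range(4):
    n = random.randrange(0, len(x_test))
    pred = forest.predict(x_test.iloc[[n]])[0]   # double brackets keep a 2-D DataFrame row
    print("predicted:", pred, "| actual:", y_test.iloc[n])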

Additional model: SVC

In [14]:
from sklearn.svm import SVC
model=SVC().fit(x_train,y_train)
print("train set score:",model.score(x_train,y_train))
print("valid set score:",model.score(x_valid,y_valid))
train set score: 0.98515625
valid set score: 0.951875

Training takes a long time, but the Support Vector Machine is well suited to this classification task; its validation score (0.951875) is the highest of the three models tried.
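One way to make "takes a long time" concrete is to time the fit and compare against LinearSVC, a linear-kernel alternative that is usually much faster, though often a bit less accurate here (a sketch; timings vary by machine and the max_iter value is arbitrary):

# Sketch: time the fit of a faster linear-kernel alternative.
import time
from sklearn.svm import LinearSVC

start = time.time()
linear_model = LinearSVC(max_iter=5000).fit(x_train, y_train)
print("LinearSVC fit time (s):", round(time.time() - start, 1))
print("LinearSVC valid set score:", linear_model.score(x_valid, y_valid))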
