Measuring Wine Quality

Original data source: UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/index.php


Problem Definition

Predict the quality of a wine from its chemical measurement data.

Hypothesis

Using quantitative chemical data such as acidity and alcohol content, it is possible to measure wine quality without a sensory (tasting) step.

Goal

Take the chemical feature data as input and output the wine quality as a number between 0 and 10.


Data Composition

11 feature columns

  1. fixed acidity
  2. volatile acidity ... see the data description page for the remaining features

Summary

1,599 samples; output (target) column: quality, the wine quality score
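
For reference, a minimal sketch to verify this schema once the file is available (the data/wine.csv path is taken from Section 1 below, not from the UCI archive itself):

import pandas as pd

# path assumed from Section 1 of this notebook
df = pd.read_csv('data/wine.csv')
print(df.shape)          # expected (1599, 12): 11 features + the quality column
print(list(df.columns))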


Analysis Steps

Step 1. Load the data

Step 2. EDA & Feature Engineering

Step 3. Build the dataset

Step 4. Modeling and training

Step 5. Model validation


1. Loading the Data

In [1]:
import pandas as pd

# load the red-wine data (a local copy of the UCI dataset)
df = pd.read_csv('data/wine.csv')
df
Out[1]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 4.617195
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68 9.8 4.782987
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65 9.8 4.868157
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58 9.8 5.929590
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56 9.4 4.714931
... ... ... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58 10.5 5.420490
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76 11.2 5.730746
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75 11.0 6.337004
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71 10.2 4.697053
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66 11.0 5.624180

1599 rows × 12 columns

In [2]:
df.describe()
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.622542
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.858455
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 2.552934
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 4.982849
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 5.568807
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.189646
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.456527
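
A quick follow-up worth running before plotting (a minimal sketch; the UCI red-wine data is known to be complete, but the check is cheap):

# count missing values per column; all zeros means no imputation is needed
print(df.isnull().sum())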

2. EDA & Feature Engineering

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
In [4]:
# distribution of the target: a KDE of the quality scores
facet = sns.FacetGrid(df, aspect = 4)
facet.map(sns.kdeplot, 'quality')
facet.add_legend()
plt.show()
In [5]:
# pairwise correlations between all 12 columns
plt.figure(figsize = (20, 15))
sns.heatmap(df.corr(), annot = True, fmt = '.2f', square = True)
plt.show()
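
To read off which features track the target most strongly, the quality column of the correlation matrix can also be sorted directly (a small sketch reusing df from above):

# correlation of each feature with the target, strongest first
print(df.corr()['quality'].sort_values(ascending = False))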

3. Building the Dataset

In [6]:
# inputs: the 11 chemical features; target: the quality score
input_data = df.drop(['quality'], axis = 1)
target_data = df['quality']

print(input_data.shape, target_data.shape)
(1599, 11) (1599,)
In [7]:
from sklearn.model_selection import train_test_split

# hold out 20% of the rows for testing; no random_state is set, so the split (and the scores below) vary between runs
x_train, x_test, y_train, y_test = train_test_split(input_data, target_data, test_size = 0.2)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
(1279, 11) (320, 11) (1279,) (320,)

4. Modeling & Training

Random Forest

In [8]:
from sklearn.ensemble import RandomForestRegressor
# ensemble of 100 regression trees
forest = RandomForestRegressor(n_estimators = 100)

forest.fit(x_train, y_train)
Out[8]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
In [9]:
print('training set R^2 score:', forest.score(x_train, y_train))
training set R^2 score: 0.903068691224136
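
Note that for a regressor, score() returns the coefficient of determination R^2, not classification accuracy. A random forest also exposes per-feature importances, which hint at which chemical measurements drive the prediction; a minimal sketch using the trained forest and x_train from above:

import pandas as pd

# impurity-based importance of each input feature, largest first
importances = pd.Series(forest.feature_importances_, index = x_train.columns)
print(importances.sort_values(ascending = False))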

Support Vector Machine

In [10]:
from sklearn.svm import SVR
model = SVR()
model.fit(x_train, y_train)
Out[10]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [11]:
print('training set R^2 score:', model.score(x_train, y_train))
training set R^2 score: 0.17001292337779783
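
The SVR fits far worse than the forest, which is expected: an RBF-kernel SVM is sensitive to feature scale, and these columns range from roughly 0.01 (chlorides) to nearly 300 (total sulfur dioxide). A common remedy, sketched here rather than taken from the original notebook, is to standardize the features in a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# scale every feature to zero mean / unit variance before it reaches the SVR
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(x_train, y_train)
print('training set R^2 score:', scaled_svr.score(x_train, y_train))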

5. Model Validation

Random Forest

In [12]:
print('test set R^2 score:', forest.score(x_test, y_test))
test set R^2 score: 0.44084457890872875
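
R^2 is only one view of regression error. Mean absolute error and root mean squared error are expressed in units of the quality score itself and are often easier to interpret; a minimal sketch with sklearn.metrics:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_predict = forest.predict(x_test)
print('MAE :', mean_absolute_error(y_test, y_predict))          # average absolute miss
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_predict)))  # penalizes large misses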
In [13]:
# predictions for the held-out test set
y_predict = forest.predict(x_test)
In [14]:
comparison = pd.DataFrame(y_test)

# place the actual quality and the model's prediction side by side
comparison['prediction'] = y_predict

comparison
Out[14]:
quality prediction
1158 6.332934 5.729461
1490 5.986880 6.112312
954 6.346535 6.080602
1487 4.802313 5.931343
27 5.482463 5.348220
... ... ...
1469 2.837674 5.346231
206 6.909932 6.542674
767 4.913243 5.374792
729 5.826869 5.581858
704 4.354485 5.204601

320 rows × 2 columns


Devising a Better Evaluation Metric (Categorization)

In [ ]:
# note: Python 3 rounds a trailing .5 to the nearest even integer (banker's rounding)
round(125.5)
In [ ]:
# round the actual and predicted scores into integer quality categories
comparison = round(comparison)
comparison.head()
In [ ]:
# True where the rounded prediction matches the rounded quality
evaluation = (comparison['quality'] == comparison['prediction'])
evaluation
In [ ]:
success = (evaluation == True).sum()
failure = (evaluation == False).sum()

print(success / (success + failure)) # prediction success rate
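
The same success rate can be computed with scikit-learn's accuracy_score once both columns are rounded to integer categories; a minimal equivalent sketch:

from sklearn.metrics import accuracy_score

# comparison was rounded above, so both columns now hold integer-valued categories
print(accuracy_score(comparison['quality'].astype(int),
                     comparison['prediction'].astype(int)))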