Measuring Wine Quality


Problem Definition

Let's estimate a wine's quality from its chemical data.

Hypothesis

Using quantitative chemical data such as acidity and alcohol content, wine quality can be estimated without any sensory tasting.

Goal

Take chemical feature data as input and output wine quality as a number between 0 and 10.


Data Composition

11 feature columns

Summary

Count: 1,599 rows. Output column: quality (wine quality score).

Analysis Steps

  1. Load the data
In [1]:
import pandas as pd
df=pd.read_csv('data/wine.csv')
df.head(10)
Out[1]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 4.617195
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 4.782987
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 4.868157
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 5.929590
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 4.714931
5 7.4 0.66 0.00 1.8 0.075 13.0 40.0 0.9978 3.51 0.56 9.4 4.865404
6 7.9 0.60 0.06 1.6 0.069 15.0 59.0 0.9964 3.30 0.46 9.4 5.453404
7 7.3 0.65 0.00 1.2 0.065 15.0 21.0 0.9946 3.39 0.47 10.0 7.369344
8 7.8 0.58 0.02 2.0 0.073 9.0 18.0 0.9968 3.36 0.57 9.5 7.113026
9 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 4.883397
In [2]:
df.describe()
Out[2]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.622542
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.858455
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 2.552934
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 4.982849
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 5.568807
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.189646
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.456527
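The describe() output above shows all counts equal to 1599, which suggests there are no missing values; it is still worth confirming that explicitly before modeling. A minimal sketch, using a small synthetic frame as a stand-in for the wine data:

```python
import pandas as pd

# Synthetic stand-in for the wine frame; the notebook loads 'data/wine.csv'.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [4.6, 4.8, 5.9],
})

# isnull() flags any gaps that would need imputation before training;
# dtypes confirms every column is numeric and ready for a regressor.
missing = df.isnull().sum()
print(missing)
print(df.dtypes)
```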
  2. EDA & Feature Engineering
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
In [4]:
facet=sns.FacetGrid(df,aspect=4)
facet.map(sns.kdeplot,'quality')
facet.add_legend()
plt.show()
In [5]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(),annot=True,fmt='.2f',square=True)
#annot: write the coefficient in each cell  #fmt: decimal format  #square: square cells
plt.show()
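Reading the numbers directly is often easier than scanning the heatmap: df.corr()['quality'] ranks each feature's linear relationship with quality. A sketch on a hypothetical mini-frame (the values are illustrative, not the real dataset's correlations):

```python
import pandas as pd

# Illustrative mini-frame: alcohol rises with quality, volatile acidity falls.
df = pd.DataFrame({
    "alcohol":          [9.4, 9.8, 10.5, 11.1, 12.0],
    "volatile acidity": [0.70, 0.66, 0.50, 0.40, 0.30],
    "quality":          [4.6, 4.9, 5.5, 6.1, 6.8],
})

# Correlation of every feature with quality, strongest positive first.
corr_with_quality = df.corr()["quality"].drop("quality").sort_values(ascending=False)
print(corr_with_quality)
```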
  3. Build the data set
In [6]:
#df.drop([],axis=1)
from sklearn.model_selection import train_test_split
train_data=df.drop(['quality'],axis=1)
target_data=df['quality']
x_train,x_test,y_train,y_test=train_test_split(train_data,target_data,test_size=0.2)
x_train,x_valid,y_train,y_valid=train_test_split(x_train,y_train,test_size=0.2)
In [7]:
print(train_data.shape)
print(x_train.shape,y_train.shape)
print(x_valid.shape,y_valid.shape)
print(x_test.shape,y_test.shape)
(1599, 11)
(1023, 11) (1023,)
(256, 11) (256,)
(320, 11) (320,)
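The splits above omit random_state, so the 1023/256/320 shapes come from a different shuffle on each run. A sketch of a reproducible version on synthetic data of the same size (random_state=42 is an arbitrary choice, not the notebook's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,599 x 11 wine matrix.
X = np.random.rand(1599, 11)
y = np.random.rand(1599)

# random_state pins the shuffle, making the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=42)

print(x_train.shape, x_valid.shape, x_test.shape)
```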
  4. Modeling & training
  5. Model evaluation & validation
1. Decision Tree Regressor
In [8]:
from sklearn.tree import DecisionTreeRegressor
tree=DecisionTreeRegressor().fit(x_train,y_train)
print("train set score:",tree.score(x_train,y_train))
print("valid set score:",tree.score(x_valid,y_valid))
train set score: 0.9894181206851056
valid set score: -0.22743273322312338
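A near-perfect train score paired with a negative validation score is classic overfitting: an unconstrained tree memorizes the training set. Limiting max_depth is one common remedy; a sketch on synthetic data (the depth value is an assumption, not a tuning the notebook performs):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise, loosely mimicking 11 features.
X, y = make_regression(n_samples=500, n_features=11, noise=20.0, random_state=0)
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree fits the training noise; a shallow tree cannot.
full = DecisionTreeRegressor(random_state=0).fit(x_tr, y_tr)
pruned = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_tr, y_tr)

print("full tree  :", full.score(x_tr, y_tr), full.score(x_va, y_va))
print("depth-4    :", pruned.score(x_tr, y_tr), pruned.score(x_va, y_va))
```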
2. Random Forest Regressor
In [9]:
from sklearn.ensemble import RandomForestRegressor
forest=RandomForestRegressor().fit(x_train,y_train)
print("train set score:",forest.score(x_train,y_train))
print("valid set score:",forest.score(x_valid,y_valid))
train set score: 0.900956729032214
valid set score: 0.3363204369699776
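Beyond the score, a fitted forest exposes feature_importances_, which would show which chemical features drive its predictions. A sketch on synthetic data; the wine-like feature names are placeholder labels, not the real ranking:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem: 4 features, 2 of them actually informative.
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; sort to see which features the trees split on most.
names = ["alcohol", "sulphates", "pH", "density"]  # illustrative labels only
ranked = sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```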
3. Linear Regression
In [10]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression().fit(x_train,y_train)
print("train set score:",lr.score(x_train,y_train))
print("valid set score:",lr.score(x_valid,y_valid))
train set score: 0.29161488429619464
valid set score: 0.36601453007372914
4. Ridge
In [11]:
from sklearn.linear_model import Ridge
ridge=Ridge(alpha=1.0).fit(x_train,y_train)
print("train set score:",ridge.score(x_train,y_train))
print("valid set score:",ridge.score(x_valid,y_valid))
train set score: 0.290307367127563
valid set score: 0.35970104319477214
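alpha=1.0 above is Ridge's default penalty; larger values shrink the coefficient vector harder toward zero, which is why Ridge scores barely differ from plain LinearRegression here. A small sketch of that shrinkage on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=0)

# Same data, two penalty strengths: the coefficient norm shrinks with alpha.
small = Ridge(alpha=1.0).fit(X, y)
large = Ridge(alpha=1000.0).fit(X, y)

print(np.linalg.norm(small.coef_), np.linalg.norm(large.coef_))
```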
5. PolynomialFeatures
In [12]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
model=make_pipeline(PolynomialFeatures(2),LinearRegression())
model.fit(x_train,y_train)
print("train set score:",model.score(x_train,y_train))
print("valid set score:",model.score(x_valid,y_valid))
train set score: 0.3781033597574109
valid set score: 0.3124514229052976
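The degree-2 expansion explains the jump in train score: the 11 wine features become 78 columns (1 bias term, 11 linear terms, and 66 squares and pairwise products). A quick check of that count:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Any 11-column input works for counting the expanded features.
X = np.random.rand(5, 11)
expanded = PolynomialFeatures(2).fit_transform(X)
print(expanded.shape)  # 1 + 11 + 11*12/2 = 78 columns
```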
  6. Draw the final conclusion
In [13]:
print("test set score:",forest.score(x_test,y_test))
prediction=forest.predict(x_test)
test set score: 0.39770477884059297

With the RandomForestRegressor model, we can estimate wine quality with a test-set R² score of about 0.40 (the score printed above is a coefficient of determination, not an accuracy percentage).

In [14]:
comparison=pd.DataFrame(y_test)
comparison['my_predict']=y_test # caution: this assigns the true labels, not prediction, so both columns below are identical
comparison.head()
Out[14]:
quality my_predict
190 4.769441 4.769441
1597 4.697053 4.697053
211 5.866661 5.866661
103 4.713316 4.713316
553 5.485500 5.485500
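Because my_predict was built from y_test rather than from the model output, the two columns above match exactly. A sketch of the intended construction, with toy arrays standing in for y_test and forest.predict(x_test):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: in the notebook these come from the test split and the
# fitted forest's predict(x_test).
y_test = pd.Series([4.77, 4.70, 5.87], name="quality")
prediction = np.array([5.10, 4.52, 6.02])

comparison = pd.DataFrame(y_test)
comparison["my_predict"] = prediction  # model output, not the true labels
print(comparison)
```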

SVM

In [15]:
from sklearn.svm import SVR
model=SVR().fit(x_train,y_train)
print("train set score:",model.score(x_train,y_train))
print("valid set score:",model.score(x_valid,y_valid))
train set score: 0.1682234461740576
valid set score: 0.10022416782660304
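SVR's RBF kernel works on distances, so features on wildly different scales (e.g. total sulfur dioxide in the tens versus density near 1) distort it; standardizing first is a common remedy for low scores like these. A sketch on synthetic data, not the notebook's result:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 0] *= 1000.0                # one feature dominates raw distances
y = X[:, 0] / 1000.0 + X[:, 1]   # signal lives in two features

# StandardScaler inside the pipeline rescales every feature to mean 0, std 1
# before SVR ever sees it.
model = make_pipeline(StandardScaler(), SVR())
model.fit(X, y)
Xt = model.named_steps["standardscaler"].transform(X)
print(Xt.mean(), Xt.std())       # ~0 and ~1 after standardization
```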

Devising a better evaluation metric (categorization)

In [16]:
round(125.5)
Out[16]:
126
In [17]:
comparison=comparison.round() # same as round(comparison): round both columns to the nearest integer grade
comparison.head()
Out[17]:
quality my_predict
190 5.0 5.0
1597 5.0 5.0
211 6.0 6.0
103 5.0 5.0
553 5.0 5.0
In [18]:
evaluation=comparison['quality']==comparison['my_predict']
evaluation
Out[18]:
190     True
1597    True
211     True
103     True
553     True
        ... 
374     True
1590    True
127     True
406     True
612     True
Length: 320, dtype: bool
In [19]:
success=(evaluation == True).sum()
failure=(evaluation==False).sum()
print(success/(success+failure)) # prediction success rate
1.0

A success rate of exactly 1.0 is a red flag: my_predict was assigned from y_test rather than from the model's predictions, so this step compares the true labels with themselves. With my_predict set to forest.predict(x_test), the rounded-match rate would be substantially lower.
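The rounded-match idea itself is sound as a categorical metric; packaged as a small function and applied to hypothetical values, it looks like:

```python
import numpy as np

def rounded_accuracy(y_true, y_pred):
    """Fraction of predictions matching the true label after rounding
    both sides to the nearest integer quality grade."""
    return float(np.mean(np.round(y_true) == np.round(y_pred)))

# Hypothetical values: three of these four predictions round to the right grade.
y_true = np.array([4.77, 5.87, 5.49, 6.20])
y_pred = np.array([5.10, 6.02, 4.90, 6.60])
print(rounded_accuracy(y_true, y_pred))  # → 0.75
```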
In [ ]: