cs231n Notes 1: Image Classification with the KNN Algorithm

1. Download the dataset

Download the Python version of the CIFAR-10 archive and extract it into the ./input folder.

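2. Get familiar with the dataset

List the files it contains

    from subprocess import check_output

    # Run `ls` on the extracted folder and print its output as text
    print(check_output(["ls", "input/cifar-10-batches-py"]).decode())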
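`ls` only exists on Unix-like systems. A portable alternative is `os.listdir` from the standard library. The sketch below runs on a temporary directory with placeholder file names so that it works anywhere; with the real dataset you would pass `input/cifar-10-batches-py` instead. The helper name `list_files` is made up for this example.

```python
import os
import tempfile

def list_files(directory):
    """Return the sorted names of the entries in `directory`."""
    return sorted(os.listdir(directory))

# Demo on a throwaway directory (the file names are only placeholders).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["data_batch_1", "test_batch", "batches.meta"]:
        open(os.path.join(tmp, name), "w").close()
    listing = list_files(tmp)

print("\n".join(listing))
```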

Output

    batches.meta
    data_batch_1
    data_batch_2
    data_batch_3
    data_batch_4
    data_batch_5
    readme.html
    test_batch

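The batch files above are all pickled Python objects. Following the hint on the CIFAR-10 website, each one can be unpickled into a dictionary

    import pickle

    def unpickle(file):
        # encoding='bytes' makes the dictionary keys come back as bytes, e.g. b'data'
        with open(file, 'rb') as fo:
            batch_dict = pickle.load(fo, encoding='bytes')
        return batch_dict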
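The `encoding='bytes'` argument is why the keys later appear as `b'data'` rather than `'data'`. The sketch below illustrates the effect on a tiny hand-written protocol-0 pickle of the kind Python 2 produces; the byte string is made up for the demonstration and is not taken from the CIFAR-10 files.

```python
import pickle

# A protocol-0 pickle of {'data': 1} as Python 2 would write it (hand-written for illustration).
py2_pickle = b"(dp0\nS'data'\np1\nI1\ns."

as_bytes = pickle.loads(py2_pickle, encoding="bytes")  # old str objects stay bytes
as_text = pickle.loads(py2_pickle)                     # default encoding is ASCII, so keys become str

print(as_bytes)
print(as_text)
```

With bytes keys, a lookup has to be written as `batch[b'data']`; printing a key as text needs `.decode()`.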

Let's look at data_batch_1

    data_batch_1 = unpickle("input/cifar-10-batches-py/data_batch_1")
    data_batch_1.keys()

Output

    dict_keys([b'filenames', b'data', b'labels', b'batch_label'])

Check the type of each item

    for item in data_batch_1.keys():
        # Keys are bytes, so decode them before printing
        print(item.decode(), ':', type(data_batch_1[item]))

Output

    filenames : <class 'list'>
    data : <class 'numpy.ndarray'>
    labels : <class 'list'>
    batch_label : <class 'bytes'>

Check the shape of data and the length of labels

    print(data_batch_1[b'data'].shape)
    print(len(data_batch_1[b'labels']))

Output

    (10000, 3072)
    10000

In fact, data is a collection of 10000 samples. Each sample is a 32*32-pixel RGB image, so each sample is represented by 32*32*3 = 3072 numbers. One row of data is one sample and looks like this:

    array([..., 154, 255, ..., 250, ...], dtype=uint8)
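To view a row as an image, it has to be reshaped. The CIFAR-10 format stores the 1024 red values first, then the 1024 green values, then the 1024 blue values, each channel in row-major order. The sketch below applies that layout to a synthetic row (not real pixel data); the names `row` and `image` are only for illustration.

```python
import numpy as np

# Stand-in for one row of `data`: 3072 uint8 values (synthetic, not real pixels).
row = (np.arange(3072) % 256).astype(np.uint8)

# Reshape to (channel, height, width), then move the channel axis last
# to get the usual (32, 32, 3) image layout.
image = row.reshape(3, 32, 32).transpose(1, 2, 0)

print(image.shape)
```

The resulting array can be passed to an image viewer such as `matplotlib.pyplot.imshow`.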

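labels holds the labels of those 10000 samples. Its values range from 0 to 9; the class names can be read from batches.meta

    meta = unpickle("input/cifar-10-batches-py/batches.meta")
    meta[b'label_names']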

Output

    [b'airplane', b'automobile', b'bird', b'cat', b'deer', b'dog', b'frog', b'horse', b'ship', b'truck']

In short, the CIFAR-10 training set has 50000 samples, split across 5 data_batch files. There are also 10000 test samples in test_batch, which serve as the test set.

Next we unpickle all training and test samples and merge the 5 data_batch files into a 50000*3072 training set

    import os
    import numpy as np

    def load_CIFAR_batch(filepath):
        # Load one batch file and return (data, labels) as NumPy arrays
        datadict = unpickle(filepath)
        X = datadict[b'data'].astype('float')
        Y = np.array(datadict[b'labels'])
        return X, Y

    def load_CIFAR10(ROOT):
        xs = []
        ys = []
        for b in range(1, 6):
            f = os.path.join(ROOT, 'data_batch_%d' % (b,))
            X, Y = load_CIFAR_batch(f)
            xs.append(X)
            ys.append(Y)
        Xtr = np.concatenate(xs)  # stack the five 10000*3072 arrays in the list into one 50000*3072 array
        Ytr = np.concatenate(ys)
        Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
        return Xtr, Ytr, Xte, Yte

    cifar10_dir = 'input/cifar-10-batches-py'
    X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

    print('Training data shape: ', X_train.shape)
    print('Training labels shape: ', y_train.shape)
    print('Test data shape: ', X_test.shape)
    print('Test labels shape: ', y_test.shape)

Output

    Training data shape:  (50000, 3072)
    Training labels shape:  (50000,)
    Test data shape:  (10000, 3072)
    Test labels shape:  (10000,)

However, to keep the computation fast, we only use the first 5000 training samples and the first 500 test samples

    num_training = 5000
    mask = list(range(num_training))
    X_train = X_train[mask]
    y_train = y_train[mask]

    num_test = 500
    mask = list(range(num_test))
    X_test = X_test[mask]
    y_test = y_test[mask]

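3. Implementing the KNN algorithm

KNN has no real training phase: it simply memorizes all training samples. At prediction time it computes the distance from the test sample to every training sample (we use Euclidean distance), takes the K training samples with the smallest distances, and predicts the class that occurs most often among them.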
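The description above can be sketched in a few lines for a single test sample. This is a minimal illustration on made-up 2-D points, not the class built later in these notes; `knn_predict`, `X_toy` and `y_toy` are names invented for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of one sample `x` by majority vote among its k nearest neighbours."""
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distance to every training sample
    nearest = np.argsort(distances)[:k]                      # indices of the k smallest distances
    return np.argmax(np.bincount(y_train[nearest]))          # most frequent label among them

# Toy data: two clusters in 2-D (made up for illustration).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.5, 4.8]), k=3)
print(pred_near_origin, pred_near_five)
```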

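Since there are 500 test samples and 5000 training samples, we store the distances in a 500*5000 two-dimensional array. Each row holds the Euclidean distances from one test sample to the 5000 training samples.

The question now is how to compute this 500*5000 distance array. The assignment asks for three different methods: two loops, one loop, and no loops.

NumPy array operations are easy to get confused by, so let's work through a simplified example. Suppose X_train is 4*2 (in reality 5000*3072) and X_test is 3*2 (in reality 500*3072). Say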

$X_{train}=\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{bmatrix}$

$X_{test}=\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}$

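(this is only for notational convenience; never mind the data types). Whichever way we compute it, we must end up with the following 3*4 dists matrix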

$dists=\begin{bmatrix} (a-1)^2 + (b-2)^2 & (a-3)^2+(b-4)^2 & (a-5)^2+(b-6)^2 & (a-7)^2+(b-8)^2 \\ (c-1)^2 + (d-2)^2 & (c-3)^2+(d-4)^2 & (c-5)^2+(d-6)^2 & (c-7)^2+(d-8)^2 \\ (e-1)^2 + (f-2)^2 & (e-3)^2+(f-4)^2 & (e-5)^2+(f-6)^2 & (e-7)^2+(f-8)^2 \end{bmatrix}$

(and finally take the element-wise sqrt)

The two-loop version is straightforward, so here is the code directly

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                # Euclidean distance between test sample i and training sample j
                dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        return dists
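To check the two-loop logic against the small 3*4 example, here is a standalone sketch of the same computation as a plain function. The test values chosen for a, b, ..., f are arbitrary, and the function name is made up for this example.

```python
import numpy as np

def distances_two_loops(X_test, X_train):
    """Euclidean distance between every test row and every training row."""
    num_test, num_train = X_test.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i, j] = np.sqrt(np.sum(np.square(X_test[i] - X_train[j])))
    return dists

X_train_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)
X_test_toy = np.array([[1, 2], [0, 0], [7, 8]], dtype=float)  # arbitrary values for a, b, ..., f

dists_toy = distances_two_loops(X_test_toy, X_train_toy)
print(dists_toy.shape)
print(dists_toy)
```

The first test row equals the first training row, so the top-left entry is 0.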

The one-loop version is much the same, because NumPy broadcasts a single row against every row of a matrix

$\begin{bmatrix} a&b \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} a-1 & b-2 \\ a-3 & b-4 \\ a-5 & b-6 \\ a-7 & b-8 \end{bmatrix}$

Applying np.square to the result above gives

$\begin{bmatrix} (a-1)^2 & (b-2)^2 \\ (a-3)^2 & (b-4)^2 \\ (a-5)^2 & (b-6)^2 \\ (a-7)^2 & (b-8)^2 \end{bmatrix}$

Summing along the second dimension (axis=1) then gives the squared distances from one test sample to all training samples

$\begin{bmatrix} (a-1)^2 + (b-2)^2 \\ (a-3)^2 + (b-4)^2 \\ (a-5)^2 + (b-6)^2 \\ (a-7)^2 + (b-8)^2 \end{bmatrix}$

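Although it is written as a matrix above, the sum is actually a one-dimensional NumPy array whose shape is (4,), not (4,1).

So, after taking the sqrt, it can be assigned directly to dists[i].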
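The shapes involved can be checked on the small example. The sketch below uses the concrete values a = 1, b = 2 for the test sample, which is an arbitrary choice for illustration.

```python
import numpy as np

X_train_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
x = np.array([1, 2])                 # one test sample, playing the role of [a, b]

diff = x - X_train_toy               # x is broadcast against every row: shape (4, 2)
squared = np.square(diff)
row_sums = np.sum(squared, axis=1)   # sum over the feature axis: shape (4,)

print(diff.shape, row_sums.shape)
print(row_sums)
```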

Thus the one-loop code is

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            # X[i] is broadcast against every row of X_train
            dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        return dists

Now the last method: how do we do it with no loops at all?

The answer is to expand the square of the difference.
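Written out for one test row $x$ and one training row $y$, the expansion is

$\lVert x-y \rVert^2 = \sum_k (x_k-y_k)^2 = \sum_k x_k^2 - 2\sum_k x_k y_k + \sum_k y_k^2$

For the top-left entry of dists in the small example this reads

$(a-1)^2+(b-2)^2 = (a^2+b^2) - 2(a*1+b*2) + (1^2+2^2)$

so each entry splits into a term that depends only on the test row, a dot product between the two rows, and a term that depends only on the training row.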

First, since X_train is 4*2 and X_test is 3*2, to get a 3*4 result let's see what multiplying X_test by the transpose of X_train gives

$X_{test}X_{train}^T = \begin{bmatrix} a*1+b*2 & a*3+b*4 & a*5+b*6 & a*7+b*8 \\ c*1+d*2 & c*3+d*4 & c*5+d*6 & c*7+d*8 \\ e*1+f*2 & e*3+f*4 & e*5+f*6 & e*7+f*8 \end{bmatrix}$

Comparing this with dists, the difference is that, after multiplying the matrix above by -2, we add to each of its rows

$\begin{bmatrix} 1^2+2^2 & 3^2+4^2 & 5^2+6^2 & 7^2+8^2 \end{bmatrix}$

which can be obtained from X_train,

and add to each of its columns

$\begin{bmatrix} a^2+b^2\\ c^2+d^2\\ e^2+f^2 \end{bmatrix}$

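which can be obtained from X_test.

Code

    def compute_distances_no_loops(self, X):
        test_sums = np.sum(np.square(X), axis=1)               # shape (num_test,)
        train_sums = np.sum(np.square(self.X_train), axis=1)   # shape (num_train,)
        # train_sums is added to every row, test_sums (as a column) to every column
        return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))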
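A quick way to gain confidence in the expansion is to compare the vectorized result with the one-loop result on random data. The sketch below does this with standalone functions (names made up for the example). It also clips the squared distances at 0 before the sqrt; that clipping is an addition of this sketch, not part of the code above, and guards against tiny negative values caused by floating-point rounding.

```python
import numpy as np

def distances_one_loop(X_test, X_train):
    dists = np.zeros((X_test.shape[0], X_train.shape[0]))
    for i in range(X_test.shape[0]):
        dists[i] = np.sqrt(np.sum(np.square(X_test[i] - X_train), axis=1))
    return dists

def distances_no_loops(X_test, X_train):
    test_sums = np.sum(np.square(X_test), axis=1).reshape(-1, 1)  # shape (num_test, 1)
    train_sums = np.sum(np.square(X_train), axis=1)               # shape (num_train,)
    squared = -2 * np.dot(X_test, X_train.T) + train_sums + test_sums
    # Rounding can leave tiny negative values where the true distance is 0;
    # clipping at 0 (an addition, not in the original code) avoids NaN from sqrt.
    return np.sqrt(np.maximum(squared, 0))

rng = np.random.default_rng(0)
A = rng.random((3, 5))   # 3 "test" samples with 5 features
B = rng.random((4, 5))   # 4 "training" samples with 5 features

loop_result = distances_one_loop(A, B)
vector_result = distances_no_loops(A, B)
max_diff = np.max(np.abs(loop_result - vector_result))
print(max_diff)
```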

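That is all for the ways of computing distances.

With the distance matrix in hand, we can now make a prediction for each test sample.

To predict the i-th test sample, we need the indices of the K smallest distances in row i of the distance matrix, so that we can look up the corresponding classes in y_train. NumPy's argsort function does this.

For example, applying argsort to [8, 7, 10, 9] gives some [?, ?, ?, ?]

Since 7 is the smallest and its index is 1, we have [1, ?, ?, ?]

8 is the second smallest and its index is 0, so we have [1, 0, ?, ?]

...

In the end, the result is [1, 0, 3, 2]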
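The same example can be run directly; slicing the result with `[:k]` keeps the indices of the k smallest values (k = 2 here is an arbitrary choice).

```python
import numpy as np

values = np.array([8, 7, 10, 9])
order = np.argsort(values)   # indices that would sort `values` in ascending order
k = 2
smallest_k = order[:k]       # indices of the k smallest values

print(order)
print(smallest_k)
```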

So np.argsort(dists[i])[:k] gives exactly the indices we need. Finally, we only have to find the mode of

    y_train[np.argsort(dists[i])[:k]]

This is done with the bincount and argmax functions.

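What does bincount do? An example:

Applying bincount to [1, 5, 3, 4, 4]: since the largest number is 5, bincount counts how often each number from 0 to 5 occurs. The result is out = [0, 1, 0, 1, 2, 1], where out[i] = x means the number i occurred x times.

argmax, as the name suggests, returns the index of the largest value.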
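Combining the two gives the most frequent label. The sketch below reruns the example and also shows what happens on a tie, using a second made-up array: `np.argmax` returns the first maximum, so the smaller label wins.

```python
import numpy as np

labels = np.array([1, 5, 3, 4, 4])
counts = np.bincount(labels)   # counts[i] is how often i occurs
mode = np.argmax(counts)       # index of the largest count, i.e. the most frequent label

# On a tie, argmax returns the first maximum, so the smaller label wins.
tie = np.argmax(np.bincount(np.array([2, 7, 7, 2])))

print(counts, mode, tie)
```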

So the prediction algorithm looks like this

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # Labels of the k nearest training samples
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            # Most frequent label among them
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
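A full sort of each row does more work than needed, since only the k smallest distances matter and their order among themselves does not affect the vote. As an alternative (not what the code above uses), `np.argpartition` finds the k smallest entries without sorting the rest. The sketch below compares both on a made-up row of distinct values.

```python
import numpy as np

rng = np.random.default_rng(0)
row = rng.permutation(20).astype(float)   # stand-in for one row of dists, all values distinct
k = 5

by_sort = np.argsort(row)[:k]                   # full sort, then take the first k
by_partition = np.argpartition(row, k - 1)[:k]  # the k smallest come first, in no particular order

print(sorted(by_sort.tolist()), sorted(by_partition.tolist()))
```

Both select the same set of indices; only the order inside the selection may differ.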

Finally, the complete KNN class is

    import numpy as np

    class KNearestNeighbor(object):
        def __init__(self):
            pass

        def train(self, X, y):
            # KNN only memorizes the training data
            self.X_train = X
            self.y_train = y

        def predict(self, X, k=1, num_loops=0):
            if num_loops == 0:
                dists = self.compute_distances_no_loops(X)
            elif num_loops == 1:
                dists = self.compute_distances_one_loop(X)
            elif num_loops == 2:
                dists = self.compute_distances_two_loops(X)
            else:
                raise ValueError('Invalid value %d for num_loops' % num_loops)
            return self.predict_labels(dists, k=k)

        def compute_distances_two_loops(self, X):
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                for j in range(num_train):
                    dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
            return dists

        def compute_distances_one_loop(self, X):
            num_test = X.shape[0]
            num_train = self.X_train.shape[0]
            dists = np.zeros((num_test, num_train))
            for i in range(num_test):
                dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
            return dists

        def compute_distances_no_loops(self, X):
            test_sums = np.sum(np.square(X), axis=1)
            train_sums = np.sum(np.square(self.X_train), axis=1)
            return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums + test_sums.reshape(-1, 1))

        def predict_labels(self, dists, k=1):
            num_test = dists.shape[0]
            y_pred = np.zeros(num_test)
            for i in range(num_test):
                closest_y = self.y_train[np.argsort(dists[i])[:k]]
                y_pred[i] = np.argmax(np.bincount(closest_y))
            return y_pred

Let's run it

    classifier = KNearestNeighbor()
    classifier.train(X_train, y_train)
    y_test_pred = classifier.predict(X_test)

    # Count the predictions that match the true labels
    num_correct = np.sum(y_test_pred == y_test)
    accuracy = float(num_correct) / num_test
    print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))

Result

    Got 137 / 500 correct => accuracy: 0.274000

An accuracy of 27%, which is at least better than random guessing (10%).

4. Choosing a suitable K

Use cross-validation.

Split the training set into num_folds folds

    num_folds = 5
    X_train_folds = np.array_split(X_train, num_folds)
    y_train_folds = np.array_split(y_train, num_folds)
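`np.array_split` returns a list of arrays and, unlike `np.split`, also accepts lengths that do not divide evenly. A minimal sketch on a made-up array of 10 elements:

```python
import numpy as np

data = np.arange(10)
folds = np.array_split(data, 3)   # the first folds get one extra element when the split is uneven
fold_lists = [fold.tolist() for fold in folds]

print(fold_lists)
```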

Try a range of different k values. The list below follows the cs231n assignment notebook.

    k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

    k_to_accuracies = {}
    for k in k_choices:
        k_to_accuracies.setdefault(k, [])

    for i in range(num_folds):
        classifier = KNearestNeighbor()
        # Train on every fold except fold i; list slicing skips the held-out fold
        X_val_train = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
        y_val_train = np.vstack(y_train_folds[0:i] + y_train_folds[i+1:])
        y_val_train = y_val_train.flatten()  # vstack turns the 1-D label folds into rows; flatten them back
        classifier.train(X_val_train, y_val_train)
        for k in k_choices:
            # Validate on fold i
            y_val_pred = classifier.predict(X_train_folds[i], k=k)
            num_correct = np.sum(y_val_pred == y_train_folds[i].flatten())
            accuracy = float(num_correct) / len(y_val_pred)
            k_to_accuracies[k] = k_to_accuracies[k] + [accuracy]
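The key step in the loop is building the training part from all folds except the held-out one. The sketch below shows that step in isolation on made-up data. It uses `np.concatenate` instead of `np.vstack`, which is a substitution for illustration: on one-dimensional label arrays it returns a flat array directly, so no `flatten()` is needed.

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # 6 toy samples with 2 features
y = np.array([0, 1, 0, 1, 0, 1])
X_folds = np.array_split(X, 3)
y_folds = np.array_split(y, 3)

i = 1                             # index of the fold held out for validation
X_rest = np.concatenate(X_folds[:i] + X_folds[i + 1:])   # list slicing skips fold i
y_rest = np.concatenate(y_folds[:i] + y_folds[i + 1:])

print(X_rest.shape, y_rest.shape)
print(X_folds[i], y_folds[i])
```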

Print the results

    for k in sorted(k_to_accuracies):
        for accuracy in k_to_accuracies[k]:
            print('k = %d, accuracy = %f' % (k, accuracy))

Result

    k = 1, accuracy = 0.263000
    k = 1, accuracy = 0.257000
    k = 1, accuracy = 0.264000
    k = 1, accuracy = 0.278000
    k = 1, accuracy = 0.266000
    k = 3, accuracy = 0.239000
    k = 3, accuracy = 0.249000
    k = 3, accuracy = 0.240000
    k = 3, accuracy = 0.266000
    k = 3, accuracy = 0.254000
    k = 5, accuracy = 0.248000
    k = 5, accuracy = 0.266000
    k = 5, accuracy = 0.280000
    k = 5, accuracy = 0.292000
    k = 5, accuracy = 0.280000
    k = 8, accuracy = 0.262000
    k = 8, accuracy = 0.282000
    k = 8, accuracy = 0.273000
    k = 8, accuracy = 0.290000
    k = 8, accuracy = 0.273000
    k = 10, accuracy = 0.265000
    k = 10, accuracy = 0.296000
    k = 10, accuracy = 0.276000
    k = 10, accuracy = 0.284000
    k = 10, accuracy = 0.280000
    k = 12, accuracy = 0.260000
    k = 12, accuracy = 0.295000
    k = 12, accuracy = 0.279000
    k = 12, accuracy = 0.283000
    k = 12, accuracy = 0.280000
    k = 15, accuracy = 0.252000
    k = 15, accuracy = 0.289000
    k = 15, accuracy = 0.278000
    k = 15, accuracy = 0.282000
    k = 15, accuracy = 0.274000
    k = 20, accuracy = 0.270000
    k = 20, accuracy = 0.279000
    k = 20, accuracy = 0.279000
    k = 20, accuracy = 0.282000
    k = 20, accuracy = 0.285000
    k = 50, accuracy = 0.271000
    k = 50, accuracy = 0.288000
    k = 50, accuracy = 0.278000
    k = 50, accuracy = 0.269000
    k = 50, accuracy = 0.266000
    k = 100, accuracy = 0.256000
    k = 100, accuracy = 0.270000
    k = 100, accuracy = 0.263000
    k = 100, accuracy = 0.256000
    k = 100, accuracy = 0.263000

None of the accuracies above exceeds 30%.

Visualize the results

    import matplotlib.pyplot as plt

    plt.rcParams['figure.figsize'] = (10.0, 8.0)  # figure size in inches (assumed value)

    # plot the raw observations
    for k in k_choices:
        accuracies = k_to_accuracies[k]
        plt.scatter([k] * len(accuracies), accuracies)

    # plot the trend line with error bars that correspond to standard deviation
    accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
    accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
    plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
    plt.title('Cross-validation on k')
    plt.xlabel('k')
    plt.ylabel('Cross-validation accuracy')
    plt.show()

Judging by the mean accuracy, k=10 is the best choice. However, its standard deviation is also fairly large.
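The comparison by mean and standard deviation can also be done numerically instead of reading it off the plot. The sketch below uses two groups of five per-fold accuracies copied from the results above. The k = 100 group is labelled in the output; which group of five belongs to k = 10 is an assumption. The names `fold_accuracies`, `summary` and `best_k` are made up for the example.

```python
import numpy as np

# Per-fold accuracies copied from the results above. The k = 100 group is labelled
# in the output; which group belongs to k = 10 is an assumption.
fold_accuracies = {
    10: [0.265, 0.296, 0.276, 0.284, 0.280],
    100: [0.256, 0.270, 0.263, 0.256, 0.263],
}

# Mean and standard deviation of the accuracy for each k
summary = {k: (np.mean(v), np.std(v)) for k, v in fold_accuracies.items()}
best_k = max(summary, key=lambda k: summary[k][0])   # k with the highest mean accuracy

for k, (mean, std) in sorted(summary.items()):
    print('k = %d: mean = %.4f, std = %.4f' % (k, mean, std))
print('best k:', best_k)
```

Extending `fold_accuracies` with the remaining groups would rank every candidate k in the same way.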
