1. Download the dataset

Download the Python version of the CIFAR-10 archive and extract it into the ./input folder.
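If you prefer to script this step, here is a minimal Python sketch (it assumes the download URL published on the official CIFAR-10 page; downloading and extracting by hand works just as well):

```python
import tarfile
import urllib.request

# Assumed download URL from the official CIFAR-10 page.
url = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
urllib.request.urlretrieve(url, 'cifar-10-python.tar.gz')
with tarfile.open('cifar-10-python.tar.gz') as tar:
    tar.extractall('input')  # the archive contains cifar-10-batches-py/
```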

2. Getting familiar with the dataset

Let's see which files we have:

```python
from subprocess import check_output
print(check_output(["ls", "input/cifar-10-batches-py"]).decode())
```

Output:

```
batches.meta
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
readme.html
test_batch
```

The batch files above are pickled Python objects. Following the instructions on the CIFAR-10 website, we can unpickle each of them into a dictionary:

```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```

Let's look at data_batch_1:

```python
data_batch_1 = unpickle("input/cifar-10-batches-py/data_batch_1")
data_batch_1.keys()
```

Output:

```
dict_keys([b'filenames', b'data', b'labels', b'batch_label'])
```

Check the type of each item:

```python
for k in data_batch_1.keys():
    print(k.decode(), ':', type(data_batch_1[k]))
```

Output:

```
filenames : <class 'list'>
data : <class 'numpy.ndarray'>
labels : <class 'list'>
batch_label : <class 'bytes'>
```

Check the shape of data and the length of labels:

```python
print(data_batch_1[b'data'].shape)
print(len(data_batch_1[b'labels']))
```

Output:

```
(10000, 3072)
10000
```

In fact, data holds 10000 samples. Each sample is a 32*32-pixel RGB image, so it is represented by 32*32*3 = 3072 numbers. One row of data is one sample, and looks like this:

```
array([ 59, 154, 255, ...,  71, 250,  62], dtype=uint8)
```
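As a sanity check, a row can be turned back into a viewable image. Here is a minimal sketch, assuming matplotlib is available; per the CIFAR-10 docs, the 3072 values are stored channel by channel (1024 red, then 1024 green, then 1024 blue), so we reshape to (3, 32, 32) and transpose to height*width*channel:

```python
import matplotlib.pyplot as plt

# CIFAR-10 stores each image channel-by-channel: 1024 red values,
# then 1024 green, then 1024 blue, row-major within each channel.
row = data_batch_1[b'data'][0]                   # shape (3072,)
img = row.reshape(3, 32, 32).transpose(1, 2, 0)  # -> (32, 32, 3)
plt.imshow(img)
plt.show()
```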

labels holds the labels of the 10000 samples. Each label is an integer from 0 to 9; the actual class names can be looked up in batches.meta:

```python
meta = unpickle('input/cifar-10-batches-py/batches.meta')
meta[b'label_names']
```

Output:

```
[b'airplane', b'automobile', b'bird', b'cat', b'deer', b'dog', b'frog', b'horse', b'ship', b'truck']
```

In summary, the CIFAR-10 training set has 50000 samples in total, split across the five data_batch files, plus 10000 test samples in test_batch that serve as the test set.

Next, we unpickle all training and test samples and merge the five data batches into a single 50000*3072 training set:

```python
import os
import numpy as np

def load_CIFAR_batch(filepath):
    datadict = unpickle(filepath)
    X = datadict[b'data'].astype('float')
    Y = datadict[b'labels']
    Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b,))
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # stack the five 10000*3072 arrays into one 50000*3072 array
    Ytr = np.concatenate(ys)
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte
```

```python
cifar10_dir = 'input/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
```

Output:

```
Training data shape:  (50000, 3072)
Training labels shape:  (50000,)
Test data shape:  (10000, 3072)
Test labels shape:  (10000,)
```

However, to speed up training, we'll settle for 5000 training samples and 500 test samples:

```python
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]
```

3. Implementing the KNN algorithm

KNN has no real training phase: it simply memorizes all of the training samples. At prediction time, it computes the distances from a test sample to all training samples (we will use Euclidean distance), takes the K training samples with the smallest distances, and predicts the class that occurs most often among them.

Since there are 500 test samples and 5000 training samples, we use a 500*5000 2-D array to store the distances; each row holds one test sample's Euclidean distances to all 5000 training samples.

Now the question is: how do we compute this 500*5000 distance array? The assignment asks for three different methods: two loops, one loop, and no loops.

Since numpy array operations are easy to get dizzy over, let's work through a simplified example first. Suppose X_train is 4*2 (instead of 5000*3072) and X_test is 3*2 (instead of 500*3072). Say

X_{train}=\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{bmatrix}

X_{test}=\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}

(the letters are just for notational convenience, never mind the data types). However we compute it, we must end up with the following 3*4 dists matrix:

dists=\begin{bmatrix} (a-1)^2 + (b-2)^2 & (a-3)^2+(b-4)^2 & (a-5)^2+(b-6)^2 & (a-7)^2+(b-8)^2 \\ (c-1)^2 + (d-2)^2 & (c-3)^2+(d-4)^2 & (c-5)^2+(d-6)^2 & (c-7)^2+(d-8)^2 \\ (e-1)^2 + (f-2)^2 & (e-3)^2+(f-4)^2 & (e-5)^2+(f-6)^2 & (e-7)^2+(f-8)^2 \end{bmatrix}

(followed by a final sqrt).

The two-loop version is naive, so here is the code directly:

```python
def compute_distances_two_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
    return dists
```

The one-loop version is actually much the same, because in numpy

\begin{bmatrix} a&b \end{bmatrix} - \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} a-1 & b-2 \\ a-3 & b-4 \\ a-5 & b-6 \\ a-7 & b-8 \end{bmatrix}

Applying np.square to this result gives

\begin{bmatrix} (a-1)^2 & (b-2)^2 \\ (a-3)^2 & (b-4)^2 \\ (a-5)^2 & (b-6)^2 \\ (a-7)^2 & (b-8)^2 \end{bmatrix}

Summing along the second dimension (axis=1) then yields one test sample's distances to all of the training samples:

\begin{bmatrix} (a-1)^2 + (b-2)^2 \\ (a-3)^2 + (b-4)^2 \\ (a-5)^2 + (b-6)^2 \\ (a-7)^2 + (b-8)^2 \end{bmatrix}

Although written as a matrix above, the sum actually produces a 1-D numpy array whose shape is (4,) rather than (4,1), so it can be assigned directly to dists[i].
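A quick toy check of the broadcasting above (a sketch, with concrete numbers standing in for a and b):

```python
import numpy as np

X_train_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=float)
x = np.array([10, 20], dtype=float)  # one test sample, standing in for [a, b]

d = np.sqrt(np.sum(np.square(x - X_train_toy), axis=1))
print(d.shape)  # (4,) -- a 1-D array, so it can be assigned to dists[i]
print(d)        # distances from x to each of the 4 training rows
```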

With that, the one-loop code is:

```python
def compute_distances_one_loop(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
    return dists
```

Finally, how do we manage with no loops at all?

The answer is to expand the squared differences.

Since X_train is 4*2 and X_test is 3*2, and we want a 3*4 result, let's first multiply X_test by the transpose of X_train:

X_{test}X_{train}^T = \begin{bmatrix} a*1+b*2 & a*3+b*4 & a*5+b*6 & a*7+b*8 \\ c*1+d*2 & c*3+d*4 & c*5+d*6 & c*7+d*8 \\ e*1+f*2 & e*3+f*4 & e*5+f*6 & e*7+f*8 \end{bmatrix}

Comparing this with dists, the difference is: after multiplying the matrix above by -2, add to each of its rows

\begin{bmatrix} 1^2+2^2 & 3^2+4^2 & 5^2+6^2 & 7^2+8^2 \end{bmatrix}

which can be obtained from X_train, and add to each of its columns

\begin{bmatrix} a^2+b^2\\ c^2+d^2\\ e^2+f^2 \end{bmatrix}

which can be obtained from X_test.

The code:

```python
def compute_distances_no_loops(self, X):
    test_sums = np.sum(np.square(X), axis=1)
    train_sums = np.sum(np.square(self.X_train), axis=1)
    return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums
                   + test_sums.reshape(-1, 1))
```
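To convince ourselves the broadcasting in the return line does the right thing (train_sums, shape (num_train,), is added across each row; test_sums.reshape(-1, 1), shape (num_test, 1), across each column), here is a small self-contained check against a naive computation, a sketch with random stand-in data:

```python
import numpy as np

A = np.random.rand(3, 2)  # stand-in for X_test
B = np.random.rand(4, 2)  # stand-in for X_train

# naive: pairwise differences via an explicit extra axis
naive = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
# vectorized: the expanded-square formula used above
fast = np.sqrt(-2 * A @ B.T + (B ** 2).sum(axis=1)
               + (A ** 2).sum(axis=1).reshape(-1, 1))
print(np.allclose(naive, fast))  # True
```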

That's it for computing distances.

With the distance matrix in hand, we can now make a prediction for each test sample.

To predict the i-th sample, we need the indices of the K smallest distances in row i of the distance matrix, so that we can look up the corresponding classes in y_train. numpy's argsort function does exactly this.

For example, argsort of [8, 7, 10, 9] yields some permutation [?, ?, ?, ?].

Since 7 is the smallest, and its index is 1, we have [1, ?, ?, ?].

8 is the next smallest, with index 0, so [1, 0, ?, ?].

And so on; the final result is [1, 0, 3, 2].
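We can verify this directly:

```python
import numpy as np

print(np.argsort([8, 7, 10, 9]))  # [1 0 3 2]
```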

So argsort(dists[i])[:k] gives exactly the indices we need. Finally, we only need to find the mode of

```python
y_train[np.argsort(dists[i])[:k]]
```

which we do with the bincount and argmax functions.

What does bincount do? An example: applying bincount to [1, 5, 3, 4, 4], since the largest value is 5, counts how many times each of the numbers 0~5 occurs. The result is out = [0, 1, 0, 1, 2, 1], where out[i] = x means the number i occurred x times.

And argmax, as the name suggests, finds the index of the largest value.
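A quick demonstration of both:

```python
import numpy as np

counts = np.bincount([1, 5, 3, 4, 4])
print(counts)             # [0 1 0 1 2 1]
print(np.argmax(counts))  # 4, the mode of the list
```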

So the prediction routine looks like this:

```python
def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
        closest_y = self.y_train[np.argsort(dists[i])[:k]]
        y_pred[i] = np.argmax(np.bincount(closest_y))
    return y_pred
```

Putting it all together, the complete KNN class is:

```python
class KNearestNeighbor(object):

    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            dists[i] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
        return dists

    def compute_distances_no_loops(self, X):
        test_sums = np.sum(np.square(X), axis=1)
        train_sums = np.sum(np.square(self.X_train), axis=1)
        return np.sqrt(-2 * np.dot(X, self.X_train.T) + train_sums
                       + test_sums.reshape(-1, 1))

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            closest_y = self.y_train[np.argsort(dists[i])[:k]]
            y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
```

Let's run it:

```python
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
```

Result:

```
Got 137 / 500 correct => accuracy: 0.274000
```

27% accuracy: at least it beats random guessing (10%).

4. Choosing a suitable K

We use cross-validation. First, split the training set into num_folds folds:

```python
num_folds = 5
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
```

Then try a series of different k values:

```python
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

k_to_accuracies = {}
for k_ in k_choices:
    k_to_accuracies.setdefault(k_, [])

for i in range(num_folds):
    classifier = KNearestNeighbor()
    # train on every fold except fold i, which is held out for validation
    X_val_train = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
    y_val_train = np.vstack(y_train_folds[0:i] + y_train_folds[i+1:])
    y_val_train = y_val_train.flatten()
    classifier.train(X_val_train, y_val_train)
    for k_ in k_choices:
        y_val_pred = classifier.predict(X_train_folds[i], k=k_)
        num_correct = np.sum(y_val_pred == y_train_folds[i].flatten())
        accuracy = float(num_correct) / len(y_val_pred)
        k_to_accuracies[k_] = k_to_accuracies[k_] + [accuracy]
```

Print the results:

```python
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
```

Result:

```
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.239000
k = 3, accuracy = 0.249000
k = 3, accuracy = 0.240000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.254000
k = 5, accuracy = 0.248000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.280000
k = 5, accuracy = 0.292000
k = 5, accuracy = 0.280000
k = 8, accuracy = 0.262000
k = 8, accuracy = 0.282000
k = 8, accuracy = 0.273000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.273000
k = 10, accuracy = 0.265000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.284000
k = 10, accuracy = 0.280000
k = 12, accuracy = 0.260000
k = 12, accuracy = 0.295000
k = 12, accuracy = 0.279000
k = 12, accuracy = 0.283000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.252000
k = 15, accuracy = 0.289000
k = 15, accuracy = 0.278000
k = 15, accuracy = 0.282000
k = 15, accuracy = 0.274000
k = 20, accuracy = 0.270000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.279000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.285000
k = 50, accuracy = 0.271000
k = 50, accuracy = 0.288000
k = 50, accuracy = 0.278000
k = 50, accuracy = 0.269000
k = 50, accuracy = 0.266000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.270000
k = 100, accuracy = 0.263000
k = 100, accuracy = 0.256000
k = 100, accuracy = 0.263000
```

None of the accuracies above exceed 30%.

Visualize the results:

```python
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (10, 8)

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
```

[Figure: Cross-validation on k — scatter of per-fold accuracies with the mean ± std trend line]

Judging by the means, k = 10 has the highest mean accuracy, though its standard deviation there is also fairly large.
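As a possible final step (a sketch, not part of the original run, so no output is shown): pick the k with the highest mean cross-validation accuracy and evaluate it on the test set:

```python
# choose the k whose mean cross-validation accuracy is highest (k = 10 here)
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)
accuracy = float(np.sum(y_test_pred == y_test)) / num_test
print('k = %d, test accuracy = %f' % (best_k, accuracy))
```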