import numpy as np
from keras import backend as K
from keras import optimizers
from keras.layers import Activation
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.convolutional import ZeroPadding2D
from keras.layers.normalization import BatchNormalization
from keras.models import Model
from keras.utils import np_utils
from keras.utils.layer_utils import convert_all_kernels_in_model
from utils import local_response_normalization
We will be recreating the well-known CaffeNet implementation here as a single-stream Sequential model. The code for the original two-stream (two-GPU) AlexNet can be found in the GitHub repo.
" ...they are virtually indistinguishable" - Evan Shelhamer, Caffe lead developer.
model = Sequential()
"The first convolutional layer filters the 224×224×3 input image with 96 kernels of size 11×11×3 with a stride of 4 pixels" - Sec. 3.5 para 3.
We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” (O) is given by
O = (W−F+2P)/S+1
For conv layer 1, the architecture diagram in the paper shows an output volume of size 55x55x96.
If the input image is 224x224x3, then W = 224, F = 11, S = 4 and P = 0 (padding is not mentioned in the paper).
O = ((224 - 11 + 0)/4) + 1 = 54.25, which is not even an integer, so something must be wrong!
“The other authors were Ilya Sutskever and Geoffrey Hinton. So, AlexNet input starts with 227 by 227 by 3 images. And if you read the paper, the paper refers to 224 by 224 by 3 images. But if you look at the numbers, I think that the numbers make sense only if actually 227 by 227.” - Andrew Ng
O = ((227 - 11 + 0)/4) + 1 = (216/4) + 1 = 55
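A tiny helper makes this arithmetic easy to repeat for every layer. This is a minimal sketch; `conv_output_size` is an illustrative name, not a Keras function. It returns a float on purpose, so a stride that does not tile the input evenly shows up as a fractional size instead of being silently rounded.

```python
def conv_output_size(w, f, s, p=0):
    """Spatial output size O = (W - F + 2P)/S + 1 of a conv or pooling layer.

    Kept as a float so that a non-integer result (a layer that does not
    tile its input evenly) stays visible.
    """
    return (w - f + 2 * p) / s + 1


# Conv 1 with the input size from the paper vs. the one that makes the numbers work
for w in (224, 227):
    o = conv_output_size(w, f=11, s=4, p=0)
    print(f"W={w}: O={o}, integer={o.is_integer()}")
```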
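# Keras defaults to channels_last, i.e. (height, width, channels); (3, 227, 227) only works with channels_first
input_shape = (3, 227, 227) if K.image_data_format() == 'channels_first' else (227, 227, 3)
model.add(Conv2D(filters=96, input_shape=input_shape, kernel_size=(11,11), strides=(4,4), padding='valid', name='Conv1'))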
model.summary()
The number of parameters in this layer is (filter_height * filter_width * input_channels + 1) * number_of_filters
Therefore, params = (11x11x3 + 1) x 96 = 34944
The "+ 1" is for the biases.
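The same count as a small function, assuming an ordinary (ungrouped) convolution with one bias per filter. `conv_params` is an illustrative helper, not part of Keras.

```python
def conv_params(filter_h, filter_w, in_channels, n_filters):
    """Weights plus one bias per filter for a standard conv layer."""
    weights = filter_h * filter_w * in_channels * n_filters
    biases = n_filters
    return weights + biases


print(conv_params(11, 11, 3, 96))  # 34944, which should match the Conv1 row of model.summary()
```

For later layers only `in_channels` changes; for example, Conv2 in this model sees all 96 maps produced by Conv1.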
"The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer." - Sec 3.5 Para 2
"Response-normalization layers follow the first and second convolutional layers." - Sec 3.5 Para 2
"Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer" - Sec 3.5 Para 2
More on this in the slides.
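The `local_response_normalization` layer used below comes from the repo's own `utils` module, which is not shown here. As a rough guide to what it computes, here is a NumPy sketch of the cross-channel normalization defined in Sec. 3.3 of the paper, with the paper's constants k = 2, n = 5, alpha = 1e-4 and beta = 0.75. `local_response_norm` is an illustrative name and the channels-first array layout is an assumption. Implementations differ in details (Caffe, for instance, divides alpha by n), so the repo's helper may not match this exactly.

```python
import numpy as np


def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel LRN as described in Sec. 3.3 of the AlexNet paper.

    a: activations with shape (channels, height, width).
    Channel i is divided by (k + alpha * sum of a**2 over the n
    neighbouring channels centred on i) ** beta, clipped at the edges.
    """
    n_channels = a.shape[0]
    half = n // 2
    squared = a ** 2
    b = np.empty_like(a)
    for i in range(n_channels):
        lo = max(0, i - half)
        hi = min(n_channels - 1, i + half)
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b


# Random non-negative activations shaped like the Conv1 output (after ReLU)
x = np.random.default_rng(0).random((96, 55, 55)).astype(np.float32)
y = local_response_norm(x)
print(y.shape)  # (96, 55, 55)
```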
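The output size of a pooling layer follows the same formula: O = ((W - F + 2P)/S) + 1
For the first pooling layer, W = 55, F = 3, P = 0, S = 2 (3x3 windows with stride 2, the overlapping pooling of Sec. 3.4).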
Therefore O = ((55 - 3 + 0)/2) + 1 = 27
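A plain NumPy sketch of that pooling step, to show why the windows overlap: with a 3x3 window and stride 2, neighbouring windows share a row or column. `max_pool2d` is an illustrative name; Keras' `MaxPooling2D` does the same thing per channel.

```python
import numpy as np


def max_pool2d(x, size=3, stride=2):
    """'Valid' max pooling over a single (height, width) map; windows may overlap."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


# One 55x55 feature map whose values increase left to right, top to bottom
feature_map = np.arange(55 * 55, dtype=np.float32).reshape(55, 55)
pooled = max_pool2d(feature_map)
print(pooled.shape)  # (27, 27)
```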
model.add(Activation('relu'))
model.add(local_response_normalization(name="LRN1"))
# model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid', name="MaxPool"))
model.summary()
"The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5×5×48." Sec. 3.5 para 3
O = (W−F+2P)/S+1
After the second pooling layer we want a 13x13 output. We don't know the size of the intermediate output (after conv 2 but before pooling), only that it has to be an integer. We know F = 5 and W = 27, but not S or P, so let's investigate.
O = (27 - 5 + 2P)/S + 1 needs to be an integer.
Let's assume a stride of 2 and a padding of 1 for the conv: O = ((27 - 5 + 2)/2) + 1 = 13. The map would already be 13x13 before pooling, and the 3x3, stride-2 pooling would then shrink it to (13 - 3)/2 + 1 = 6. So this is wrong as well.
So let's assume a stride of 1 and no padding instead.
O = (27 - 5 + 0)/1 + 1 = 23
Now if we apply pooling,
O = (23 - 3 + 0)/2 + 1 = 11. Still not 13!
But if we then zero-pad this 11x11 output with 1 pixel on each side, we get 11 + 2 = 13.
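A brute-force search over small strides and paddings makes the reasoning above concrete. It is a sketch; `conv_output_size` is an illustrative helper. The search also turns up a combination not tried above: stride 1 with padding 2 keeps the map at 27x27, so the pooling alone yields 13x13 without an extra padding layer (to my knowledge this matches the padding the reference Caffe definition uses for conv2). The last lines check the route taken in the next cell instead: valid conv, pooling, then 1-pixel zero padding.

```python
def conv_output_size(w, f, s, p=0):
    """O = (W - F + 2P)/S + 1, kept as a float to expose non-integer sizes."""
    return (w - f + 2 * p) / s + 1


# Which (stride, padding) pairs for conv 2 give an integer size that the
# 3x3, stride-2 pooling then maps to exactly 13?
solutions = []
for s in (1, 2, 3):
    for p in (0, 1, 2, 3):
        conv_out = conv_output_size(27, f=5, s=s, p=p)
        if not conv_out.is_integer():
            continue
        pool_out = conv_output_size(conv_out, f=3, s=2)
        if pool_out == 13:
            solutions.append((s, p, int(conv_out)))
print(solutions)  # [(1, 2, 27)]

# Route used in the next cell: valid conv, pool, then pad 1 pixel on each side
after_conv = conv_output_size(27, f=5, s=1, p=0)    # 23.0
after_pool = conv_output_size(after_conv, f=3, s=2)  # 11.0
print(after_pool + 2 * 1)  # 13.0
```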
model.add(Conv2D(filters=256, kernel_size=(5,5), padding='valid', name="Conv2"))
model.add(Activation('relu'))
model.add(local_response_normalization(name="LRN2"))
# model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), name="MaxPool2"))
model.add(ZeroPadding2D(padding=(1, 1)))
model.summary()
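The paper's "5×5×48" kernels (and the "3×3×192" kernels of the fourth and fifth layers quoted further on) are half as deep as the 96 and 384 maps the previous layers produce. That is because the original network was split across two GPUs, and per Sec. 3.5 the kernels of the second, fourth and fifth layers only see the maps on their own GPU. The single-stream model built here connects every filter to all input maps instead. A rough sketch of what that does to the parameter counts; `conv_params` and its `groups` argument are illustrative, not Keras APIs.

```python
def conv_params(filter_h, filter_w, in_channels, n_filters, groups=1):
    """Parameter count of a conv layer whose channels are split into `groups`.

    Each group sees in_channels // groups input maps and produces
    n_filters // groups output maps; every filter has one bias.
    """
    assert in_channels % groups == 0 and n_filters % groups == 0
    per_filter = filter_h * filter_w * (in_channels // groups) + 1
    return per_filter * n_filters


layers = {
    # name: (kernel size, input maps, filters) as in the single-stream model
    "Conv2": (5, 96, 256),
    "Conv4": (3, 384, 384),
    "Conv5": (3, 384, 256),
}
for name, (k, c_in, c_out) in layers.items():
    two_stream = conv_params(k, k, c_in, c_out, groups=2)
    single_stream = conv_params(k, k, c_in, c_out, groups=1)
    print(f"{name}: two-stream {two_stream}, single-stream {single_stream}")
```

The single-stream numbers are what `model.summary()` should report for these layers; roughly twice the two-stream counts.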
"The third convolutional layer has 384 kernels of size 3×3×256 connected to the (normalized, pooled) outputs of the second convolutional layer." - Sec 3.5 Para 3
"The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers." - Sec 3.5 Para 3
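# Conv3 needs its own 1-pixel zero padding so the 3x3 kernels keep the map at 13x13 (13 + 2 - 3 + 1 = 13);
# without it the map shrinks to 11x11 here and MaxPool3 ends up at 5x5 instead of the paper's 6x6
model.add(ZeroPadding2D(padding=(1, 1)))
model.add(Conv2D(filters=384, kernel_size=(3,3), padding='valid', name="Conv3"))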
model.add(Activation('relu'))
model.summary()
model.add(ZeroPadding2D(padding=(1, 1)))
model.summary()
"The fourth convolutional layer has 384 kernels of size 3×3×192, and the fifth convolutional layer has 256 kernels of size 3×3×192." - Sec 3.5 Para 3
model.add(Conv2D(filters=384, kernel_size=(3,3), padding='valid', name="Conv4"))
model.add(Activation('relu'))
model.add(ZeroPadding2D(padding=(1, 1)))
model.add(Conv2D(filters=256, kernel_size=(3,3), padding='valid', name="Conv5"))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), name="MaxPool3"))
model.summary()
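Before adding the fully-connected layers, it helps to know how many values FC6 receives. The sketch below re-traces the spatial size with the integer form of the same formula, following the conv 2 route used above (valid conv, pool, 1-pixel zero padding) and assuming, as in the paper, that the third, fourth and fifth convolutions each keep a 13x13 map through 1-pixel padding. It then counts dense-layer parameters as (inputs + 1) x units. `out_size` and `dense_params` are illustrative names. If `model.summary()` shows something other than 6x6x256 after MaxPool3, the padding around the 3x3 convolutions is the first thing to check.

```python
def out_size(w, f, s=1, p=0):
    """Integer output size (W - F + 2P) // S + 1 of a conv or pooling layer."""
    return (w - f + 2 * p) // s + 1


def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return (n_inputs + 1) * n_units


w = out_size(227, 11, s=4)    # Conv1            -> 55
w = out_size(w, 3, s=2)       # MaxPool1         -> 27
w = out_size(w, 5)            # Conv2 ('valid')  -> 23
w = out_size(w, 3, s=2)       # MaxPool2         -> 11
w = w + 2                     # ZeroPadding2D(1) -> 13
for _ in range(3):            # Conv3, Conv4, Conv5, each with 1-pixel padding
    w = out_size(w, 3, p=1)   #                  -> 13
w = out_size(w, 3, s=2)       # MaxPool3         -> 6

flat = w * w * 256            # Flatten: 6 * 6 * 256
fc6 = dense_params(flat, 4096)
fc7 = dense_params(4096, 4096)
print(w, flat, fc6, fc7)
```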
"The fully-connected layers have 4096 neurons each." - Sec 3.5 Para 3
model.add(Flatten())
model.add(Dense(4096, name="FC6"))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, name="FC7"))
model.add(Activation('relu'))
model.add(Dropout(0.5))  # the paper uses dropout with probability 0.5 in both FC6 and FC7
n_classes = 1000  # 1000 ImageNet classes; change this for another dataset
# The last fully-connected layer feeds the softmax directly: no ReLU or dropout after it
model.add(Dense(n_classes, name="FC8"))
model.add(Activation('softmax'))
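# Flips the conv kernels; only needed when loading Theano-trained weights into TensorFlow.
# On a freshly initialised model it has no practical effect.
if K.backend() == 'tensorflow':
    convert_all_kernels_in_model(model)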
model.summary()
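# The paper's 0.0005 is weight decay (an L2 penalty on the weights), not Keras' `decay` argument,
# which shrinks the learning rate on every update, so it is left out here.
sgd = optimizers.SGD(lr=0.01, momentum=0.9)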
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# model.fit(data, labels, epochs=90, batch_size=128)
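In the paper (Sec. 5), the 0.0005 weight decay enters the momentum update as a term proportional to the weights themselves, which is different from Keras' SGD `decay` argument (that one lowers the learning rate over iterations). A NumPy sketch of the paper's rule for a single weight vector follows; `sgd_momentum_weight_decay` is an illustrative name, and the gradient is made up for the example.

```python
import numpy as np


def sgd_momentum_weight_decay(w, grad, v, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule from Sec. 5 of the AlexNet paper:

        v <- momentum * v - weight_decay * lr * w - lr * grad
        w <- w + v
    """
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v


w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, 0.5])  # made-up gradient of the loss w.r.t. w
w, v = sgd_momentum_weight_decay(w, grad, v)
print(w)
```

In Keras, a similar effect usually comes from an L2 `kernel_regularizer` on each layer. Since `regularizers.l2(l)` adds l * sum(w**2) to the loss, its gradient is 2 * l * w, so l = 0.00025 corresponds to a weight decay of 0.0005 in the rule above.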