VGG Convolutional Neural Networks Practical



By Andrea Vedaldi and Andrew Zisserman

This is an Oxford Visual Geometry Group computer vision practical, authored by Andrea
Vedaldi and Andrew Zisserman (Release 2015a).



Convolutional neural networks are an important class of learnable representations applicable, among others, to numerous computer vision problems. Deep CNNs, in particular, are composed of several layers of processing, each involving linear as well as non-linear operators, that are learned jointly, in an end-to-end manner, to solve a particular task. These methods are now the dominant approach for feature extraction from audiovisual and textual data.

This practical explores the basics of learning (deep) CNNs. The first part introduces typical CNN building blocks, such as ReLU units and linear filters, with a particular emphasis on understanding back-propagation. The second part looks at learning two basic
CNNs. The first one is a simple non-linear filter capturing particular image structures, while the second one is a network that recognises typewritten characters (using a variety of different fonts). These examples illustrate the use of stochastic gradient
descent with momentum, the definition of an objective function, the construction of mini-batches of data, and data jittering. The last part shows how powerful CNN models can be downloaded off-the-shelf and used directly in applications, bypassing the expensive
training process.




  • VGG Convolutional Neural Networks Practical

    • Getting started
    • Part 1: CNN building blocks

      • Part 1.1: convolution
      • Part 1.2: non-linear gating
      • Part 1.3: pooling
      • Part 1.4: normalisation


    • Part 2: back-propagation and derivatives

      • Part 2.1: the theory of back-propagation
      • Part 2.2: using back-propagation in practice


    • Part 3: learning a tiny CNN

      • Part 3.1: training data and labels
      • Part 3.2: image preprocessing
      • Part 3.3: learning with gradient descent
      • Part 3.4: experimenting with the tiny CNN


    • Part 4: learning a character CNN

      • Part 4.1: prepare the data
      • Part 4.2: initialize a CNN architecture
      • Part 4.3: train and evaluate the CNN
      • Part 4.4: visualise the learned filters
      • Part 4.5: apply the model
      • Part 4.6: training with jitter
      • Part 4.7: Training using the GPU


    • Part 5: using pretrained models

      • Part 5.1: load a pre-trained model
      • Part 5.2: use the model to classify an image


    • Links and further work
    • Acknowledgements
    • History








  


Getting started



Read and understand the requirements and installation instructions. The download
links for this practical are:


  • Code and data: practical-cnn-2015a.tar.gz
  • Code only: practical-cnn-2015a-code-only.tar.gz
  • Data only: practical-cnn-2015a-data-only.tar.gz
  • Git repository (for lab setters and developers)


After the installation is complete, open and edit the script exercise1.m in the MATLAB editor. The script contains commented code and a description for all steps of this exercise, for Part I of this document. You can cut and paste this code into the MATLAB window to run it, and will need to modify it as you go through the session. The other files exercise2.m, exercise3.m, and exercise4.m are given for Parts II, III, and IV.

Each part contains several Questions (that require pen and paper) and Tasks (that require experimentation or coding) to be answered/completed before proceeding further in the practical.


Part 1: CNN building blocks



Part 1.1: convolution



A feed-forward neural network can be thought of as the composition of a number of functions

$$f(x) = f_L(\dots f_2(f_1(x; w_1); w_2) \dots; w_L).$$



Each function $f_l$ takes as input a datum $x_l$ and a parameter vector $w_l$ and produces as output a datum $x_{l+1}$. While the type and sequence of functions is usually handcrafted, the parameters $w = (w_1, \dots, w_L)$ are learned from data in order to solve a target problem, for example classifying images or sounds.
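
As a concrete illustration (a minimal example added for clarity, not taken from the practical itself), the smallest interesting case is a two-layer network where $f_1$ is a linear convolution and $f_2$ is a ReLU gate:

$$f(x) = f_2(f_1(x; w_1)) = \max\{0,\ w_1 * x\},$$

where the ReLU has no parameters of its own, so $w = (w_1)$ is all that is learned.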

In a convolutional neural network, data and functions have additional structure. The data $x_1, \dots, x_n$ are images, sounds, or more in general maps from a lattice¹ to one or more real numbers. In particular, since the rest of the practical will focus on computer vision applications, data will be 2D arrays of pixels. Formally, each $x_i$ will be an $M \times N \times K$ real array of $M \times N$ pixels and $K$ channels per pixel. Hence the first two dimensions of the array span space, while the last one spans channels. Note that only the input $x = x_1$ of the network is an actual image, while the remaining data are intermediate feature maps.

The second property of a CNN is that the functions $f_l$ have a convolutional structure. This means that $f_l$ applies to the input map $x_l$ an operator that is local and translation invariant. Examples of convolutional operators are applying a bank of linear filters to $x_l$.

In this part we will familiarise ourselves with a number of such convolutional and non-linear operators. The first one is the regular linear convolution by a filter bank. We will start by focusing our attention on a single function relation
as follows:



$$f: \mathbb{R}^{M \times N \times K} \to \mathbb{R}^{M' \times N' \times K'}, \qquad x \mapsto y.$$



Open the exercise1.m file, select the following part of the code, and execute it in MATLAB (right button > Evaluate selection, or Shift+F7).

% Read an example image
x = imread('peppers.png') ;
% Convert to single format
x = im2single(x) ;
% Visualize the input x
figure(1) ; clf ; imagesc(x)

This should display an image of bell peppers in Figure 1.



Use the MATLAB size command to obtain the size of the array x. Note that the array x is converted to single precision format. This is because the underlying MatConvNet library assumes that data is in single precision.
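
As a quick sanity check (a sketch; the expected numbers assume MATLAB's stock peppers.png image):

% peppers.png should give a 384 x 512 x 3 array
sz = size(x)
% im2single should have produced single-precision data
class(x)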


Question. The third dimension of x is 3. Why?



Now we will create a bank of 10 filters of size $5 \times 5 \times 3$.

% Create a bank of linear filters
w = randn(5,5,3,10,'single') ;

The filters are in single precision as well. Note that w has four dimensions, packing 10 filters. Note also that each filter is not flat, but rather a volume with
three layers. The next step is applying the filter to the image. This uses the vl_nnconv function from MatConvNet:

% Apply the convolution operator
y = vl_nnconv(x, w, []) ;

Remark: You might have noticed that the third argument to the vl_nnconv function is the empty matrix []. It can otherwise be used to pass a vector of bias terms to add to the output of each filter.

The variable y contains the output of the convolution. Note that the filters are three-dimensional, in the sense that each operates on a map $x$ with $K$ channels. Furthermore, there are $K'$ such filters, generating a $K'$-dimensional map $y$ as follows:

$$y_{i'j'k'} = \sum_{ijk} w_{ijkk'} \, x_{i+i',\, j+j',\, k}.$$


Questions: Study carefully this expression and answer the following:

  • Given that the input map $x$ has $M \times N \times K$ dimensions and that each of the $K'$ filters has dimension $M_f \times N_f \times K$, what is the dimension of $y$?
  • Note that $x$ is indexed by $i + i'$ and $j + j'$, but that there is no plus sign between $k$ and $k'$. Why?

Task: check that the size of the variable y matches your calculations (see the sketch below).
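
A minimal sketch to check this in MATLAB (assuming the variables x, w, and y defined above):

% For an M x N x K input and K' filters of size Mf x Nf x K, a 'valid'
% convolution (no padding, stride 1) gives an output of size
% (M - Mf + 1) x (N - Nf + 1) x K'.
[M, N, K] = size(x) ;
[Mf, Nf, ~, Kp] = size(w) ;
assert(isequal(size(y), [M - Mf + 1, N - Nf + 1, Kp])) ;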



We can now visualise the output y of the convolution. In order to do this, use the vl_imarraysc function
to display an image for each feature channel in y:

% Visualize the output y
figure(2) ; clf ; vl_imarraysc(y) ; colormap gray ;

Question: Study the feature channels obtained. Most will likely contain a strong response in correspondence with edges in the input image x. Recall that w was obtained by drawing random numbers from a Gaussian distribution. Can you explain this phenomenon?



So far filters preserve the resolution of the input feature map. However, it is often useful to downsample the output. This can be obtained by using the stride option
in vl_nnconv:

% Try again, downsampling the output
y_ds = vl_nnconv(x, w, [], 'stride', 16) ;
figure(3) ; clf ; vl_imarraysc(y_ds) ; colormap gray ;
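
The rule governing the downsampled size can be checked directly (a sketch, assuming the variables above and no padding):

% With 'stride', s, the spatial output size is floor((M - Mf)/s) + 1
[M, N, ~] = size(x) ;
s = 16 ; Mf = 5 ; Nf = 5 ;
expected = [floor((M - Mf)/s) + 1, floor((N - Nf)/s) + 1]
% the first two entries of size(y_ds) should match 'expected'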

As you should have noticed in a question above, applying a filter to an image or feature map interacts with the boundaries, making the output map smaller by an amount proportional to the size of the filters. If this is undesirable, then the input array can
be padded with zeros by using the pad option:

% Try padding
y_pad = vl_nnconv(x, w, [], 'pad', 4) ;
figure(4) ; clf ; vl_imarraysc(y_pad) ; colormap gray ;

Task: Convince yourself that the previous code's output has different boundaries compared to the code that does not use padding. Can you explain the result?
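
A quick size calculation shows what padding does here (a sketch, assuming the variables above):

% With 'pad', P on every side, the spatial output size is M + 2*P - Mf + 1.
% For 5x5 filters, P = 2 would exactly preserve the input size, so
% 'pad', 4 actually makes the output larger than the input.
[M, N, ~] = size(x) ;
P = 4 ; Mf = 5 ; Nf = 5 ;
expected = [M + 2*P - Mf + 1, N + 2*P - Nf + 1]
% compare with size(y_pad)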



In order to consolidate what has been learned so far, we will now design a filter by hand:

w = [0  1 0 ;
     1 -4 1 ;
     0  1 0 ] ;
w = single(repmat(w, [1, 1, 3])) ;
y_lap = vl_nnconv(x, w, []) ;
figure(5) ; clf ; colormap gray ;
subplot(1,2,1) ;
imagesc(y_lap) ; title('filter output') ;
subplot(1,2,2) ;
imagesc(-abs(y_lap)) ; title('- abs(filter output)') ;

Questions:

  • What filter have we implemented?
  • How are the RGB colour channels processed by this filter?
  • What image structures are detected?



Part 1.2: non-linear gating



As we stated in the introduction, CNNs are obtained by composing several different functions. In addition to the linear filters shown in the previous
part, there are several non-linear operators as well.


Question: Some of the functions in a CNN must be non-linear. Why?



The simplest non-linearity is obtained by following a linear filter by a non-linear gating function, applied identically to each component (i.e. point-wise) of a feature map. The simplest such function is the Rectified Linear
Unit (ReLU)



$$y_{ijk} = \max\{0, x_{ijk}\}.$$



This function is implemented by vl_nnrelu; let’s try this out:

w = single(repmat([1 0 -1], [1, 1, 3])) ;
w = cat(4, w, -w) ;
y = vl_nnconv(x, w, []) ;
z = vl_nnrelu(y) ;
figure(6) ; clf ; colormap gray ;
subplot(1,2,1) ; vl_imarraysc(y) ;
subplot(1,2,2) ; vl_imarraysc(z) ;
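
Because the ReLU acts point-wise, its effect is easy to verify against an explicit maximum (a sketch using the variables above):

% vl_nnrelu(y) should agree element-wise with max(y, 0)
z_ref = max(y, 0) ;
assert(isequal(z, z_ref)) ;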

Tasks:

  • Run the code above and understand what the filter $w$ is doing.
  • Explain the final result $z$.



Part 1.3: pooling



There are several other important operators in a CNN. One of them is pooling. A pooling operator operates on individual feature channels, coalescing nearby feature values into one by the application of a suitable operator. Common choices
include max-pooling (using the max operator) or sum-pooling (using summation). For example, max-pooling is defined as:



$$y_{ijk} = \max \{\, y_{i'j'k} : i \le i' < i + p,\ j \le j' < j + p \,\}.$$
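
Max-pooling can be tried directly using MatConvNet's vl_nnpool function (a minimal sketch; the 15-pixel window is an arbitrary choice for illustration):

% Apply 15x15 max-pooling to the convolution output y
y_pool = vl_nnpool(y, 15) ;
figure(7) ; clf ; vl_imarraysc(y_pool) ; colormap gray ;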


Part 5: using pretrained models



A characteristic of deep learning is that it constructs representations of the data. These representations tend to have a universal value, or at least to be applicable to an array of problems that transcends the particular task a model was trained for. This is fortunate, as training complex models requires weeks of work on one or more GPUs or hundreds of CPUs; these models can then be frozen and reused for a number of additional applications, with no or minimal additional work.

In this part we will see how MatConvNet can be used to download and run high-performance CNN models for image classification. These models are trained on 1.2M images from the ImageNet dataset to discriminate 1,000 different object categories.

Several pretrained models can be downloaded from the MatConvNet website, including several trained using other CNN implementations such as Caffe. One such model is included in the practical as the data/imagenet-vgg-verydeep-16.mat file. This is one of the best models from the ImageNet ILSVRC Challenge 2014.


Part 5.1: load a pre-trained model



The first step is to load the model itself. This is in the format of the vl_simplenn CNN wrapper, and ships as a MATLAB .mat file:

net = load('data/imagenet-vgg-verydeep-16.mat') ;
vl_simplenn_display(net) ;

Tasks:

  • Look at the output of vl_simplenn_display and understand the structure of the model. Can you understand why it is called “very deep”?
  • Look at the size of the file data/imagenet-vgg-verydeep-16.mat on disk. This is just the model.
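
The structure can also be inspected programmatically (a sketch; the layer count in the comment is an assumption about this particular file):

% In the vl_simplenn format, net.layers is a cell array of layer structs
numel(net.layers)      % total number of layers (37 for this VGG-16 file)
net.layers{1}.type     % type of the first layer, e.g. 'conv'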



Part 5.2: use the model to classify an image



We can now use the model to classify an image. We start from peppers.png, a MATLAB stock image:

% obtain and preprocess an image
im = imread('peppers.png') ;
im_ = single(im) ; % note: 255 range
im_ = imresize(im_, net.normalization.imageSize(1:2)) ;
im_ = im_ - net.normalization.averageImage ;

The code normalises the image into a format compatible with the model net. This amounts to: converting the image to single format (but with range [0, 255] rather than [0, 1] as typical in MATLAB), resizing the image to a fixed size, and then subtracting an average image.

It is now possible to call the CNN:

% run the CNN
res = vl_simplenn(net, im_) ;

As usual, res contains the results of the computation, including all intermediate layers. The last one can be used to perform the classification:

% show the classification result
scores = squeeze(gather(res(end).x)) ;
[bestScore, best] = max(scores) ;
figure(1) ; clf ; imagesc(im) ;
title(sprintf('%s (%d), score %.3f',...
net.classes.description{best}, best, bestScore)) ;
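
A small extension of the same code (a sketch reusing the variables above) prints the five highest-scoring classes instead of only the best one:

% rank all class scores and report the top five
[~, order] = sort(scores, 'descend') ;
for k = 1:5
  fprintf('%d. %s (score %.3f)\n', k, ...
    net.classes.description{order(k)}, scores(order(k))) ;
end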

That completes this practical.


Links and further work



  • The code for this practical is written using the software package MatConvNet. This is a software library
    written in MATLAB, C++, and CUDA and is freely available as source code and binary.
  • The ImageNet model is the VGG very deep 16 of Karen Simonyan and Andrew Zisserman.


Acknowledgements



  • Beta testing by: Karel Lenc and Carlos Arteta.
  • Bugfixes/typos by: Sun Yushi.


History



  • Used in the Oxford AIMS CDT, 2014-15.



  • ¹ A two-dimensional lattice is a discrete grid embedded in $\mathbb{R}^2$, similar for example to a checkerboard.



from: http://www.robots.ox.ac.uk/~vgg/practicals/cnn/