DEEP LEARNING
M. Lukac

WHY NOW?
GPUs!
Good data, good annotated data (ImageNet).
Some great, simple ideas.
Most of the techniques are 20-30 years old.
99% of the computation is matrix multiplication.

NEURAL NETWORKS
Beginnings in the 1950s and 60s.
Biologically inspired by the brain and its neurons.
A boom every ten years...
State of the art for multimedia data processing!
Problems?

NEURON
Perceptron: the output is y = f(Wx + b), where f is the activation function.
Input x = [x0 = 1.0, x1, x2, ..., xn]
Weights W = [w0, w1, w2, ..., wn]

ACTIVATION FUNCTION
Step function, tanh, sigmoid.

SIMPLE IMPLEMENTATION
(a NumPy sketch of a single neuron appears further below)

NEURAL NETWORK: SIMPLE IMPLEMENTATION

PLAYGROUND.TENSORFLOW.ORG
Overfitting?

HOW TO LEARN WEIGHTS? #VERY SIMPLIFIED
Weights are learned with the backpropagation algorithm (gradient-based learning).
Update the weights (alpha is the learning rate, e.g. 0.01):
    w_i ← w_i − alpha · dL/dw_i
Objective function for binary classification [0, 1] is the binary cross-entropy:
    L = −[ y · log(ŷ) + (1 − y) · log(1 − ŷ) ], where y is the label and ŷ = f(Wx + b) the prediction.
Gradient for the last layer (sigmoid output):
    dL/dw_i = (ŷ − y) · x_i
You need to propagate the error back through the network.
"The gradient points in the direction of the greatest rate of increase of the function; its components are the partial derivatives."

TRAINING
Monitor the objective function: it should decrease over time.
Play with the learning rate, alpha = [0.1, 0.001, 0.05, ...].
Train with mini-batches of samples.
Normalize your data to <0, 1>, ...

SOFTMAX LAYER
A softmax layer exponentiates the activations and divides each of them by the sum of all the exponentiated activations, thereby forcing the outputs of the layer to take the form of a probability distribution (non-negative, summing to 1).

OVERFITTING

DROPOUT LAYER
A regularization technique.
Active only during training.
With some probability, set the output of a unit to zero.

GRADIENT PROBLEM
Vanishing gradient: the gradients get smaller in every next layer when backpropagating the error.
Exploding gradient: the gradients get larger in every next layer when backpropagating the error.
Result: unable to learn deeper models (the lower layers).
Why? The weights and activation functions squeeze the gradients.
[Understanding the difficulty of training deep feedforward neural networks, X. Glorot, 2010]

WHY NOW? SOLVING THE GRADIENT PROBLEM #2
Rectified Linear Unit (ReLU) as the activation function: f(x) = max(0, x).
Intelligent initialization of the weights at the beginning of training.
This doesn't solve the problem, it only mitigates it.

HOW TO INITIALIZE WEIGHTS?
1. Random uniform from [-e, e]
2. Gaussian distribution
3. Xavier initialization
4. Pretraining with RBM models

AUTOENCODER
Non-linear dimensionality reduction (compression).
An encoder part and a decoder part.
After training, throw away the decoder part.
It can work better than PCA.
Training often ends in a local optimum...
[Reducing the Dimensionality of Data with Neural Networks, Science, 2006, Hinton]

AUTOENCODER PRETRAINED BY RBM
https://github.com/Cospel/rbm-ae-tf

AUTOENCODER VS PCA
(figure: reconstructions of the original input by the autoencoder and by PCA)

CONVOLUTIONAL NN
A stack of convolution, pooling, ReLU and fully connected layers.
State of the art in computer vision.
Convolutional layer: weight sharing, local connectivity.
It is impractical to connect neurons to all neurons in the previous volume.

INPUT, CONV, POOLING, RELU LAYERS

LEARNED CNN FEATURES

POOLING LAYER
Subsampling the image.
Smaller outputs = faster learning.

MANY CNN ARCHITECTURES
1. AlexNet
2. VGG
3. ResNet
4. SqueezeNet
5. GoogLeNet
6. ...
[ImageNet Classification with Deep Convolutional Neural Networks, A. Krizhevsky, G. Hinton, 2012]

TRANSFER LEARNING
1. Use existing weights or the entire NN and fine-tune it on new data from a similar domain.
2. Use CNN descriptors as input for algorithms such as KNN, SVM, ...
Many pretrained models (weights) are available to download on GitHub:
VGG with face descriptors
Models for places
...
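Below are a few code sketches for the slides above; each one states its assumptions. First, a minimal NumPy sketch of the single neuron from the NEURON slide; the input values and weights are made up.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    # y = f(Wx + b), with f = sigmoid here
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2, 3.0])    # inputs x1..xn (made-up values)
W = np.array([0.1, -0.4, 0.25])   # weights w1..wn (made-up values)
b = 1.0                           # bias term (w0 * x0 with x0 = 1.0)
print(neuron(x, W, b))
```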
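A sketch of the update rule from HOW TO LEARN WEIGHTS?: one sigmoid output unit trained with binary cross-entropy and plain gradient descent. The sample, label and learning rate are illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p):
    # Binary cross-entropy for a label y in {0, 1} and a prediction p in (0, 1)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

alpha = 0.01                       # learning rate
x = np.array([0.2, 0.7, -0.5])     # one training sample (illustrative)
y = 1.0                            # its binary label
w = np.zeros(3)                    # weights
b = 0.0                            # bias

for step in range(100):
    p = sigmoid(np.dot(w, x) + b)  # forward pass
    grad_w = (p - y) * x           # dL/dw for a sigmoid unit with BCE loss
    grad_b = (p - y)               # dL/db
    w -= alpha * grad_w            # gradient-descent update
    b -= alpha * grad_b

print(bce(y, sigmoid(np.dot(w, x) + b)))  # the loss should have decreased
```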
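The computation from the SOFTMAX LAYER slide as a small NumPy sketch; the logits are arbitrary.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw activations of the last layer
probs = softmax(logits)
print(probs, probs.sum())            # non-negative values that sum to 1
```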
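A sketch of the DROPOUT LAYER, written as the common "inverted dropout" variant that rescales the surviving units at training time; the drop probability is an assumption.

```python
import numpy as np

def dropout(a, p_drop=0.5, training=True):
    # Inverted dropout: at training time, zero each unit with probability p_drop
    # and rescale the survivors so the expected activation stays the same.
    if not training:
        return a                             # identity at test time
    mask = np.random.rand(*a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

activations = np.array([0.3, 1.2, 0.7, 2.1])
print(dropout(activations))                  # roughly half the units are zeroed
```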
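ReLU and Xavier initialization from the gradient-problem slides, sketched in NumPy; the layer sizes are illustrative.

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x); its gradient is 1 for positive inputs,
    # so it squeezes gradients less than sigmoid or tanh
    return np.maximum(0.0, x)

def xavier_uniform(n_in, n_out):
    # Xavier/Glorot initialization: uniform in [-e, e] with e = sqrt(6 / (n_in + n_out)),
    # chosen to keep the activation/gradient variance roughly constant across layers
    e = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-e, e, size=(n_in, n_out))

W1 = xavier_uniform(784, 256)          # e.g. first hidden layer of an MNIST-sized net
h1 = relu(np.random.rand(784) @ W1)    # forward pass through that layer
```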
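A minimal Keras sketch of a plain autoencoder (without the RBM pretraining used in the linked repo); the 784-dimensional input and the layer sizes are assumptions.

```python
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))
encoded = Dense(128, activation='relu')(inputs)
code = Dense(32, activation='relu')(encoded)        # low-dimensional code (compression)
decoded = Dense(128, activation='relu')(code)
outputs = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(inputs, outputs)                # encoder + decoder, trained to reconstruct the input
encoder = Model(inputs, code)                       # the part you keep after training
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)  # x_train normalized to <0, 1>
```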
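A sketch of option 2 from the TRANSFER LEARNING slide: a pretrained CNN used as a fixed descriptor extractor whose features feed a classical classifier such as an SVM. VGG16 with ImageNet weights is just one commonly available pretrained model, and `train_paths`/`labels` are hypothetical placeholders for your own data.

```python
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

# Pretrained convolutional base without the classification head;
# global average pooling gives one 512-d descriptor per image.
model = VGG16(weights='imagenet', include_top=False, pooling='avg')

def descriptor(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]

# 'train_paths' and 'labels' are hypothetical:
# feats = np.stack([descriptor(p) for p in train_paths])
# from sklearn.svm import SVC
# clf = SVC().fit(feats, labels)
```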
RECURRENT NN
Good for time-series data, NLP, video sequences, ...
Trained with backpropagation through time ...
It has an internal hidden state (a memory for the sequence).

RECURRENT NN: SIMPLE IMPLEMENTATION

FRAMEWORKS

KERAS FEEDFORWARD NET

KERAS RECURRENT NET

KERAS CONVOLUTIONAL NETWORK
(minimal Keras sketches of these three networks follow below)

THANK YOU...
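A minimal sketch of the KERAS FEEDFORWARD NET slide: a fully connected classifier with dropout. The 784-dimensional input and 10 classes are assumptions (an MNIST-sized problem).

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(256, activation='relu', input_shape=(784,)),
    Dropout(0.5),                       # regularization, active only during training
    Dense(10, activation='softmax'),    # class probabilities that sum to 1
])
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=32, epochs=10)  # mini-batch training
```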
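A minimal sketch of the KERAS RECURRENT NET slide for binary sequence classification; the vocabulary size, sequence length and LSTM width are assumptions.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64, input_length=100),  # token ids -> vectors
    LSTM(64),                          # hidden state carries memory across the sequence
    Dense(1, activation='sigmoid'),    # one binary label per sequence
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```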
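A minimal sketch of the KERAS CONVOLUTIONAL NETWORK slide, following the Conv - ReLU - Pool - Fully Connected stack described earlier; the 28x28x1 input shape is an assumption (MNIST-sized images).

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # weight sharing, local connectivity
    MaxPooling2D((2, 2)),                                            # subsampling: smaller outputs, faster learning
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```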