Daniel's Deep Learning Notes: A Deep Learning Crash Course Tutorial

This article was compiled from the technology-reading notes of zouxy.

Deep Learning is a family of learning algorithms and an important branch of the field of artificial intelligence. In the space of a few years it has gone from research to practical application, overturning the established algorithm designs in speech recognition, image classification, text processing and many other fields, and gradually establishing a new paradigm: start from the training data, pass it through an end-to-end model, and directly output the final result. So how deep is deep learning? How does it learn? This article offers a taste of the methods and processes behind it.

Contents

1. Overview
2. Background
3. The visual mechanism of the human brain
4. Features
   4.1 Granularity of features
   4.2 Primary (shallow) features
   4.3 Structural features
   4.4 How many features are needed?
5. The basic idea of Deep Learning
6. Shallow Learning and Deep Learning
7. Deep Learning and Neural Networks
8. The Deep Learning training process
   8.1 Training of traditional neural networks
   8.2 The deep learning training process
9. Common Deep Learning models and methods
   9.1 AutoEncoder
   9.2 Sparse Coding
   9.3 Restricted Boltzmann Machine (RBM)
   9.4 Deep Belief Networks
   9.5 Convolutional Neural Networks
10. Summary and outlook

| 1. Overview

Artificial Intelligence (AI), like immortality and interstellar travel, is one of humanity's most beautiful dreams. Computer technology has made considerable progress, yet so far no computer has produced "self" consciousness. Yes, with the help of humans and large amounts of ready-made data, computers can be very powerful; but deprived of both, they cannot even tell a cat from a dog.

Turing (we all know him: a founder of computing and of artificial intelligence, remembered for the "Turing machine" and the "Turing test") proposed the idea of the Turing test in a 1950 paper: conversing through a partition, you should not be able to tell whether you are talking to a person or a computer. This undoubtedly set a high expectation for computers, and for artificial intelligence in particular. But half a century later, progress in artificial intelligence remained far short of the Turing test standard. Years of unmet anticipation left many in despair, and some came to dismiss fields related to artificial intelligence as "pseudoscience".

But since 2006 the field of machine learning has made breakthrough progress, and the Turing test no longer seems quite so elusive. The technical means rely not only on the parallel processing capability of cloud computing over big data, but also on an algorithm. That algorithm is Deep Learning. With Deep Learning, humanity has finally found a way to attack the eternal problem of "abstraction".


In June 2012, The New York Times disclosed the Google Brain project, attracting wide public attention. The project was co-led by the renowned Stanford University professor Andrew Ng and Jeff Dean, a world-class expert in large-scale computer systems. Using a parallel computing platform of 16,000 CPU cores, they trained a machine learning model called a deep neural network (DNN, Deep Neural Network) with a total of one billion internal nodes. (This network, of course, is not comparable to a human neural network. You should know that the human brain has more than 15 billion neurons, and the number of interconnected synapses is beyond counting, like the stars of the Milky Way. It has been calculated that if all the neurons, axons and dendrites in one person's brain were connected end to end and pulled into a straight line, it would stretch from the Earth to the Moon and back again.) The model achieved huge success in the areas of speech and image recognition.

Project leader Andrew Ng said: "Instead of framing the boundaries ourselves, as is usually done, we feed massive amounts of data directly into the algorithm and let the data speak for itself; the system learns automatically from the data." Co-leader Jeff Dean said: "During training we never tell the machine 'this is a cat.' The system invented, or rather grasped, the concept of 'cat' on its own."


In November 2012, at an event in Tianjin, China, Microsoft publicly demonstrated a fully automatic simultaneous interpretation system: the lecturer spoke in English while a back-end computer, in one pass, performed automatic speech recognition, English-to-Chinese machine translation and Chinese speech synthesis, with very smooth results. According to reports, the key technology behind it was DNN, that is, deep learning (DL, Deep Learning).

In January 2013, at Baidu's annual conference, founder and CEO Robin Li announced in a high-profile manner the establishment of Baidu Research, whose first institute was the Institute of Deep Learning (IDL).


Why are the big Internet companies scrambling to invest so many resources in deep learning research? Deep learning sounds great, but what exactly is it? Why does it exist? How did it come about? What can it do? What difficulties remain? The short answers to these questions all take some telling. Let us first look at the background of machine learning, the core of artificial intelligence.

| 2. Background

Machine Learning is the discipline that studies how computers can simulate or realize human learning behavior, in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Can machines have a learning ability like humans do? In 1959, Arthur Samuel in the United States designed a checkers-playing program with the ability to learn: it could continuously improve its skill through play. Four years later, the program beat its designer; three more years on, it defeated a champion who had been unbeaten in the United States for eight years. This program showed people the power of machine learning and raised many thought-provoking social and philosophical questions. (Well, while the main track of artificial intelligence has not developed all that fast, its philosophical ethics have developed very quickly: machines will become more and more like people and people more and more like machines; machines will turn against humanity, with the ATM firing the first shot; and so on. Human imagination is infinite.)

Machine learning has been developing for decades, but many problems are still not well solved:


For example: image recognition, speech recognition, natural language understanding, weather forecasting, gene expression, content recommendation, and so on. At present, our way of solving these problems with machine learning goes like this (taking visual perception as an example):


We start by acquiring data from a sensor (for example a CMOS camera), then perform preprocessing, feature extraction and feature selection, and finally inference, prediction or recognition. The last part is where machine learning proper does its work, and it is where most papers and research are concentrated.

The middle three parts, in a word, amount to feature representation. Good features play a very critical role in the final accuracy of the algorithm, and developing and testing them consumes most of a system team's effort. Yet in practice this step is generally done by hand: manual feature extraction.


To date, many excellent features have emerged (a good feature should be invariant (to size, scale and rotation) and discriminative). The emergence of SIFT, for example, was a landmark in research on local image descriptors: because SIFT is invariant to scaling, rotation and, to a degree, viewpoint and illumination changes, and is at the same time strongly discriminative, it made many problems solvable. But it is not a panacea.


However, manually selecting features is very laborious and heuristic (it requires expert knowledge); whether you pick good ones depends to a large extent on experience and luck, and tuning them takes a great deal of time. Since manual feature selection is not so good, can features be learned automatically? The answer is yes! Deep Learning is for exactly this, and one of its aliases, Unsupervised Feature Learning, says it all: here "unsupervised" means that no human takes part in the feature selection process.

So how does it learn? How does it know which features are good and which are bad? We said that machine learning is the discipline of computers simulating human learning. Well, how does our own visual system work? Why, in a vast crowd of ordinary mortals, can we pick out that one special person (because you are deep in my mind, in my dream, in my heart, in my song...)? The human brain is remarkable; can we take it as a reference and simulate it? (Good features, and good algorithms, are often said to be associated with the human brain, though whether the association is genuine or imposed after the fact, to lend the work an air of elegance, I do not know.) In recent decades, developments in cognitive neuroscience, biology and related fields have made the once mysterious and magical brain far less strange to us, and have also pushed artificial intelligence forward.

| 3. The Visual Mechanism of the Human Brain

The 1981 Nobel Prize in Medicine was awarded to David Hubel (a Canadian-born American neurobiologist), Torsten Wiesel and Roger Sperry. The main contribution of the first two was the discovery of how the visual system processes information: the visual cortex is hierarchical.


Let us look at what they did. In 1958, at Johns Hopkins University, David Hubel and Torsten Wiesel studied the correspondence between the pupil region and neurons in the cerebral cortex. They opened a small 3 mm hole in a cat's skull and inserted electrodes through the hole to measure the activity of neurons.

Then, in front of the kitten's eyes, they displayed objects of various shapes and brightnesses, and while showing each object they also varied its position and angle. The hope was that in this way the kitten's pupil would experience different types and strengths of stimulation.

The purpose of the experiment was to prove a conjecture: that different visual neurons in the cerebral cortex correspond to different stimuli at the pupil, so that once the pupil receives a certain stimulus, a certain part of the cortex becomes active. After many days of tedious experiments, at the expense of some poor kittens, David Hubel and Torsten Wiesel discovered a type of neuron they called "orientation selective cells": when the pupil detects the edge of an object, and that edge points in a particular direction, these neurons become active.

This discovery stimulated further thinking about the nervous system. The processing chain from nerve to center to brain may be a process of continual iteration and continual abstraction. There are two keywords here: abstraction and iteration. Starting from the raw signal, low-level abstractions are iterated step by step toward high-level abstractions; human logical thinking habitually works with highly abstract concepts.

For example, it starts from the raw intake (the pupil takes in pixels), proceeds to preliminary processing (certain cells in the cerebral cortex detect edges and orientations), then to abstraction (the brain determines that the object in front of it is round in shape), and then to further abstraction (the brain further determines that the object is a balloon).


This physiological discovery contributed to the breakthrough development of computer artificial intelligence forty years later.

In general, the information processing of the human visual system is hierarchical: from the low-level V1 area, which extracts edge features, to V2, which extracts shapes or parts of the target, and on to higher levels, the whole target and its behavior. That is, high-level features are combinations of low-level features; from bottom to top the features become more and more abstract and better express semantics or intention. And the higher the abstraction, the fewer the possible guesses, and the better for classification. For example, the mapping from a set of words to a sentence is many-to-one, from sentence to semantics again many-to-one, and from semantics to intention still many-to-one: it is a hierarchical architecture.

The sensitive reader will have noticed the keyword: hierarchy. And the "deep" of Deep Learning means exactly that: how many layers are there, how deep does it go? That's right. So how does Deep Learning borrow from this process? After all, for a computer to handle it, the question is how to model this process.

Because what we want to learn is a feature representation (or a hierarchy of features), we need to understand features in greater depth. So before getting to Deep Learning, we need to talk about features. (Ha ha, in fact I saw such a good explanation of features that it would be a pity not to include it here.)

| 4. Features

Features are the raw material of a machine learning system; their influence on the final model is beyond doubt. If the data are well expressed as features, a linear model is usually enough to achieve satisfactory accuracy. So what do we need to consider about features?

4.1 Granularity of features

At what granularity of features can a learning algorithm actually work? For an image, pixel-level features have no value at all. Take the motorcycle below: at the pixel level, a human gets no information whatsoever and cannot distinguish motorcycle from non-motorcycle. But if the feature has structure (or meaning), such as whether it has a handlebar (handle) or wheels (wheel), then it becomes easy to tell motorcycles and non-motorcycles apart, and a learning algorithm can do its job.


4.2 Primary (shallow) features

Since pixel-level feature representations are useless, what granularity of representation is useful?

Around 1995, two scholars, Bruno Olshausen and David Field, then at Cornell University, tried a two-pronged attack on vision problems, combining physiology and computational means. They collected many black-and-white landscape photographs and extracted 400 small fragments from them, each 16×16 pixels; let us label these fragments S[i], i = 0, ..., 399. They then extracted another fragment at random from the photographs, also 16×16 pixels; label this fragment t.

Their question was: how can we select from these 400 fragments a subset S[k] and, by superposition, synthesize a new fragment that is as similar as possible to the randomly chosen target fragment t, while using as few S[k] as possible? In mathematical language:

        Sum_k (a[k] * S[k]) --> t, where a[k] is the weight coefficient of fragment S[k].

        To solve this problem, Bruno Olshausen and David Field invented an algorithm: sparse coding (Sparse Coding).

Sparse coding is a repeated iterative process; each iteration has two steps:

1) Select a set of S[k], then adjust the a[k] so that Sum_k (a[k] * S[k]) is as close as possible to t.

2) Fix the a[k]; among the 400 fragments, look for more suitable fragments S'[k] to replace the S[k], so that Sum_k (a[k] * S'[k]) comes closer to t.

After several iterations, the best combination of S[k] is selected. Astonishingly, the selected S[k] were essentially edge lines of different objects in the images, similar in shape and differing mainly in orientation.
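The two-step iteration above can be sketched in code. The following is an illustrative stand-in, not the authors' original algorithm: it approximates the fragment-selection step with a greedy, matching-pursuit-style rule, refitting the weights a[k] by least squares each time a fragment is added, and it uses random noise in place of real 16×16 image patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 candidate fragments S[k], each a flattened 16x16 patch (one per column).
# A real experiment would cut these from photographs; noise is a stand-in.
S = rng.normal(size=(256, 400))
S /= np.linalg.norm(S, axis=0)
t = rng.normal(size=256)               # the random target fragment

def sparse_code(S, t, k=10):
    """Greedily pick k fragments and fit weights a[k] so that
    sum_k a[k] * S[k] approximates t."""
    chosen, resid = [], t.copy()
    for _ in range(k):
        # pick the fragment most correlated with what is still unexplained
        scores = np.abs(S.T @ resid)
        scores[chosen] = -np.inf
        chosen.append(int(np.argmax(scores)))
        # refit all weights for the current selection (step 1 of the iteration)
        a, *_ = np.linalg.lstsq(S[:, chosen], t, rcond=None)
        resid = t - S[:, chosen] @ a
    return chosen, a

chosen, a = sparse_code(S, t)
print(len(chosen), round(float(np.linalg.norm(t - S[:, chosen] @ a)), 3))
```

Run on real image patches, the selected fragments would tend to be the oriented edge segments described above.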

The result of Bruno Olshausen and David Field's algorithm coincides with the physiological discovery of David Hubel and Torsten Wiesel!

In other words, complex graphics are often composed of certain basic structures. In the figure below, a patch x can be linearly represented by 64 orthogonal edges (which can be regarded as orthogonal basic structures): x is composed of three of the 64 edges with weights 0.8, 0.3 and 0.5; the other basic edges contribute nothing, so their weights are 0.
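A toy numerical version of this decomposition (the basis here is a random orthonormal one rather than real edge filters, and the indices of the three active edges are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Columns of B form a random orthonormal "edge" basis for 64-dim patches
# (a stand-in for real edge filters).
B, _ = np.linalg.qr(rng.normal(size=(64, 64)))

# A sample x made from three of the 64 edges (indices picked arbitrarily
# for illustration) with weights 0.8, 0.3 and 0.5; all other weights are 0.
coeffs = np.zeros(64)
coeffs[[1, 3, 5]] = [0.8, 0.3, 0.5]
x = B @ coeffs

# Because the basis is orthonormal, each weight is recovered by projection.
recovered = B.T @ x
print(np.round(recovered[[1, 3, 5]], 6))   # → [0.8 0.3 0.5]
```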


Moreover, researchers found that this regularity exists not only in images but also in sounds: from unlabeled sound they discovered 20 basic sound structures, from which the rest of the sounds could be synthesized.


4.3 Structural features

Small graphic patches can be composed from edges; what about more structured, more complex, conceptual graphics? That requires a higher level of feature representation, such as V2 and V4. V1 operates at pixel level; V2's "pixels" are V1's outputs. The levels build progressively: higher-level expressions are formed by combining lower-level expressions. In technical terms, these building blocks are called a basis. The basis chosen for V1 consists of edges; V2 then combines these V1 bases, obtaining a basis one level higher, and combinations of that layer's basis form the layer above, and so on. (One expert joked that deep learning is really just "basis-hunting", which sounded unflattering, so it was given the nicer names Deep Learning and Unsupervised Feature Learning.)


Intuitively, the idea is to find small meaningful patches, combine them to obtain the features of the layer above, and recursively learn features upward.

When training is done on different classes of objects, the edge bases obtained are very similar, but the object parts and models are completely different (which makes it much easier to tell a car from a face):


The same goes for text. What does a doc represent? How do we describe one thing suitably? With single characters? I think not; characters are the pixel level. At the least it should be terms; in other words, every doc is composed of terms. But is the expressive capacity of terms enough? Perhaps not, so we step up to the topic level; with topics, going up to the doc is reasonable. The number of items at each level differs greatly, e.g. the concepts a doc represents -> topics (on the order of thousands to tens of thousands) -> terms (on the order of 100,000) -> words (on the order of millions).

When a person reads a doc, the eyes see words; the brain automatically segments them into terms, organizes these according to concepts learned beforehand to obtain topics, and then carries out higher-level learning.


4.4 How many features are needed?

We know that features need to be built hierarchically, but how many features should each level have?

By any method, the more features there are, the more reference information is provided, and accuracy should improve. But more features also mean more complexity: the search space grows, and the data available for training each feature become sparse, which brings problems of its own. So more features are not necessarily better.
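A quick numerical illustration of why more features are not automatically better (toy numbers): with a fixed number of samples, as the number of features grows the data become sparse and pairwise distances concentrate, so "nearest" and "farthest" neighbors become hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60                                    # a fixed, modest number of samples
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.random(size=(n, d))           # n samples with d features each
    diff = X[:, None, :] - X[None, :, :]  # all pairwise differences
    dist = np.sqrt((diff ** 2).sum(-1))
    off = dist[~np.eye(n, dtype=bool)]    # drop the zero self-distances
    ratios[d] = off.min() / off.max()     # nearest / farthest neighbor
    print(d, round(float(ratios[d]), 3))  # ratio creeps toward 1 as d grows
```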


Good. At this point we can finally talk about deep learning. We have discussed why it exists (to let machines learn good features automatically, removing the manual selection process, and to take the hierarchical visual processing system of humans as a reference), and we conclude that deep learning requires multiple layers in order to obtain more abstract feature representations. But how many layers are appropriate? What architecture should be used to model this? And how can it be trained without supervision?

| 5. The Basic Idea of Deep Learning

Suppose we have a system S with n layers (S1, ..., Sn), with input I and output O, depicted as: I => S1 => S2 => ... => Sn => O. If the output O equals the input I, then the input has passed through the system without any loss of information. (An aside: an expert would say this is impossible. Information theory has a principle of "layer-by-layer information loss" (the data processing inequality): if processing a yields b, and processing b yields c, then it can be proved that the mutual information of a and c does not exceed the mutual information of a and b. This shows that processing cannot add information, and most processing loses information. Of course, if what is lost is useless information, so much the better.) If the information stays unchanged, it means that the input I suffers no information loss through any layer Si; that is, at every layer Si, it is simply another representation of the original information (the input I). Now back to our theme, Deep Learning: we want to learn features automatically. Suppose we have a pile of inputs I (a pile of images, say, or text), and suppose we design a system S with n layers. By adjusting the parameters of the system so that its output is still the input I, we can automatically obtain a series of hierarchical features of the input I, namely S1, ..., Sn.

For deep learning, the idea is to stack multiple layers of this kind, with the output of one layer serving as the input of the next. In this way, a hierarchical expression of the input can be achieved.

In addition, the assumption above that the output strictly equals the input is too severe a restriction. We can relax it slightly, requiring only that the difference between input and output be as small as possible; this relaxation leads to another class of Deep Learning methods. The above is the basic idea of Deep Learning.

| 6. Shallow Learning and Deep Learning

Shallow learning was the first wave of machine learning.

In the late 1980s, the invention of the back-propagation algorithm for artificial neural networks (also called the BP algorithm) brought hope to machine learning and set off a wave of machine learning based on statistical models, a boom that continues to this day. It was found that an artificial neural network trained with the BP algorithm can learn statistical rules from a large number of training samples and thereby make predictions about unknown events. This statistics-based approach showed superiority in many respects over earlier systems based on hand-crafted rules. The artificial neural network of this period, although also called a multilayer perceptron (Multi-Layer Perceptron), was in practice a shallow model containing only one hidden layer.
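As a concrete illustration of BP training such a one-hidden-layer perceptron, here is a minimal sketch (sizes, learning rate and iteration count are made up for illustration). XOR is the classic rule that no linear model can learn but one hidden layer can:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, but learnable with one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer

lr = 1.0
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the output error back to earlier layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))                 # predictions for the 4 inputs
```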

In the 1990s, a variety of shallow machine learning models were proposed, such as the support vector machine (SVM, Support Vector Machine), Boosting, and maximum-entropy methods (e.g. LR, Logistic Regression). The structure of these models can essentially be seen as having one hidden layer of nodes (as in SVM and Boosting) or no hidden layer (as in LR). These models achieved great success both in theoretical analysis and in application. By contrast, shallow artificial neural networks fell relatively quiet during this period, owing to the difficulty of theoretical analysis and to training methods that demanded much experience and skill.

Deep learning is the second wave of machine learning.

In 2006, Geoffrey Hinton, professor at the University of Toronto in Canada and a leading scholar in machine learning, together with his student Ruslan Salakhutdinov, published an article in Science that opened the wave of deep learning in academia and industry. The article made two main points: 1) an artificial neural network with many hidden layers has excellent feature learning capability, and the learned features give a more essential description of the data, which benefits visualization and classification; 2) the difficulty of training a deep neural network can be effectively overcome by "layer-wise pre-training", which in that article was achieved through unsupervised learning.

Most current classification and regression methods are shallow-structured algorithms. Their limitation is that with limited samples and computational units they have limited ability to represent complex functions, and their generalization is constrained on complex classification problems. By learning a deep nonlinear network structure, deep learning can approximate complex functions, characterize distributed representations of the input data, and demonstrate a powerful ability to learn the essential features of a dataset from a small number of samples. (The benefit of multiple layers is that complex functions can be represented with fewer parameters.)


The essence of deep learning is to learn more useful features by constructing machine learning models with many hidden layers and using massive training data, so as ultimately to improve the accuracy of classification or prediction. Thus "deep models" are the means, and "feature learning" is the goal. Compared with traditional shallow learning, deep learning differs in that: 1) it emphasizes the depth of the model structure, usually with 5, 6 or even more than 10 layers of hidden nodes; 2) it explicitly highlights the importance of feature learning: through layer-by-layer feature transformations, the representation of a sample in the original space is transformed into a new feature space, making classification or prediction easier. Compared with constructing features by hand-crafted rules, using big data to learn features can better capture the rich intrinsic information of the data.

| 7. Deep Learning and Neural Networks

Deep learning is a new field within machine learning research. Its motivation is to build and simulate neural networks that learn and analyze the way the human brain does; it mimics the mechanisms of the human brain to interpret data such as images, sounds and text. Deep learning is a kind of unsupervised learning.

The concept of deep learning originates from research on artificial neural networks: a multilayer perceptron with several hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations (attribute categories or features) by combining low-level features, so as to discover distributed feature representations of data.

Deep learning itself is a branch of machine learning, and can be simply understood as a development of neural networks. Two or three decades ago, neural networks were a particularly hot direction in machine learning, but they later slowly faded out, for reasons including the following:

1) they are prone to overfitting, their parameters are hard to tune, and a lot of tricks are involved;

2) training is slow, and with few layers (three or fewer) the results were no better than those of other methods.

So for more than 20 years neural networks received little attention, a period essentially dominated by SVM and boosting algorithms. But Hinton, a stubborn old man, kept going, and eventually (together with Bengio, Yann LeCun and others) put forward a practical deep learning framework.

Deep learning and traditional neural networks have much in common, and also many differences.

What they share is a similar layered structure: the system is a multilayer network consisting of an input layer, hidden layers (several of them) and an output layer; only nodes of adjacent layers are connected, and there are no connections within a layer or across non-adjacent layers; each layer can be viewed as a logistic regression model. This layered structure is fairly close to the structure of the human brain.


In order to overcome the problems of training neural networks, DL adopts a training mechanism very different from the traditional one. A traditional neural network uses back propagation: in simple terms, an iterative algorithm trains the whole network, starting from random initialization, computing the network's current output, and then changing the parameters of the earlier layers according to the difference between the current output and the label, until convergence (as a whole, this is gradient descent). Deep learning, by contrast, trains the system layer by layer. The reason is that if the back-propagation mechanism is applied to a deep network (more than about seven layers), the residual propagated back to the frontmost layers becomes too small, the so-called gradient diffusion (vanishing gradient) problem. We discuss this next.
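A quick numerical illustration of gradient diffusion (toy sizes, random weights): back-propagating a unit error through stacked sigmoid layers, the signal reaching the first layer shrinks rapidly with depth, because each layer multiplies it by σ'(z) ≤ 0.25 and by the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_at_first_layer(depth, width=32):
    """Back-propagate a unit error through `depth` sigmoid layers and
    return the size of the gradient reaching the frontmost layer."""
    x = rng.normal(size=width)
    Ws = [rng.normal(size=(width, width)) / np.sqrt(width) for _ in range(depth)]
    acts = []
    for W in Ws:                                     # forward pass
        x = 1.0 / (1.0 + np.exp(-(W @ x)))
        acts.append(x)
    delta = np.ones(width)                           # error signal at the top
    for W, a in zip(reversed(Ws), reversed(acts)):   # backward pass
        delta = W.T @ (delta * a * (1 - a))          # sigma'(z) = a*(1-a) <= 0.25
    return float(np.linalg.norm(delta))

for depth in (2, 5, 10, 20):
    print(depth, grad_norm_at_first_layer(depth))    # shrinks sharply with depth
```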

| 8. The Deep Learning Training Process

8.1 Why traditional neural network training methods fail in deep networks

The BP algorithm, the typical traditional algorithm for training multilayer networks, is in practice unsatisfactory for networks with more than a few layers. The non-convex objective function pervasive in deep structures (which involve multiple layers of nonlinear processing units) makes local minima the main source of training difficulty.

Problems with the BP algorithm:

(1) Gradients become ever sparser: the error-correction signal gets smaller and smaller from the top down;

(2) Convergence to a local minimum: especially when starting far from the optimal region (random initialization makes this likely);

(3) In general, we can only train with labeled data, but most data are unlabeled, whereas the brain can learn from unlabeled data.

8.2 The deep learning training process

If all layers are trained simultaneously, the time complexity is too high; if one layer is trained at a time, the bias propagates layer by layer, and we face the opposite problem to the supervised learning above: severe underfitting (because a deep network has so many neurons and parameters).

In 2006, Hinton proposed an effective method for building multilayer neural networks on unsupervised data. In brief, there are two steps: first, train one layer of the network at a time; second, fine-tune, so that the high-level representation r generated upward from the original x and the x' generated downward from that r are as consistent as possible. The method:

1) Build single layers of neurons layer by layer, so that each time a single-layer network is being trained.

2) When all layers are trained, use the wake-sleep algorithm for tuning.

Turn the weights between all layers except the topmost into bidirectional ones, so that the top layer remains a single-layer neural network while the other layers become graphical models. Upward weights are used for "cognition", downward weights for "generation". Then use the wake-sleep algorithm to adjust all the weights, making cognition and generation agree, which guarantees that the generative model can, with fair probability, recover the lower-level nodes from a correct top-level representation. For example, if a top-level node represents a human face, then all face images should activate this node, and the image generated downward from it should appear as a rough face image. The wake-sleep algorithm is divided into a wake phase and a sleep phase.

1) Wake phase: the cognitive process. External inputs and the upward (cognitive) weights generate an abstract representation (node states) at each layer, and gradient descent modifies the downward (generative) weights between layers. In other words: "If reality differs from what I imagined, change my generative weights so that what I imagine is exactly this reality."

2) Sleep phase: the generative process. The top-level representation (concepts learned while awake) and the downward weights generate the states of the lower layers, and the upward (cognitive) weights between layers are adjusted. In other words: "If the scene in my dream is not the corresponding concept in my mind, change my cognitive weights so that this scene, to me, is that concept."
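The two phases can be sketched for a single layer of stochastic binary units. This is a heavily simplified, illustrative toy (the real algorithm runs over a whole stack and includes bias terms; all sizes, rates and the data here are made up): the wake phase adjusts the generative weights so the "dream" matches reality, and the sleep phase adjusts the cognitive weights so recognition matches the dream.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_v, n_h, lr = 8, 4, 0.05
R = rng.normal(size=(n_v, n_h)) * 0.1     # upward "cognitive" weights
G = rng.normal(size=(n_h, n_v)) * 0.1     # downward "generative" weights
data = (rng.random(size=(200, n_v)) < 0.3).astype(float)  # toy binary data

def recon_error(R, G):
    """Mean error when recognizing the data upward, then generating it back."""
    return float(np.mean(np.abs(data - sigmoid(sigmoid(data @ R) @ G))))

before = recon_error(R, G)
for _ in range(30):
    for v in data:
        # wake phase: recognize upward, then improve generation downward.
        # "If reality is not what I imagined, change my generative weights."
        h = (rng.random(n_h) < sigmoid(v @ R)).astype(float)
        G += lr * np.outer(h, v - sigmoid(h @ G))
        # sleep phase: dream downward, then improve recognition upward.
        # "If the dream is not the concept I had, change my cognitive weights."
        h_fant = (rng.random(n_h) < 0.5).astype(float)
        v_fant = (rng.random(n_v) < sigmoid(h_fant @ G)).astype(float)
        R += lr * np.outer(v_fant, h_fant - sigmoid(v_fant @ R))

print(round(before, 3), "->", round(recon_error(R, G), 3))
```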

The deep learning training process is as follows:

1) Bottom-up unsupervised learning (that is, starting from the bottom and training layer by layer toward the top):

Train the parameters of each layer in turn using unlabeled data (labeled data can also be used). This step can be seen as an unsupervised training process and is the biggest difference from traditional neural networks (it can be seen as a feature learning process).

Concretely, first train the first layer with unlabeled data, learning its parameters (this layer can be seen as the hidden layer of a three-layer neural network trained to minimize the difference between output and input). Because of limits on model capacity and sparsity constraints, the resulting model learns the structure of the data itself and thus obtains features with more representational power than the raw input. After learning layer n-1, the output of layer n-1 is used as the input of layer n, and layer n is trained; in this way the parameters of each layer are obtained in turn.

2) Top-down supervised learning (train with labeled data, propagating error from the top down to fine-tune the network):

Based on the per-layer parameters obtained in the first step, the parameters of the whole multi-layer model are further fine-tuned; this is a supervised training step. The first step is analogous to the random initialization of a traditional neural network, but because deep learning's first step is not random, being learned from the structure of the input data, the initialization is closer to the global optimum and therefore yields better results. The effectiveness of deep learning is thus largely due to the feature-learning process of the first step.

| Nine, commonly used Deep Learning models and methods

9.1, AutoEncoder auto-encoders

One of the simplest methods in Deep Learning is to use the features learned by an artificial neural network. An artificial neural network (ANN) is itself a hierarchical system. Given a neural network, suppose we require its output to equal its input, and then train it to adjust its parameters, that is, the weights of each layer. We then naturally obtain several different representations of the input I (each layer is one representation), and these representations are features. An auto-encoder is a neural network that reproduces its input signal as faithfully as possible. To achieve this, the auto-encoder must capture the most important factors that represent the input data, much as PCA finds the main components that represent the original information.

The specific process is as follows:

1) Given unlabeled data, learn features by unsupervised learning:

[Figure: supervised training with labeled (input, target) pairs on the left; unlabeled data on the right]

In the neural networks described earlier, as in the first figure, the input samples are labeled: (input, target). We change the parameters of the preceding layers based on the difference between the current output and the target (label), until the network converges. But now we only have unlabeled data, as in the figure on the right. So where does the error come from?

[Figure: an encoder producing a code from the input, and a decoder reconstructing the input from the code]

As shown above, we feed the input into an encoder, which produces a code; this code is a representation of the input. But how do we know the code actually represents the input? We add a decoder, which produces an output from the code. If this output is very similar to the original input signal (ideally identical), we have reason to believe the code is reliable. So we adjust the parameters of the encoder and decoder to minimize the reconstruction error, and we thereby obtain the first representation of the input signal: the code. Because the data are unlabeled, the error comes from comparing the reconstruction directly with the original input.
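The encode, decode, and compare loop above can be sketched as a one-hidden-layer auto-encoder trained by gradient descent on the reconstruction error. This is a minimal NumPy illustration, not code from the original article; the toy data, layer sizes, and learning rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 200 unlabeled samples of 8-dimensional input.
X = rng.normal(size=(200, 8))

# Encoder maps 8 -> 4 dims (the "code"); decoder maps the code back to 8.
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_dec = rng.normal(scale=0.1, size=(4, 8))
lr = 0.1

loss0 = np.mean((sigmoid(X @ W_enc) @ W_dec - X) ** 2)  # before training

for _ in range(500):
    code = sigmoid(X @ W_enc)            # encode
    X_hat = code @ W_dec                 # decode
    err = X_hat - X                      # reconstruction error
    # Gradients of the (scaled) mean squared reconstruction loss.
    g_dec = code.T @ err / len(X)
    g_code = (err @ W_dec.T) * code * (1 - code)
    g_enc = X.T @ g_code / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

loss = np.mean((sigmoid(X @ W_enc) @ W_dec - X) ** 2)   # after training
print(loss < loss0)
```

The learned `code` is the first-layer feature; stacking, described next, repeats the same procedure with `code` as the new input.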


2) Use the features produced by the encoder as input to train the next layer, cascading layer by layer:

We now have the first layer's code, and the reconstruction error gives us reason to believe this code is a good expression of the original input signal, or, to stretch the point, that it carries the same information as the original signal (just expressed differently). Training the second layer is no different from training the first: we take the first layer's output code as the input signal and again minimize the reconstruction error, which gives us the second layer's parameters and the second layer's code, that is, the second representation of the input. The remaining layers are handled in the same way (when training a given layer, the parameters of the preceding layers are fixed, and their decoders are no longer needed and can be discarded).


3) supervised fine-tuning:

By the method above we obtain many layers. (As for how many layers, that is, how much depth, is needed: there is as yet no scientific way to decide this, and it must be tested for each problem.) Each layer is a different representation of the original input, and of course we would like each to be more abstract than the last, just as in the human visual system.

At this point, the AutoEncoder cannot yet be used to classify data, because it has not learned how to link an input to a class; it has only learned how to reconstruct or reproduce its input. In other words, it has only learned a feature that represents the input well, one that can best reproduce the original input signal. To perform classification, we can add a classifier (such as logistic regression, an SVM, etc.) on top of the AutoEncoder's final coding layer, and then train it with the standard supervised method for multi-layer neural networks (gradient descent).

In other words, we feed the feature code of the last layer into the final classifier and fine-tune it by supervised learning using the sample labels. This comes in two forms. One adjusts only the classifier (the black part in the figure):

[Figure: training only the top classifier on the frozen learned features]

The other uses the sample labels to fine-tune the whole system. (If you have enough data, this is the best option: end-to-end learning.)

[Figure: fine-tuning the whole system with the sample labels]

Once supervised training is complete, the network can be used for classification. The top layer of the network acts as a linear classifier, and we can then replace it with a better-performing classifier. Studies have found that adding automatically learned features to the original features can greatly improve accuracy, even outperforming the current best classification algorithms!
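A minimal sketch of the first option: freeze the learned features and train only a classifier on top. Here the "pretrained" encoder is stood in for by a fixed random projection, and the classifier is plain logistic regression; the two-class toy data and all sizes are illustrative assumptions, not from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Two separable classes in 6-d input space (toy stand-in for real data).
X0 = rng.normal(loc=-1.0, size=(100, 6))
X1 = rng.normal(loc=+1.0, size=(100, 6))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Pretend W_enc was learned by unsupervised pre-training (here: random, fixed).
W_enc = rng.normal(scale=0.5, size=(6, 3))
feats = sigmoid(X @ W_enc)            # frozen top-layer code

# Train only the classifier (the "black part"): logistic regression.
w, b, lr = np.zeros(3), 0.0, 0.5
for _ in range(300):
    p = sigmoid(feats @ w + b)
    w -= lr * feats.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

acc = np.mean((sigmoid(feats @ w + b) > 0.5) == y)
print(acc)
```

Fine-tuning the whole system would additionally backpropagate the label error into `W_enc` instead of keeping it frozen.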

AutoEncoder has some variants; here are two of them in brief:

Sparse AutoEncoder (sparse auto-encoder):

Of course, we can keep adding constraints to obtain new Deep Learning methods. For example, if we add an L1 regularity constraint to the AutoEncoder (L1 mainly constrains most nodes in each layer to be 0, with only a few nonzero, which is the source of the name Sparse), we get a Sparse AutoEncoder.

[Figure: a sparse auto-encoder, in which the code is constrained to be sparse]

The figure, in effect, constrains each code produced to be as sparse as possible. Sparse expressions are often more effective than other expressions (the human brain seems to work the same way: a given input stimulates only certain neurons, while most of the other neurons are suppressed).

Denoising AutoEncoder (denoising auto-encoder):

The denoising auto-encoder (DA) adds noise to the training data, so the auto-encoder must learn to remove the noise and recover the uncorrupted input. This forces the encoder to learn a more robust expression of the input signal, which is why its generalization is stronger than that of an ordinary encoder. A DA can be trained with the gradient descent algorithm.
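A sketch of the idea under the same toy setup: the input is corrupted with masking noise before encoding, but the reconstruction is always compared against the clean input. The corruption rate and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.normal(size=(200, 8))          # clean inputs
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_dec = rng.normal(scale=0.1, size=(4, 8))
lr = 0.1

eval_mask = rng.random(X.shape) > 0.3  # fixed corruption for evaluation

def eval_loss():
    X_hat = sigmoid((X * eval_mask) @ W_enc) @ W_dec
    return np.mean((X_hat - X) ** 2)

loss0 = eval_loss()
for _ in range(500):
    mask = rng.random(X.shape) > 0.3   # masking noise: drop ~30% of the inputs
    Xn = X * mask
    code = sigmoid(Xn @ W_enc)
    err = code @ W_dec - X             # reconstruct the CLEAN input
    g_dec = code.T @ err / len(X)
    g_code = (err @ W_dec.T) * code * (1 - code)
    W_dec -= lr * g_dec
    W_enc -= lr * Xn.T @ g_code / len(X)

loss1 = eval_loss()
print(loss1 < loss0)
```

Because the target is the clean signal, the encoder cannot simply copy its input; it must learn features that survive the corruption.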


9.2, Sparse Coding sparse coding

If we relax the restriction that the output must equal the input, and use a basic concept of linear algebra, namely O = a1*φ1 + a2*φ2 + … + an*φn, where the φi are basis vectors and the ai are coefficients, we can obtain the optimization problem:

Min | I – O|, where I is the input and O is the output.

By solving this optimization problem we obtain the coefficients ai and the basis vectors φi, which together form another, approximate expression of the input.


Therefore, they can be used to express the input I, and the process is learned automatically. If we add an L1 regularity constraint to the above equation, we get:

Min | I – O| + u*(|a1| + |a2| + … + |an|)

This method is known as Sparse Coding. In plain terms, it expresses a signal as a linear combination of basis vectors, requiring that only a few bases be needed. "Sparsity" is defined as: only very few nonzero elements, or only very few elements far greater than zero. Requiring the coefficients ai to be sparse means: for a given set of input vectors, we want as few coefficients as possible to be greater than zero. There is a reason for choosing sparse components to represent our input data: the vast majority of sensory data, such as images, can be expressed as the superposition of a small number of basic elements, which in an image might be edges or lines. This also parallels the behavior of the primary visual cortex (the human brain has a large number of neurons, but a given image, or edge, activates only a few of them, while the others are suppressed).

The sparse coding algorithm is an unsupervised learning method that looks for a set of "over-complete" basis vectors to represent the sample data more efficiently. Although techniques such as principal component analysis (PCA) let us easily find a "complete" set of basis vectors, here we want to find an "over-complete" set, one in which the number of basis vectors is greater than the dimension of the input vectors. The benefit of an over-complete basis is that it can more effectively capture the structures and patterns hidden inside the input data. However, with an over-complete basis, the coefficients ai of an input vector are no longer uniquely determined. Therefore, in sparse coding algorithms we add the criterion of "sparsity" to resolve the degeneracy introduced by over-completeness.

[Figure: an image patch expressed as a sparse linear combination of 64 basis patches]

For example, to do Edge Detector feature extraction at the bottom level, small patches are randomly selected from Natural Images, and from them a "matrix" of basis patches that can describe them is generated: the 8*8=64 basis patches on the right. Then, given a test patch, we can obtain it as a linear combination of these bases according to the equation above, with a sparse coefficient matrix a: in the figure, a has 64 dimensions, of which only 3 are nonzero, hence "sparse".

Here we may wonder: why make the bottom level an Edge Detector? And what is at the top? A brief explanation everyone will understand: the reason it is an Edge Detector is that edges in different directions can describe the whole image, so differently oriented edges are naturally the basis of images … a combination of these bases gives the layer above, and combinations of that layer's bases give the next layer up … (this is what we said in part four above).

Sparse coding is divided into two parts:

1) Training phase: given a series of sample images [x1, x2, …], we learn a set of bases [φ1, φ2, …], which is the dictionary.

Sparse coding is a variant of the k-means algorithm, and its training process is very similar (the idea of the EM algorithm: if the objective function to be optimized contains two variables, such as L(W, B), then we can fix W and adjust B to minimize L, then fix B and adjust W to minimize L, iterating back and forth until L reaches its minimum).

The training process is one of repeated iteration: as described above, we alternately adjust a and φ to minimize the objective function below.

Min Σ | x – (a1*φ1 + a2*φ2 + … + ak*φk) |² + u*(|a1| + |a2| + … + |ak|), summed over the sample images and minimized over both the coefficients a and the bases φ.

Each iteration consists of two steps:

a) Fix the dictionary φ[k], then adjust a[k] to minimize the objective function above (that is, solve a LASSO problem).

b) Then fix a[k] and adjust φ[k] to minimize the objective function above (that is, solve a convex QP problem).

Iterate until convergence. This yields a set of bases that express the series of x well, which is the dictionary.
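The two alternating steps can be sketched in NumPy, with the LASSO step a) solved by a few iterations of proximal gradient descent (ISTA) and the dictionary step b) by least squares followed by renormalization of the bases. The toy data, dictionary size, and penalty weight are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_threshold(z, t):
    # Proximal operator of the L1 norm (the LASSO shrinkage step).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, Phi, X, lam, steps=20):
    # a) Fix the dictionary Phi, solve the LASSO for the codes A.
    L = np.linalg.norm(Phi @ Phi.T, 2)   # Lipschitz constant of the quadratic part
    for _ in range(steps):
        grad = (A @ Phi - X) @ Phi.T
        A = soft_threshold(A - grad / L, lam / L)
    return A

# Toy signals: 100 samples, 8-dimensional; 16 bases -> over-complete dictionary.
X = rng.normal(size=(100, 8))
k, lam = 16, 0.1
Phi = rng.normal(size=(k, 8))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
A = np.zeros((100, k))

for _ in range(30):
    A = ista(A, Phi, X, lam)
    # b) Fix A, update Phi by least squares, then renormalize each basis vector.
    Phi = np.linalg.lstsq(A, X, rcond=None)[0]
    norms = np.linalg.norm(Phi, axis=1, keepdims=True)
    Phi /= np.maximum(norms, 1e-8)

A = ista(A, Phi, X, lam)                 # refresh codes against the final bases
recon = np.mean((A @ Phi - X) ** 2)
sparsity = np.mean(A == 0.0)
print(round(recon, 3), round(sparsity, 2))
```

Renormalizing the bases matters: without it, the L1 penalty could be cheated by shrinking a while growing φ by the same factor.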

2) Coding phase: given a new image x, use the dictionary obtained above and solve a LASSO problem to obtain the sparse vector a. This sparse vector is the sparse expression of the input vector x.

Min | x – (a1*φ1 + a2*φ2 + … + ak*φk) |² + u*(|a1| + |a2| + … + |ak|), minimized over a only, with the dictionary φ fixed.

For example:

[Figure: an example of sparse coding an image patch]

9.3, Restricted Boltzmann Machine (RBM) restricted Boltzmann machine

Assume there is a bipartite graph in which there are no links between the nodes within a layer. One layer is the visible layer, that is, the input data layer (v), and the other is the hidden layer (h). If we assume that all nodes are random binary variables (taking only the values 0 or 1), and that the full joint probability distribution p(v, h) satisfies the Boltzmann distribution, we call this model a Restricted Boltzmann Machine (RBM).

[Figure: an RBM, with visible layer v and hidden layer h and no within-layer connections]

Let us see why this is a Deep Learning method. First, because the model is a bipartite graph, all hidden nodes are conditionally independent given v (there are no connections among them), that is, p(h|v) = p(h1|v)…p(hn|v). Similarly, all visible nodes are conditionally independent given h. And since all v and h satisfy the Boltzmann distribution, when v is given as input we can obtain the hidden layer h through p(h|v), and having obtained h we can in turn obtain the visible layer through p(v|h). By adjusting the parameters so that the visible layer v1 obtained from the hidden layer is identical (or as close as possible) to the original visible layer v, the hidden layer becomes another expression of the visible layer, so the hidden layer can serve as features of the visible layer's input data. So it is a Deep Learning method.


How is it trained? In other words, how are the weights between the visible nodes and the hidden nodes determined? For that we need a little mathematical analysis, that is, the model itself.

[Figure: the RBM model]

For a joint configuration (v, h), the energy can be expressed as:

E(v, h) = – Σi Σj vi*Wij*hj – Σi ai*vi – Σj bj*hj (writing a for the visible biases and b for the hidden biases)

And the joint probability distribution of a configuration is determined by the Boltzmann distribution (and the configuration's energy):

P(v, h) = exp(–E(v, h)) / Z, where the partition function Z = Σv,h exp(–E(v, h))

Because the hidden nodes are conditionally independent of one another (there are no connections among them), we have:

P(h|v) = Πj P(hj|v), and similarly P(v|h) = Πi P(vi|h)

Then we can easily obtain (by factorizing the expression above) the probability that the j-th hidden node is 1 or 0, given the visible layer v:

P(hj = 1 | v) = σ(bj + Σi Wij*vi), where σ(x) = 1/(1 + exp(–x))

Similarly, the probability that the i-th visible node is 1 or 0, given the hidden layer h, is just as easy to obtain:

P(vi = 1 | h) = σ(ai + Σj Wij*hj)

Given a set of independent and identically distributed samples D = {v(1), v(2), …, v(n)}, we need to learn the parameters θ = {W, a, b}.

We maximize the following log-likelihood function (maximum likelihood estimation: for a probabilistic model, we choose the parameters that maximize the probability of observing the current samples):

L(θ) = Σt log P(v(t); θ)

That is, by maximizing the log-likelihood function L, we obtain the parameters W (and the biases) at which L is maximal.

∂log P(v)/∂Wij = ⟨vi*hj⟩_data – ⟨vi*hj⟩_model
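The model expectation in this gradient is intractable in general, and in practice it is approximated by Hinton's contrastive divergence (CD-1): a single step of Gibbs sampling started from the data. A minimal NumPy sketch on toy binary patterns; the sizes, learning rate, and epoch count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary data: 6 visible units, two repeating patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
X = patterns[rng.integers(0, 2, size=200)]

n_v, n_h, lr = 6, 3, 0.1
W = rng.normal(scale=0.01, size=(n_v, n_h))
a = np.zeros(n_v)                  # visible biases
b = np.zeros(n_h)                  # hidden biases

for _ in range(1000):
    v0 = X
    # Positive phase: p(h=1 | v) is a product of independent sigmoids.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase (CD-1): one Gibbs step back to v, then to h.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Contrastive-divergence update: data correlations minus sampled ones.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(X)
    a += lr * np.mean(v0 - v1, axis=0)
    b += lr * np.mean(ph0 - ph1, axis=0)

# One deterministic up-down pass to measure reconstruction quality.
recon_err = np.mean((X - sigmoid(sigmoid(X @ W + b) @ W.T + a)) ** 2)
print(round(recon_err, 3))
```

An untrained RBM reconstructs everything as 0.5 (squared error 0.25 on binary data); after training, the reconstruction error should fall well below that baseline.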

If we increase the number of hidden layers, we obtain a Deep Boltzmann Machine (DBM). If, in the part near the visible layer, we use a Bayesian belief network (that is, a directed graphical model, still with the restriction that there are no links between nodes within a layer), and in the part farthest from the visible layer we use Restricted Boltzmann Machines, we obtain a Deep Belief Net (DBN).

Daniel's deep learning notes, Deep Learning crash course tutorial

9.4, Deep Belief Networks (DBNs)

DBNs are a probabilistic generative model. In contrast with the traditional discriminative neural network models, a generative model establishes a joint distribution between the observation data and the labels, evaluating both P(Observation|Label) and P(Label|Observation), whereas a discriminative model evaluates only the latter, P(Label|Observation). When the traditional BP algorithm was applied to deep neural networks, the following problems were encountered:

(1) the training sample set needs to be labeled;

(2) the learning process is slow;

(3) inappropriate parameter choices lead to convergence to local optima.

[Figure: a DBN composed of stacked RBMs]

DBNs consist of layers of Restricted Boltzmann Machines; a typical network structure is shown in Figure 3. These networks are "restricted" to a visible layer and a hidden layer, with connections between the layers but no connections between the units within a layer. The hidden units are trained to capture the higher-order correlations exhibited in the visible layer.

First, setting aside the top two layers, which form an associative memory, the connections of a DBN are determined by top-down generative weights. The RBMs act like building blocks: compared with traditional, deeply layered sigmoid belief networks, they make it easy to learn the connection weights.

At the start, an unsupervised greedy layer-by-layer method is used to pre-train and obtain the generative model's weights; this unsupervised greedy layer-by-layer method was proved effective by Hinton and is called contrastive divergence.

In this training phase, a vector v is produced in the visible layer, and its values are passed to the hidden layer. In turn, the visible layer's input is randomly reconstructed in an attempt to recover the original input signal. Finally, these new visible neuron activations are passed forward again to reconstruct the hidden-layer activations, giving h. (In training, the visible vector is first mapped to the hidden units; then the visible units are rebuilt from the hidden units; then these new visible units are mapped to hidden units again, giving new hidden units. Performing these steps repeatedly is called Gibbs sampling.) These back-and-forth steps are the familiar Gibbs sampling, and the difference in correlation between the hidden activations and the visible input is the basis for the weight update.

Training time is significantly reduced, because only a single step is needed to approximate maximum likelihood learning. Each layer added to the network improves the log-probability of the training data, which we can think of as getting closer and closer to the true expression of the energy. This significant extension, together with the use of unlabeled data, is a decisive factor in any application of deep learning.
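The greedy layer-by-layer scheme can be sketched by training one RBM with CD-1 per layer and feeding each layer's hidden activations to the next. This is a minimal NumPy illustration with arbitrary sizes and toy random binary data, not the article's own code:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=500, lr=0.1):
    """Train one RBM with CD-1 and return (weights, hidden biases)."""
    n_vis = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_vis, n_hidden))
    a, b = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + a)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * np.mean(data - v1, axis=0)
        b += lr * np.mean(ph0 - ph1, axis=0)
    return W, b

# Greedy layer-wise stacking: each RBM's hidden activations feed the next RBM.
X = (rng.random((200, 12)) < 0.5).astype(float)
layer_sizes = [8, 4]
reps, params = X, []
for n_h in layer_sizes:
    W, b = train_rbm(reps, n_h)
    params.append((W, b))
    reps = sigmoid(reps @ W + b)      # deterministic "up-pass" to the next layer

print(reps.shape)
```

The final `reps` is the top-level representation; in a full DBN it would then be fine-tuned with labels as described below.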

[Figure: a DBN whose top two layers form an associative memory]

In the top two layers, the weights are connected together, so the output of the lower layers provides a reference clue or association to the top layer, which the top layer then links to its memory contents. And what we care about most, in the end, is discriminative performance, for example on classification tasks.

After pre-training, the DBN can use labeled data and the BP algorithm to adjust its discriminative performance. Here, a set of labels is attached to the top layer (promoting the associative memory), and through bottom-up, learned recognition weights the network obtains its classification weights. This training performs better than networks trained by the simple BP algorithm alone. This can be explained very intuitively: the BP algorithm for DBNs only needs a local search of the weight-parameter space, so compared with a feedforward neural network its training is faster and its convergence takes less time.

The flexibility of DBNs makes them easier to extend. One extension is Convolutional Deep Belief Networks (CDBNs). DBNs do not take account of the 2-dimensional structure of images, because the input is simply a one-dimensional vectorization of the image matrix. CDBNs address this issue: they exploit the spatial relationships between neighboring pixels, achieve invariance of the generative model under transformations through a convolutional RBM model, and can readily be scaled to high-dimensional images. DBNs also do not explicitly handle learning the temporal relationships among observed variables, although there is research in this area, such as stacked temporal RBMs; this generalization, dubbed temporal convolution machines, applied to sequence learning and to speech signal processing, brings an exciting direction for future research.

Current research related to DBNs includes stacked auto-encoders, which replace the traditional RBMs in DBNs with stacked auto-encoders. This makes it possible to train deep multi-layer network architectures by the same rules, but without the strict requirements on each layer's parameterization. Unlike DBNs, auto-encoders use a discriminative model, so this structure finds it harder to sample the input space, which makes it more difficult for the network to capture its internal representation. However, denoising auto-encoders avoid this problem well and perform better than traditional DBNs: they improve generalization by adding random corruption during training and stacking. The process of training a single denoising auto-encoder parallels that of training an RBM as a generative model.

| X, summary and prospects

1) Deep learning summary

Deep learning is about algorithms that learn multi-layer (complicated) expressions of the data's underlying (implied) distribution. In other words, deep learning algorithms automatically extract the low-level and high-level features needed for classification. A high-level feature here means one that can hierarchically depend on other features; for example, in machine vision, a deep learning algorithm learns a low-level expression of the raw image, such as an edge detector or wavelet filter, then builds expressions on top of these low-level expressions, such as linear or nonlinear combinations of them, then repeats this process, finally obtaining a high-level expression.

Deep learning can obtain features that better represent the data, and because the model has many layers and many parameters, its capacity is sufficient to represent large-scale data. So for problems whose features are not obvious (ones that need manual design and often lack intuitive physical meaning), such as images and speech, deep learning can achieve better results on large-scale training data. Furthermore, from the pattern-recognition viewpoint of "features plus classifier", the deep learning framework combines the features and the classifier into one framework, learning the features from data and thereby reducing the huge workload of manually designing features (which is currently where industrial engineers put the most effort). Therefore, not only are the results better, it is also much more convenient to use. It is a framework well worth paying attention to, and everyone doing ML should take note of it.

Of course, deep learning itself is not perfect, nor is it the solution to every ML problem in the world; it should not be magnified into an omnipotent method.

2) Deep learning in the future

There is still a great deal of work to study in deep learning. Much current attention goes to borrowing ideas from other fields of machine learning that deep learning can exploit, especially dimensionality-reduction methods. One example is sparse coding, which through the theory of compressed sensing reduces the dimensionality of high-dimensional data, so that very few elements of a high-dimensional vector can accurately represent the original signal. Another example is the popular semi-supervised learning, which projects high-dimensional data onto a lower-dimensional space by measuring the similarity between training samples. Another direction is evolutionary programming approaches (genetic programming), which can adaptively learn concepts and change the core architecture with minimal conceptual engineering effort.

Deep learning still has many core problems to be solved:

(1) For a specific framework, for how many dimensions of input can it perform well (if an image, perhaps millions of dimensions)?

(2) What kinds of structure are effective for capturing short-term or long-term temporal dependencies?

(3) How can a given deep learning architecture integrate information from multiple modes of perception?

(4) What are the correct mechanisms to enhance a given deep learning framework so as to improve its robustness and its invariance to distortions and missing data?

(5) Are there other, more effective and theoretically grounded model learning algorithms?

Exploring new feature-extraction models is worthy of further study. Effective parallel training algorithms are also a direction worth studying. Mini-batch stochastic gradient algorithms are currently difficult to parallelize across multiple computers. Graphics processing units (GPUs) are often used to accelerate the learning process, but a single machine's GPU does not suit the large-scale datasets of perception and similar tasks. In applied deep learning development, how to make full use of deep learning to enhance the performance of traditional learning algorithms remains a focus of research in all areas.

This article is from the reading-technology column, which focuses on deep learning and embedded vision AI platforms; for reprints please contact the original author.

