Knowledge distillation

In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have more knowledge capacity than small models, this capacity might not be fully utilized. A large model is just as computationally expensive to evaluate even when it uses little of its knowledge capacity. Knowledge distillation aims to transfer knowledge from a large model to a smaller one without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).

There is also a less common technique called Reverse Knowledge Distillation, where knowledge is transferred from a smaller model to a larger one.

Model distillation is not to be confused with model compression, which describes methods to decrease the size of a large model itself, without training a new model. Model compression generally preserves the architecture and the nominal parameter count of the model, while decreasing the bits-per-parameter.

Knowledge distillation has been successfully used in several applications of machine learning such as object detection, acoustic models, and natural language processing. It has also been introduced to graph neural networks applicable to non-grid data.

Transferring knowledge from a large model to a small one requires teaching the small model without loss of validity. If both models are trained on the same data, the smaller model may have insufficient capacity to learn a concise knowledge representation compared to the large model. However, some information about a concise knowledge representation is encoded in the pseudolikelihoods that the large model assigns to its outputs: when a model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables. The distribution of values among the outputs for a record provides information on how the large model represents knowledge. Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling that knowledge into the smaller model by training it to learn the soft output of the large model.

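Consider a large model, viewed as a function of the vector variable $\mathbf {x}$ and trained for a specific classification task. The final layer of a classification network is typically a softmax of the form

$$y_{i}(\mathbf {x} |t)={\frac {e^{z_{i}(\mathbf {x} )/t}}{\sum _{j}e^{z_{j}(\mathbf {x} )/t}}}$$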

where $t$ is the temperature, a parameter which is set to 1 for a standard softmax. The softmax operator converts the logit values $z_{i}(\mathbf {x} )$ to pseudo-probabilities: higher temperature values generate softer distributions of pseudo-probabilities among the output classes. Knowledge distillation consists of training a smaller network, called the distilled model, on a data set called the transfer set (which is different from the data set used to train the large model). The loss function is the cross-entropy between the output of the distilled model $\mathbf {y} (\mathbf {x} |t)$ and the output of the large model ${\hat {\mathbf {y} }}(\mathbf {x} |t)$ on the same record (or the average of the individual outputs, if the large model is an ensemble), using a high value of softmax temperature $t$ for both models:

$$E(\mathbf {x} |t)=-\sum _{i}{\hat {y}}_{i}(\mathbf {x} |t)\log y_{i}(\mathbf {x} |t)$$
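As a minimal sketch of the loss described above, the following pure-Python example computes a temperature-scaled softmax and the cross-entropy between the large model's soft output and the distilled model's soft output for a single record. The function names `softmax_t` and `distillation_loss`, the example logits, and the temperature value are illustrative assumptions, not part of any particular library.

```python
import math


def softmax_t(logits, t=1.0):
    """Temperature-scaled softmax: exp(z_i / t) / sum_j exp(z_j / t)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, t=2.0):
    """Cross-entropy between teacher and student soft outputs at temperature t."""
    y_hat = softmax_t(teacher_logits, t)  # soft targets from the large model
    y = softmax_t(student_logits, t)      # soft output of the distilled model
    return -sum(p * math.log(q) for p, q in zip(y_hat, y))


# Hypothetical logits for one record with three classes.
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [3.0, 1.0, 0.0]

loss = distillation_loss(student_logits, teacher_logits, t=2.0)
best_case = distillation_loss(teacher_logits, teacher_logits, t=2.0)
print(f"loss = {loss:.4f}, lower bound (student matches teacher) = {best_case:.4f}")
```

The loss is smallest when the distilled model reproduces the large model's soft output exactly, in which case it equals the entropy of that output. In practice the loss would be averaged over the records of the transfer set and minimized with a gradient-based optimizer.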

In this context, a high temperature increases the entropy of the output, giving the distilled model more information to learn from than hard targets would, and at the same time reducing the variance of the gradient between different records, which allows a higher learning rate.
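The effect of temperature on entropy can be checked numerically. The sketch below, using hypothetical logits and helper names chosen for illustration, prints the softmax output and its entropy at several temperatures; the entropy grows toward its maximum, $\log n$ for $n$ classes, as $t$ increases.

```python
import math


def softmax_t(logits, t=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [5.0, 2.0, -1.0]  # hypothetical logits for one record
temperatures = [1.0, 2.0, 5.0, 20.0]
entropies = [entropy(softmax_t(logits, t)) for t in temperatures]

for t, h in zip(temperatures, entropies):
    rounded = [round(p, 3) for p in softmax_t(logits, t)]
    print(f"t = {t:5.1f}  probs = {rounded}  entropy = {h:.3f}")
```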
