Alex Dyakonov, Chief Research Scientist
We’ll talk about two things: a deep learning approach that solves the problem of reducing the sample size, and an even more ambitious task, creating synthetic data that preserves all the useful information of the sample.
History and terminology
Data distillation is a radical reduction of a training sample achieved by creating artificial objects (synthetic data) that aggregate the useful information stored in the data and allow machine learning algorithms to be tuned no less effectively than on the full dataset.
The term "distillation" first appeared in Hinton's work , after which the term "knowledge distillation" became established and is now widely recognized. Knowledge distillation means constructing a simple model from a complex model (or an ensemble of models). Even though the terms sound similar, data distillation is a completely different topic that solves a different problem.
There are many classic methods for selecting objects (Instance Selection), ranging from the selection of so-called etalons (prototypes) in metric algorithms to the selection of support vectors in support vector machines (SVM). But distillation is not selection, it is synthesis of objects (similar to the difference between feature selection and dimensionality reduction, except that here we are talking about objects rather than features). In the era of deep learning, it became possible to create synthetic data with a uniform optimization technique: by synthesizing objects rather than selecting existing ones one by one. We will discuss some of the research papers on this topic below. Note that data distillation is needed for:
acceleration of training/testing (preferably for a whole class of models),
reducing the amount of data for storage (replacing the classic Instance Selection),
answering a scientific question: how much information is contained in the data (how much we can “compress” it).
The idea of deep-learning-based data distillation was proposed in the research paper , in which the MNIST dataset of 60,000 images was reduced to 10, i.e. the synthetic training set contained one representative image per class! At the same time, LeNet trained on the synthetic data to approximately the same quality as on the entire sample (though much faster) and converged in a few steps. However, training a network initialized differently already degraded the quality of the solution. Fig. 1, taken from the research paper , shows the main result and the obtained synthetic data for the MNIST and CIFAR10 problems. Notably, for MNIST the synthetic data does not resemble real digits at all, although one would logically expect a synthetic representative of the class "zero" to look like a zero; see also fig. 2.
fig. 1 from 
fig. 2 from 
The basic idea of the method proposed in  is that we want the synthetic data to be such that gradient descent on it leads to the minimum of the error function, i.e. it is necessary to solve the following optimization problem:
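In the notation of the next paragraph, the one-step problem from the cited Dataset Distillation paper can be written as follows (a reconstruction from that paper; note that the learning rate η is also optimized in the original work):

```latex
\tilde{x}^{*} \;=\; \arg\min_{\tilde{x}} \;
\mathbb{E}_{\theta_0 \sim p(\theta_0)}
\bigl[\, \ell(x, \theta_1) \,\bigr],
\qquad
\theta_1 \;=\; \theta_0 \;-\; \eta \, \nabla_{\theta_0}\, \ell(\tilde{x}, \theta_0)
```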
Here θt is the vector of network parameters at the t-th iteration (at iteration zero they are drawn from the distribution p(·)), x with a tilde is the synthetic data, x (without a tilde) is the training sample, and ℓ(x, θ) is the neural network's error function on the entire training set. Note that the formula uses only one gradient step on the original neural network; we will generalize this in a bit. The main problem that arises here is that if the proposed optimization problem is itself solved by a gradient method, then we must take the gradient of a gradient (a second-order derivative).
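The "gradient of a gradient" can be made explicit on a toy problem. The sketch below (our illustration, not from the paper) distills 50 points of 1-D linear regression into a single synthetic point, so that one gradient step from a fixed θ0 reproduces the full-data fit; the outer gradient is obtained by chain-ruling through the inner gradient step.

```python
import numpy as np

# Toy 1-D linear model: prediction = theta * x, squared error.
# Hypothetical setup: distill 50 real points into ONE synthetic point
# (x_s, y_s) so that a single gradient step from theta0 fits the real data.
rng = np.random.default_rng(0)
x_real = rng.uniform(-1.0, 1.0, 50)
y_real = 3.0 * x_real + rng.normal(0.0, 0.1, 50)

eta = 0.5        # inner learning rate
theta0 = 0.0     # fixed initialization, as in the one-step formulation

def real_loss(theta):
    return np.mean((theta * x_real - y_real) ** 2)

x_s, y_s, lr = 0.5, 0.0, 0.1
for _ in range(300):
    # Inner step: theta1 = theta0 - eta * d l({(x_s, y_s)}, theta0) / d theta
    inner_grad = 2.0 * x_s * (theta0 * x_s - y_s)
    theta1 = theta0 - eta * inner_grad
    # Outer gradient = "gradient of a gradient": differentiate the real-data
    # loss at theta1 THROUGH the inner update (chain rule).
    dL_dtheta1 = np.mean(2.0 * x_real * (theta1 * x_real - y_real))
    dtheta1_dxs = -eta * 2.0 * (2.0 * theta0 * x_s - y_s)
    dtheta1_dys = eta * 2.0 * x_s
    x_s -= lr * dL_dtheta1 * dtheta1_dxs
    y_s -= lr * dL_dtheta1 * dtheta1_dys
```

After this outer loop, a single inner step from θ0 on the distilled point lands close to the least-squares solution (slope ≈ 3), even though the real data never entered the inner update directly.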
More generally, we stop hoping to reach the minimum in a single gradient step and use several steps instead. Moreover, it is possible to sample several sets of initial parameters from the distribution p(·), so that the synthetic data is suitable for several neural networks with different initializations. The general algorithm for finding synthetic data is shown in fig. 4. This generalization yields synthetic data that is more similar to the original data.
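The generalized scheme (several inner steps, outer objective averaged over several initializations from p(·)) can be sketched on a toy 1-D linear regression problem. All names and values below are illustrative, not from the paper; to avoid writing out the second-order terms, the outer gradient is approximated by central finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
x_real = rng.uniform(-1.0, 1.0, 50)
y_real = 3.0 * x_real + rng.normal(0.0, 0.1, 50)

def loss(theta, x, y):
    return np.mean((theta * x - y) ** 2)

def train_on_synthetic(theta0, syn, steps=5, eta=0.3):
    # Several inner gradient steps instead of one.
    xs, ys = syn[0::2], syn[1::2]               # two synthetic (x, y) points
    theta = theta0
    for _ in range(steps):
        theta -= eta * np.mean(2.0 * xs * (theta * xs - ys))
    return theta

inits = np.array([-1.0, 0.0, 1.0, 2.0])          # fixed draws from p(.)

def outer(syn):
    # Average real-data loss of models trained on the synthetic set,
    # over several initializations.
    return np.mean([loss(train_on_synthetic(t0, syn), x_real, y_real)
                    for t0 in inits])

syn = np.array([0.8, 0.5, -0.6, -0.3])           # [x1, y1, x2, y2]
eps, lr = 1e-4, 0.05
for _ in range(1000):
    g = np.zeros_like(syn)
    for i in range(syn.size):
        d = np.zeros_like(syn); d[i] = eps
        g[i] = (outer(syn + d) - outer(syn - d)) / (2.0 * eps)
    syn -= lr * g
```

The resulting two synthetic points train the toy model to a good fit even from initializations that were not in `inits`, which is exactly the point of averaging over p(·).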
The research paper  gave impetus to a new direction of research. Research paper , for instance, showed how to improve data compression during distillation by using "fuzzy" (soft) labels, so that in the MNIST problem a representative of the artificial data could belong to several classes at once with different probabilities. Not only the data but also the labels were distilled. As a result, by training on a sample of only 5 distilled images, one can reach an accuracy of about 92% (worse, however, than the 96% obtained in the previous paper). Naturally, the more synthetic data, the higher the resulting quality (fig. 6). Moreover, in paper , data distillation was applied to texts (albeit somewhat artificially, i.e. by truncating or padding the texts to a fixed length).
When the author came across the research paper , it seemed greatly underestimated. Even though its practical importance is close to zero:
1) a "regular" and very logical method of data aggregation, tailored "for the model" was proposed,
2) many problems were introduced (e.g. obtaining synthetic data similar to the original),
3) it was interesting to see how it all works for tabular data,
4) there was a purely engineering challenge to create a "more efficient synthesis" of artificial data.
Dmitry Medvedev, a Master's student at Moscow State University, became interested in data distillation and conducted several experiments. Dmitry and I haven't yet completed the other experiments, because the described two-level optimization runs very slowly and we don't have Google's computing power. That said, we decided to see how the method works on tabular data. Below are images for the classic "two moons" model problem and several simple neural networks: one-layer, 2-layer, and 4-layer. Note that even for such a simple task and such simple models, the total distillation time reached 4 hours.
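For readers who want to reproduce the setup, the "two moons" toy dataset can be generated along the following lines (a sketch of the classic construction of two interleaving half-circles; the function name and parameters are ours):

```python
import numpy as np

def make_two_moons(n_per_class=200, noise=0.1, seed=0):
    """Two interleaving half-circles: the classic toy binary problem."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])              # class 0
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])  # class 1
    X = np.vstack([upper, lower])
    X += rng.normal(0.0, noise, X.shape)                          # jitter
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y
```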
We investigated how the test quality depends on the number of epochs used when solving the minimization problem that constructs the synthetic data (note: these are not epochs of neural network training), on the number of gradient steps in that problem, and on the number of neural networks (sets of initial parameters) for which the data is synthesized (fig. 7).
(from left to right) test quality vs. the number of epochs, the number of steps, and the number of distillation networks
Fig. 8 shows the synthetic sample obtained at different steps of the method for different architectures. The problem that arose for data of other kinds persists here: synthetic objects don't look like the sample objects. We have yet to find a solution to this problem.
Fig. 9 shows how models that are tuned on synthetic data work.
Separating surfaces of networks of varying complexity, trained on distilled data
We also looked at how well synthetic data built for one architecture suits other neural network architectures (fig. 10) and proposed a method for distilling data for several architectures at once. We also demonstrated that a network trained on distilled data can achieve better quality than a network trained on the entire sample.
The quality of models trained on different data: from the original sample to data distilled for a specific architecture
Our research on distillation continues. For now, the research paper  is ready, and its edited version was presented at the recent AIST-2020 conference. The code  has also been posted. Meanwhile, several other research groups are doing similar work.
 Hinton, G., Vinyals, O., Dean, J. "Distilling the Knowledge in a Neural Network" // NIPS (2015)
 Wang, T., Zhu, J., Torralba, A., Efros, A. "Dataset Distillation" (2018)
 Sucholutsky, I., Schonlau, M. "Soft-Label Dataset Distillation and Text Dataset Distillation"
 Medvedev, D., D'yakonov, A. "New Properties of the Data Distillation Method When Working With Tabular Data" (2020)