When scientists build large language models (LLMs), they aim to maximize performance within a set computational and financial budget. Since training a model can run into the millions of dollars, developers need to be judicious with cost-impacting decisions about, for instance, the model architecture, optimizers, and training datasets before committing to a model. To anticipate the quality and accuracy of a large model's predictions, practitioners often turn to scaling laws: using smaller, cheaper models to try to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to construct a scaling law.
New work by MIT and MIT-IBM Watson AI Lab researchers addresses this by amassing and releasing a collection of hundreds of models and their training and performance metrics, covering roughly a thousand scaling laws. From this, the team developed a meta-analysis and guide for how to select small models and estimate scaling laws for different LLM model families, so that the budget is optimally applied toward generating reliable performance predictions.
“The notion that you might want to try to build mathematical models of the training process is a couple of years old, but I think what was new here is that most of the work that people had been doing before is saying, ‘Can we say something post-hoc about what happened when we trained all of these models, so that when we’re trying to figure out how to train a new large-scale model, we can make the best decisions about how to use our compute budget?’” says Jacob Andreas, associate professor in the Department of Electrical Engineering and Computer Science and principal investigator with the MIT-IBM Watson AI Lab.
The research was recently presented at the International Conference on Machine Learning by Andreas, along with MIT-IBM Watson AI Lab and IBM Research scientists Leshem Choshen and Yang Zhang.
Extrapolating performance
Regardless of how you slice it, LLM development is an expensive undertaking: from decision-making regarding the numbers of parameters and tokens, data selection and size, and training techniques to determining output accuracy and tuning to the target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model's loss to the performance of smaller, cheaper models from the same family, avoiding the need to fully train every candidate. The main differences between the smaller models are their number of parameters and token training size. According to Choshen, elucidating scaling laws not only enables better pre-training decisions, but also democratizes the field by enabling researchers without vast resources to understand and build effective scaling laws.
The functional form of scaling laws is relatively simple, incorporating components from the small models that capture the number of parameters and their scaling effect, the number of training tokens and their scaling effect, and the baseline performance for the model family of interest. Together, these help researchers estimate a target large model's performance loss; the smaller the loss, the better the target model's outputs are likely to be.
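As a rough sketch of what fitting such a law looks like in practice, the snippet below uses the commonly cited Chinchilla-style parametric form, where N is the parameter count, D the number of training tokens, and E, A, B, alpha, and beta are constants fit to small-model observations. The numbers and the exact functional form are illustrative assumptions, not the paper's own code or data:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, B, alpha, beta):
    """Chinchilla-style loss formula: an irreducible term E plus
    parameter-count and token-count terms with scaling exponents."""
    N, D = x  # N = model parameters, D = training tokens
    return E + A * N**(-alpha) + B * D**(-beta)

# Hypothetical observations from a few small models in one family:
N = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9])   # parameters
D = np.array([4e9, 8e9, 16e9, 40e9, 60e9])          # tokens seen
loss = np.array([3.40, 3.12, 2.89, 2.66, 2.59])     # observed loss

# Fit the five constants to the small-model observations.
params, _ = curve_fit(scaling_law, (N, D), loss,
                      p0=[1.5, 400.0, 1000.0, 0.3, 0.3], maxfev=20000)

# Extrapolate to a much larger target model (say, 7B parameters, 300B tokens).
predicted_loss = scaling_law((7e9, 300e9), *params)
print(f"Predicted target-model loss: {predicted_loss:.2f}")
```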
These laws allow research teams to weigh trade-offs efficiently and to test how best to allocate limited resources. They are particularly useful for evaluating the scaling of a certain variable, such as the number of tokens, and for A/B testing different pre-training setups.
In general, scaling laws aren't new; however, in the field of artificial intelligence, they emerged as models grew and costs skyrocketed. “It's like, at some point in the field, scaling laws appeared,” says Choshen. “They started getting attention, but no one really tested how good they are and what you need to do to make a good scaling law.” Further, scaling laws were something of a black box, too. “Whenever people have created scaling laws in the past, it has always been just one model, or one model family, and one dataset, and one developer,” says Andreas. “There hadn't really been a lot of systematic meta-analysis, as everybody was individually training their own scaling laws. So, [we wanted to know,] are there high-level trends that you see across those things?”
Building better
To examine this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMo, LLaMA, Bloom, T5-Pile, ModuleFormer mixture-of-experts, GPT, and other families. These included 485 unique, pre-trained models, and, where available, data on their training checkpoints, computational cost (FLOPs), training epochs, and seeds, along with 1.9 million performance metrics of losses and downstream tasks. The models differed in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affected the predictive power of scaling laws for target models. They used measurements of absolute relative error (ARE); this is the difference between the scaling law's prediction and the observed loss of a large, trained model. With this, the team compared the scaling laws, and after analysis, distilled practical recommendations for AI practitioners about what makes effective scaling laws.
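For concreteness, ARE can be read as the gap between the predicted and observed loss, normalized by the observed loss; the helper below is an illustrative rendering of that metric, not the authors' evaluation code:

```python
def absolute_relative_error(predicted_loss: float, observed_loss: float) -> float:
    """ARE: how far a scaling law's forecast lands from the target model's
    measured loss, as a fraction of the measured loss."""
    return abs(predicted_loss - observed_loss) / observed_loss

# For example, forecasting a loss of 2.10 when the trained model measures 2.02
# gives an ARE of about 4 percent:
print(absolute_relative_error(2.10, 2.02))  # ~0.0396
```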
Their collective guidelines walk developers through steps and options to consider and expectations to set. First, it's critical to decide on a compute budget and target model accuracy. The team found that 4 percent ARE is about the best achievable accuracy one could expect, due to random seed noise, but up to 20 percent ARE is still useful for decision-making. The researchers identified several factors that improve predictions, such as including intermediate training checkpoints rather than relying only on final losses; this made scaling laws more reliable. However, very early training data, before 10 billion tokens, are noisy, reduce accuracy, and should be discarded. They recommend prioritizing the training of more models across a spread of sizes to improve robustness of the scaling law's prediction, not just larger models; selecting five models provides a solid starting point.
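A minimal sketch of how a couple of those guidelines might look when preparing checkpoint data, assuming simple (parameters, tokens seen, loss) records; the field layout and numbers here are hypothetical:

```python
# Hypothetical checkpoint records: (parameters, tokens_seen, loss)
checkpoints = [
    (70e6, 2e9, 3.9), (70e6, 12e9, 3.3),
    (160e6, 8e9, 3.5), (160e6, 24e9, 3.1),
    (410e6, 20e9, 2.9), (1.0e9, 40e9, 2.7), (1.4e9, 60e9, 2.6),
]

# Guideline: data from very early in training (before ~10B tokens) is noisy,
# so discard it before fitting a scaling law.
usable = [(n, d, l) for (n, d, l) in checkpoints if d >= 10e9]

# Guideline: favor a spread of model sizes (around five is a solid start)
# rather than relying only on the largest models.
sizes = sorted({n for (n, _, _) in usable})
print(f"{len(usable)} usable checkpoints across {len(sizes)} model sizes")
```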
Generally, including larger models improves prediction, but costs can be saved by partially training the target model to about 30 percent of its dataset and using that for extrapolation. If the budget is considerably constrained, developers should consider training one smaller model within the target model family and borrowing scaling law parameters from a model family with a similar architecture; however, this may not work for encoder-decoder models. Lastly, the MIT-IBM research group found that when scaling laws were compared across model families, there was strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model behavior. Together, these guidelines provide a systematic approach to making scaling law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints.
Several surprises arose during this work: small models partially trained are still very predictive, and further, the intermediate training stages from a fully trained model can be used (as if they were individual models) to predict another target model. “Basically, you don't pay anything in the training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did,” says Choshen. Another feature Andreas pointed out was that, when aggregated, the variability across model families and different experiments jumped out and was noisier than expected. Unexpectedly, the researchers found that it's possible to utilize the scaling laws on large models to predict performance down to smaller models. Other research in the field has hypothesized that smaller models were a “different beast” compared to large ones; however, Choshen disagrees. “If they're totally different, they should have shown totally different behavior, and they don't.”
While this work focused on model training time, the researchers plan to extend their analysis to model inference. Andreas says it's not, “How does my model get better as I add more training data or more parameters, but instead as I let it think for longer, draw more samples. I think there are definitely lessons to be learned here about how to also build predictive models of how much thinking you need to do at run time.” He says the theory of inference-time scaling laws might become even more critical because, “it's not like I'm going to train one model and then be done. [Rather,] it's every time a user comes to me, they're going to have a new query, and I need to figure out how hard [my model needs] to think to come up with the best answer. So, being able to build those kinds of predictive models is even more important.”
This research was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship.