Nvidia Issues Hotfix for GPU Driver Overheating Problem

Yesterday, Nvidia pushed out a critical hotfix to contain the fallout from an earlier driver release, which had caused alarm in the AI and gaming communities by making systems falsely report safe GPU temperatures, even as cooling demands quietly climbed toward potentially critical levels.

In Nvidia's official post about the hotfix release, the problem is listed only third among the fixes, and is described as “GPU monitoring tools may stop reporting the GPU temperature after the PC wakes from sleep”.

Shortly after the affected Game Ready driver 576.02 was rolled out, a pinned post on the Stable Diffusion subreddit, titled “Read to save GPU!”, became a hub for anecdotal problems and updates reported by users of the new driver. From these and other reports around the internet, a rough timeline of the emerging problems can be pieced together.

The first Reddit report of the bug appears to have been made late on Friday afternoon (UTC), in the ZephyrusG14 subreddit, where the user Frycy81 quoted a post on the Nvidia forums (archived):

A user on the NVIDIA forums reports problems after updating to 576.02. Source: https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/563010/geforce-grd-57602-feedback-thread-relerease-41625/3524072/

The NVIDIA forums user reported that, after installing the update, tools such as MSI Afterburner and in-game monitors such as the one in Call of Duty (which generally access native system readings, like the GPU panel in Windows Task Manager) stopped updating their GPU temperature readings, which froze at around 35-36°C.

The user stated that restarting the monitoring software had no effect, and that only a full system restart restored accurate readings. Tools such as HWiNFO and NVIDIA's own monitoring application still reported temperatures correctly. The user emphasized that the problem occurred during normal use, not only after waking the system from sleep.
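The symptom described above, one tool frozen at a flat value while another keeps moving, can be checked for mechanically. The following is a minimal sketch rather than a diagnostic tool: `read_gpu_temp_c` and `looks_frozen` are hypothetical helper names, the 1°C tolerance is an arbitrary assumption, and the sketch assumes that `nvidia-smi` is on the PATH and is itself unaffected by the bug, which the user reports do not confirm.

```python
import subprocess
from typing import Sequence


def read_gpu_temp_c() -> float:
    """Ask nvidia-smi for the current temperature of the first GPU, in °C.

    Assumption: nvidia-smi is installed and reports correctly under 576.02.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One line per GPU; take the first.
    return float(out.splitlines()[0].strip())


def looks_frozen(samples: Sequence[float], reference: Sequence[float],
                 tolerance_c: float = 1.0) -> bool:
    """True if `samples` stays flat while `reference` moves by more than the tolerance.

    `samples` come from the suspect tool, `reference` from a trusted one,
    both collected over the same period.
    """
    if len(samples) < 2 or len(reference) < 2:
        return False  # not enough data to judge
    sample_spread = max(samples) - min(samples)
    reference_spread = max(reference) - min(reference)
    return sample_spread <= tolerance_c and reference_spread > tolerance_c
```

For example, `looks_frozen([35, 36, 35, 36], [41, 55, 68, 72])` flags the first series as frozen, while two series that rise together are not flagged.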

User feedback across various forums pointed to a general disruption of normal fan-curve behavior and changes to core thermal regulation, leaving GPUs idling at unexpectedly high temperatures and overheating alarmingly under what are usually considered standard operating loads, as detailed in this comment:

“I could tell that something was off. The weather outside was probably about 55°F / 12°C, but I was cooking alive in my room. My window was open, and yet I felt no difference. All the fans were running at max, and after some time the temps looked fine, at around 68°C to 72°C.

“At first it seemed normal, until the next morning, when I realized that those were not idle temperatures, and the fans were still [kicking].

“I had recently done AI overclocking after fixing a few things, so I wasn't sure whether it had simply pushed the values too high. That happened once after I installed Asus AI Suite 3, when the BIOS settings would not even work correctly.

“In any case, I have rolled back to an older driver for now.”

Unjust

The official release notes PDF for the 576.02 driver update offers some clues about changes that could have contributed to the new problems. In section 5.5, NVIDIA acknowledges that GPU temperature can be reported incorrectly on NVIDIA Optimus systems, specifically showing zero degrees when no applications are running.

Section 5.5 of the official 576.02 release notes addresses temperature-monitoring problems that appear to affect a wider range of systems than Optimus alone. Source: https://us.download.nvidia.com/windows/576.02/576.02-win11-win10-release-notes.pdf

The release notes state:

5.5 GPU Temperature Reported Incorrectly on Optimus Systems

5.5.1 Issue

On Optimus systems, temperature reporting tools such as Speccy or GPU-Z report that the NVIDIA GPU temperature is zero when no applications are running.

5.5.2 Explanation

On Optimus systems, when the NVIDIA GPU is not being used, it is put into a low-power state. This causes temperature reporting tools to return incorrect values. Waking up the GPU to query the temperature would result in meaningless measurements, because the GPU temperature would change as a result.

These tools will report accurate temperatures only when the GPU is awake and running.

Nvidia Optimus is a GPU-switching technology that moves between integrated and discrete graphics based on application requirements, automatically balancing performance against power draw in order to extend battery life. For tasks such as video games or HD video playback, Optimus activates the discrete GPU for better performance; during lighter activities, such as web browsing, it returns to the integrated (built-in) graphics.

It seems that the update extended behavior previously limited to Optimus systems, allowing an affected GPU to enter a low-power state while idle even when it is not part of an Optimus system, and disrupting temperature reporting in third-party tools in the process.
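If a sleeping GPU reports zero, as section 5.5 of the release notes describes, any script that averages or logs temperatures needs to treat zero as “no reading” rather than as a genuine measurement. A minimal sketch of that idea follows; `awake_average` is a hypothetical helper, and treating every non-positive value as a sleep placeholder is an assumption made for illustration.

```python
from typing import Iterable, Optional


def awake_average(readings_c: Iterable[float]) -> Optional[float]:
    """Mean temperature in °C, skipping placeholder readings from a sleeping GPU.

    Assumption: values of 0 or below are placeholders, not measurements.
    Returns None when no usable reading is left.
    """
    awake = [r for r in readings_c if r > 0]
    if not awake:
        return None
    return sum(awake) / len(awake)
```

With the samples `[0, 0, 40, 44]`, the naive mean would be 21°C, whereas `awake_average` returns 42.0, the mean of the two readings taken while the GPU was awake.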

Risk adjustment

In most scenarios, the graphics card's VBIOS would probably prevent permanent GPU damage. The VBIOS enforces thermal and power limits at the firmware level, independently of the driver.

Therefore, even if the driver caused the fans to behave incorrectly or reported the temperature incorrectly, the VBIOS should still throttle performance, increase fan activity, or shut down the GPU to prevent hardware failure.

This does not mean that the risk was trivial: sustained high temperatures can degrade performance over time or stress adjacent components. In addition, without a common understanding that the updated driver caused the problem (especially on systems where drivers update “silently”), an issue of this kind may mislead a large share of affected users, who may attempt remedies for non-existent problems, and potentially even damage their systems by “repairing” things that were never broken.

The faulty behavior caused by the 576.02 update was particularly worrying for people working with AI, where high-performance hardware is routinely pushed to its thermal limits for long periods.

The problematic 576.02 driver prompted a wider rash of complaints after its release in mid-April, despite initial reports that it offered some welcome performance improvements. Regardless of the availability of the hotfix, and of the level of disruption that 576.02 appears to have caused, the driver is, at the time of writing, still available for download* on the Nvidia website.

Glow

As for the fallout from the faulty update, many kinds of damage and/or inconvenience have been reported: Frankie_T9000 reported that his GPU crashed at startup due to heat build-up under the faulty update, and stabilized only after being slowed down. He commented: “It seems that it is not permanent, but I need to repaste as soon as possible (I have pads arriving on Wednesday). I suspect that the old thermal paste got worse from the heat build-up, so I am applying new paste.”

Another user in the same thread noted yesterday: “I use a custom fan curve from Afterburner, and it kept showing my GPU temperature as a constant 27°C, so the fans did not turn on, which led to overheating problems. I thought something was broken, but after installing the previous driver, everything worked well again. In addition, the temperatures are not displayed correctly in Task Manager.”
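The failure mode in this report follows directly from how a custom fan curve works: fan speed is a function of the reported temperature, so a reading stuck at 27°C keeps the fans idle no matter how hot the GPU actually is. The sketch below illustrates this with a piecewise-linear curve. The curve points are hypothetical and are not taken from Afterburner or from the user's setup; `fan_duty` is an illustrative name.

```python
from bisect import bisect_right

# Hypothetical curve: (temperature in °C, fan duty in %) pairs, sorted by temperature.
FAN_CURVE = [(30, 0), (50, 30), (70, 60), (85, 100)]


def fan_duty(temp_c, curve=FAN_CURVE):
    """Linearly interpolate fan duty (%) for a reported temperature."""
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return curve[0][1]   # below the first point: lowest duty
    if temp_c >= temps[-1]:
        return curve[-1][1]  # above the last point: highest duty
    i = bisect_right(temps, temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


# The curve only ever sees the reported value, never the true one.
reported, actual = 27, 80
print(fan_duty(reported))  # duty the fans actually run at
print(fan_duty(actual))    # duty they would run at with a correct reading
```

On this hypothetical curve the frozen 27°C reading yields 0% duty, while a correct reading of 80°C would call for roughly 87%.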

Although NVIDIA (as it insists in every hotfix release) often provides hotfixes for individual video games or platforms, the risk of heat damage to the GPU is higher for AI practitioners than for gamers. Intensive machine learning processes, such as training or other sustained workloads, keep GPUs under a consistent long-term load of a kind that is likely to occur only intermittently in a game, which may “spike” into high usage for a boss fight or a particularly demanding section of a map, but is otherwise designed as a compromise between GPU usage and system stability.

* Archive: https://archive.ph/ylvr1
