Will big.LITTLE solve the side-channel problems in CPUs, or

0

...or will the progress in virtualization turn out to be decades of wasted work, because in the end "everything" will have to run on separate hardware in data centers anyway?

1

But how is it supposed to solve them? big.LITTLE combines cores with different clock frequencies and power requirements, so that not all cores have to be reclocked at once; its biggest advantage is saving battery on smartphones.

Which feature is supposed to solve this? I'm asking seriously, because I'm probably not up to date on the topic.

0

@nalik:

From what I've read, the idea was for untrusted code to run on the slow cores, with speculation/out-of-order execution disabled.

src: https://news.ycombinator.com/item?id=27002271
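
To make the idea a bit more concrete, here is a minimal Python sketch of such a routing policy. Everything in it is an assumption for illustration: `parse_cpu_list` and `cores_for` are hypothetical helpers, the topology is made up, and the sketch simply takes for granted that the "slow" cores really are in-order and non-speculative, which it does not verify.

```python
def parse_cpu_list(text: str) -> set[int]:
    """Parse a Linux-style cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cores_for(trusted: bool, fast: set[int], slow: set[int]) -> set[int]:
    """Hypothetical policy: untrusted code may only run on the slow cores.

    Assumes the slow cores are in-order and non-speculative; nothing here
    checks that.
    """
    if not trusted:
        if not slow:
            raise RuntimeError("no slow cores available for untrusted code")
        return set(slow)
    # Trusted code may use every core.
    return set(fast) | set(slow)


# Made-up topology: 4 fast cores (0-3) and 4 slow cores (4-7).
FAST = parse_cpu_list("0-3")
SLOW = parse_cpu_list("4-7")
print(sorted(cores_for(trusted=False, fast=FAST, slow=SLOW)))  # [4, 5, 6, 7]
```

The resulting set would then be handed to whatever mechanism actually confines the code (CPU affinity, cgroups, a hypervisor); the sketch only covers the decision, not the enforcement.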

1

Interesting. How long has this idea been around? When I was at Samsung in 2013-2015 I hadn't heard of it yet, and work on big.LITTLE was already under way.

0

@nalik:

I don't know, I was going by the HN comment linked above, but if I had to guess, somewhere around the dates of the first high-profile vulnerabilities in Intel CPUs (Spectre/Meltdown?), so roughly 2017 or later?

1

That would be my guess too. So it sounds like an attempt to get a bit more out of it than ARM originally planned.

0

@nalik:

The question is whether it's an attempt or a necessity, because what other options do they have for patching all these holes?

If they turned off all their performance hacks, that would be quite a circus :D

And by the way, why are there so many JIT-related vulnerabilities in Chrome? Is it simply because Chrome, compared to other runtimes, runs an incredible amount of untrusted code?

0

A new type of vulnerability was recently discovered in CPUs with OoOE: New Spectre Variants Discovered By Exploiting Micro-op Caches. As far as I understand it, this only affects CPUs with speculative execution and (obviously) micro-op caches. Out-of-order execution and speculative execution are orthogonal in theory, but one almost always comes with the other, so I mixed the two terms up here :] Strictly speaking, there are CPUs that have one but not the other (which is why research papers distinguish the two clearly):
https://stackoverflow.com/a/49603253

Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. In the middle of the computing spectrum (mobile phones and microcontrollers), you can find processors that are OoO, but not speculative (such as the ARM Cortex-A9). The Intel Bonnell microarchitecture is speculative, but in-order. Speculative execution is particularly beneficial when used with OoO.

0

A bit outside the security focus of the original thread, but:

https://www.agner.org/forum/viewtopic.php?t=79&p=187#p187

A chimera is a monster combining parts from different animals, or an organism containing multiple different sets of DNA. I am calling Intel's new Alder Lake processor a chimera because it is a hybrid containing two different kinds of CPU cores with very different designs.

The Alder Lake processor contains from 2 to 8 cores of the 'Golden Cove' architecture, called P cores, and from 0 to 8 cores of the 'Gracemont' architecture, called E cores. The P cores (Performance cores) are high-performance CPU cores using the latest state-of-the-art technology to get maximum performance. The E cores (Efficiency cores) use same technology as the 'Atom' series with low power consumption and lower performance. The idea behind this design is that the P cores can give a high performance for a limited number of threads, while the E cores allow the CPU to run many threads and still limit the power consumption. This may sound like a nice compromise in theory, but it involves a lot of problems when the same program or the same thread can jump arbitrarily between two very different kinds of cores.

The initial Alder Lake design had different CPUID numbers for the two kinds of cores. This gave problems with DRM software. If a program using DRM detects that the CPUID has changed, it will assume that the program has been moved to a different computer in violation of the license. This, of course, will stop the execution. Intel had to modify the Alder Lake and give it the same CPUID for all cores in order to fix this problem[1]. Now, it is difficult for a running program to detect what kind of core it is running on.

Another problem is that the P cores are designed for the latest instruction set extensions, including AVX512 and a new set of half-precision floating point instructions (AVX-512 FP16) that are useful for neural networks. The E cores only support AVX2, not the later instruction set extensions, such as AVX512. What would happen if a program that starts executing in a P core and detects that AVX512 instructions are available is moved by the operating system to an E core that doesn't support this instruction set? A smart operating system might catch the error when the program attempts to execute an AVX512 instruction and move it back to a P core. But this requires that the operating system is designed with special support for the Alder Lake. If the program is running on an older operating system, it will crash in this situation. Therefore, Intel had to disable all instructions that are not supported by the E cores. The AVX512 instructions are actually implemented in the hardware, but they are disabled. Some motherboards have a BIOS feature that makes it possible to disable the E cores and enable the AVX512 instructions[2]. This feature is not endorsed by Intel, and it has now been disabled in a microcode update, even for the i3 models that have no E cores[3]. Intel have actually sacrificed their flagship 512-bit instructions in order to run multiple threads in low-power cores.

It is very difficult to optimize the software execution for this hybrid system. A further complication is that a P core can run two threads in the same core so that each thread gets half of the resources. This is what Intel call hyperthreading. A program thread may run in three different configurations with different performance parameters:

1. Running alone in a P core with maximum performance
2. Sharing a P core with another thread, giving half the resources
3. Running in a low-power E core

It is completely unrealistic that an application program can handle this situation in a reasonable manner and optimally allocate different threads to the different cores. Hardly any software application company can afford to make different versions of their code for every new microprocessor model and verify, maintain, and support all these versions. The Alder Lake has implemented a special hardware solution to this problem called the 'Intel Thread Director'. The Intel Thread Director is an embedded microcontroller that monitors all threads and measures the resource use of each thread. The operating system can use this information to calculate the optimal allocation of P cores and E cores to the different threads[4]. Windows 11 has support for the Intel Thread Director. Future versions of Linux are planned to support it too[5], while there are no known plans to support it in MacOS[6].

The way that Windows 11 handles this problem is still flawed, however. The system is giving high priority only to the thread that has the user focus. This ignores the behavior of many users. A user who is waiting for the computer to finish a heavy duty task is typically not just sitting and waiting. He/she is more likely to do something else during the waiting time, for example checking mails[2]. There are various technical options that the user can use to control the prioritization of threads, but it is unreasonable to require that the user understands and masters such options when the user's attention is on a complicated calculation task rather than on the hardware details of a specific computer. It is already quite difficult to optimize for hyperthreading, as I have argued before[7]. The hybrid design of the Alder Lake just makes the optimization an order of magnitude more complicated. It looks like the hardware designers have unrealistic expectations of how much software designs can be attuned to processor-specific peculiarities.

I have tested an Alder Lake, but I have not been able to get access to a setup that makes it possible to enable the AVX512 instructions. The performance of the P cores is improved somewhat over the Intel Ice Lake. The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread or 3 µops per thread when running two threads. This throughput is not limited by code cache lines. The decoders can deliver a maximum of 4 µops per clock for a single thread or 2 µops per thread when running two threads. The decoders can handle a maximum of 16 bytes per clock, or 2x16 bytes when running two threads. The figures of 6 decoders and 8 µops per clock published elsewhere[4] are not confirmed by my measurements.

Instruction latencies and throughputs are similar to the Ice Lake for most instructions, but the latency for floating point addition is reduced from 4 to 2 clock cycles. I have not published instruction tables for the Alder Lake. I prefer to wait until a pure Golden Cove with all instructions enabled becomes available.

An interesting class of problems is surfacing.

1

@1a2b3c4d5e: are you sure that post was on topic? :)

1a2b3c4d5e wrote:

https://www.agner.org/forum/viewtopic.php?t=79&p=187#p187

I have tested an Alder Lake, but I have not been able to get access to a setup that makes it possible to enable the AVX512 instructions. The performance of the P cores is improved somewhat over the Intel Ice Lake. The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread or 3 µops per thread when running two threads. This throughput is not limited by code cache lines. The decoders can deliver a maximum of 4 µops per clock for a single thread or 2 µops per thread when running two threads. The decoders can handle a maximum of 16 bytes per clock, or 2x16 bytes when running two threads. The figures of 6 decoders and 8 µops per clock published elsewhere [4] are not confirmed by my measurements.

* 4 = https://www.anandtech.com/print/16881/a-deep-dive-into-intels-alder-lake-microarchitectures

Ouch! This would fit better in this thread: https://4programmers.net/Forum/Nietuzinkowe_tematy/359230-dlaczego_rozne_dlugosci_instrukcji_x86_to_az_taki_problem

0

In the context of what you linked:

Intel Thread Director

One of the biggest criticisms that I’ve levelled at the feet of Intel since it started talking about its hybrid processor architecture designs has been the ability to manage threads in an intelligent way. When you have two cores of different performance and efficiency points, either the processor or the operating system has to be cognizant of what goes where to get the best result from the end-user. This requires doing additional analysis on what is going on with each thread, especially new work that has never been before.

To date, most desktop operating systems operate on the assumption that all cores and the performance of everything in the system is equal. This changed slightly with simultaneous multithreading (SMT, or in Intel speak, HyperThreading), because now the system had double the threads, and these threads offered anywhere from zero to an extra 100% performance based on the workload. Schedulers were hacked a bit to identify primary and secondary threads on a core and schedule new work on separate cores. In mobile situations, the concept of an Energy Aware Scheduler (EAS) would look at the workload characteristics of a thread and based on the battery life/settings, try and schedule a workload where it made sense, particularly if it was a latency sensitive workload.

Mobile processors with Arm architecture designs have been tackling this topic for over a decade. Modern mobile processors now have three types of core inside – a super high performance core, regular high performance cores, and efficiency cores, normally in a 1+3+4 or 2+4+4 configuration. Each set of cores has its own optimal window for performance and power, and so it relies on the scheduler to absorb as much information as possible to determine the best way to do things.

Such an arrangement is rare in the desktop space - but now with Alder Lake, Intel has an SoC that has SMT performance cores and non-SMT efficient cores. With Alder Lake it gets a bit more complex, and the company has built a technology called Thread Director.

That’s Intel Thread Director. Not Intel Threat Detector, which is what I keep calling it all day, or Intel Threadripper, which I have also heard. Intel will use the acronym ITD or ITDT (Intel Thread Director Technology) in its marketing. Not to be confused with TDT, Intel’s Threat Detection Technology, of course.

+

Intel’s Thread Director controller puts an embedded microcontroller inside the processor such that it can monitor what each thread is doing and what it needs out of its performance metrics. It will look at the ratio of loads, stores, branches, average memory access times, patterns, and types of instructions. It then provides suggested hints back to the Windows 11 OS scheduler about what the thread is doing, whether it is important or not, and it is up to the OS scheduler to combine that with other information about the system as to where that thread should go. Ultimately the OS is both topologically aware and now workload aware to a much higher degree.

How many thread schedulers, or levels of thread schedulers, are there really, especially when you're coding in something like Java/.NET?

And at which levels are they implemented?

VM/runtime? OS? Hardware?

2

I don't know much about VMs for machine virtualization, so I won't comment on that layer.

As for the threads provided by the operating system (called platform threads in http://openjdk.java.net/projects/loom/), applications generally don't have much ability to influence how the scheduler works (that is, to choose the specific core they want to run on). There is thread affinity, which as far as I know is not available from Java's standard library (though you can work around that by running native code, e.g. via JNI, Project Panama, etc.), and there are thread priorities. The system thread scheduler preempts threads or moves them from core to core whenever it sees fit. Thread affinity narrows that choice, but doesn't let you control it directly on an ongoing basis. Instead of building mechanisms that block preemption or threads hopping from core to core, people build mechanisms that react to it from user space; what I mean here is the (relatively?) new restartable sequences mechanism: https://lwn.net/Articles/883104/
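
For illustration, this is roughly what "affinity narrows the choice" looks like in practice. The sketch uses Python rather than Java, since Python's `os` module exposes the affinity calls on Linux; `pin_to_first_allowed_cpu` is a hypothetical helper name, and picking the lowest-numbered CPU is an arbitrary choice for the demo.

```python
import os


def pin_to_first_allowed_cpu() -> set[int]:
    """Restrict the caller to a single CPU it is already allowed to use.

    Linux-only: os.sched_getaffinity / os.sched_setaffinity are not
    available on every platform. Returns the previous affinity mask so the
    caller can restore it.
    """
    previous = os.sched_getaffinity(0)   # pid 0 means "the caller"
    target = {min(previous)}             # arbitrary choice for the demo
    os.sched_setaffinity(0, target)
    return previous


if hasattr(os, "sched_setaffinity"):
    before = pin_to_first_allowed_cpu()
    now = os.sched_getaffinity(0)
    print(f"allowed before: {sorted(before)}, allowed now: {sorted(now)}")
    os.sched_setaffinity(0, before)      # restore the original mask
else:
    print("CPU affinity is not exposed by the os module on this platform")
```

Note what the call does and does not do: it shrinks the set of CPUs the scheduler may pick from, but the scheduler still decides when the code runs and when it is preempted.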

I haven't dug into Intel Thread Director, but it all sounds as if Thread Director merely collects statistics from the processor and makes them available to the system scheduler. (Various kinds of sampling have been available for a long time, e.g. https://developer.amd.com/wordpress/media/2012/10/AMD_IBS_paper_EN.pdf, but that wasn't aimed at feeding the data back to the system scheduler; rather, I think, at profiling the performance of pre-selected pieces of code.) As far as I know, the processor has no notion of OS threads. A CPU core just sees a stream of instructions and that's it. It's the operating system that decides what runs on which core. With the data from Thread Director it will be able to make better decisions.
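
A toy model of that division of labour, purely as a sketch: the hardware side only produces hints, and the placement decision is made by code standing in for the OS scheduler. `ThreadHint`, its `demand` score, and `place` are all invented for this example and do not reflect the real Thread Director interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadHint:
    """Hypothetical per-thread hint; real Thread Director data looks different."""
    thread_id: int
    demand: float  # made-up score: higher means "benefits more from a fast core"


def place(hints: list[ThreadHint], p_cores: list[int], e_cores: list[int]) -> dict[int, int]:
    """Toy OS-side policy: the most demanding threads get P cores first.

    The hints come from the hardware; the decision is made here.
    One thread per core, to keep the model simple.
    """
    free = list(p_cores) + list(e_cores)   # P cores are handed out first
    if len(hints) > len(free):
        raise ValueError("more threads than cores in this toy model")
    ranked = sorted(hints, key=lambda h: h.demand, reverse=True)
    return {h.thread_id: core for h, core in zip(ranked, free)}


hints = [ThreadHint(1, 0.2), ThreadHint(2, 0.9), ThreadHint(3, 0.5)]
print(place(hints, p_cores=[0, 1], e_cores=[2, 3]))  # {2: 0, 3: 1, 1: 2}
```

A real scheduler would also weigh priorities, power state, cache locality and fairness; the point here is only that the hints are an input to the OS, not a decision made by the CPU.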
