33

I have a laptop with a dedicated GPU, Nvidia Quadro P3200. It has 6 GB of RAM.

The laptop also has 32 GB of “normal” (CPU?) RAM.

I’m planning on using the GPU for parallel computing, running physics simulations. Some of these involve quite big arrays.

I am just wondering, if the total memory (all the variables and all the arrays) in my kernel hits 6 GB of the GPU RAM, can I somehow use the CPU’s one?

I would not be using the laptop for anything else during the computation so the main RAM should not be busy.

P.S. I am using a Dell Precision 7530, Windows 10.

SuperCiocia
  • 653
  • 2
  • 7
  • 9
  • 1
    Maybe not for your particular setup, but historically there have been such features, a.k.a. TurboCache https://en.wikipedia.org/wiki/TurboCache – user3528438 Apr 26 '20 at 18:33
  • 1
    Note that it is not CPU RAM per se, it is the RAM on the northbridge that is available to the CPU as well as PCI-E (or whatever other devices on the northbridge). The GPU used to be a PCI device on the southbridge and the bus would be limiting, but I think modern GPUs are PCI-E. Also, DDR (RAM) is not as fast as GDDR (graphics). Also the CPU has its own on-chip memory (cache) as SRAM, which is super fast (varying between L1/L2/L3) and hardly refreshed (S=static), but really small (L3 is ~2MB/core but can be more... some L3 is ~50MB total). Calling RAM "CPU RAM" makes me think of the on-chip cache – vol7ron Apr 28 '20 at 16:30
  • To do what you're asking requires a combination of hardware (to overcome the bottleneck) and software (to do the memory management). An interesting presentation related to your question can be found here: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9426-using-tensor-swapping-and-nvlink-to-overcome-gpu-memory-limits-with-tensorflow.pdf – William D. Irons Apr 28 '20 at 21:48

5 Answers

48

Short answer: No, you can't.

Longer answer: The bandwidth, and more importantly the latency, between the GPU and system RAM over the PCIe bus is an order of magnitude worse than between the GPU and VRAM, so if you are going to do that you might as well be number-crunching on the CPU.
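
If you want a rough feel for that gap on your own card, a minimal sketch along these lines (CUDA runtime API, error checking omitted) times a copy from pinned system RAM into VRAM against a copy within VRAM:

#include <cuda_runtime.h>
#include <cstdio>

// Time a single cudaMemcpy of `bytes` with CUDA events and return GB/s.
static float copy_gbps(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB test buffer
    void *host = nullptr, *dev_a = nullptr, *dev_b = nullptr;
    cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  // pinned system RAM
    cudaMalloc(&dev_a, bytes);
    cudaMalloc(&dev_b, bytes);

    printf("pinned RAM -> VRAM over PCIe: %6.1f GB/s\n",
           copy_gbps(dev_a, host, bytes, cudaMemcpyHostToDevice));
    printf("VRAM -> VRAM on the card:     %6.1f GB/s\n",
           copy_gbps(dev_b, dev_a, bytes, cudaMemcpyDeviceToDevice));

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFreeHost(host);
    return 0;
}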

The CPU can use a part of VRAM (the part mapped into the PCI aperture, usually 256 MB) directly as RAM, but it will be slower than regular RAM because PCIe is a bottleneck. Using it for something like swap might be feasible.

It used to be possible to increase the memory aperture size by changing the strap bits on the GPU BIOS, but I haven't tried this since Nvidia Fermi (GeForce 4xx) GPUs. If it still works, it is also required that your BIOS is up to the task of mapping apertures bigger than standard (it is highly unlikely to have ever been tested on a laptop).

For example, a Xeon Phi compute card needs to map its entire RAM into the PCI aperture, so it needs a 64-bit capable BIOS in the host that knows how to map apertures above the traditional 4GB (32-bit) boundary.

Peter Cordes
  • 5,681
  • 1
  • 28
  • 33
Gordan Bobić
  • 3,330
  • 1
  • 18
  • 23
  • Larrabee itself was never released commercially https://en.wikipedia.org/wiki/Larrabee_(microarchitecture). 1st-gen Xeon Phi is either [Knight's Ferry](https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Ferry) if you count that prototype version (never widely available commercially), or Knight's Corner which Wikipedia calls "Intel's first many-cores commercial product". The Larrabee project evolved from being a GPU into the "many integrated cores" (MIC) for Xeon Phi, using a precursor of AVX512. (And then actual AVX512 in Knight's Landing, 2nd gen Phi) – Peter Cordes Apr 26 '20 at 23:11
  • All the Xeon Phis on a PCIe card models had the same requirement in that the host motherboard's BIOS has to be able to map a 16GB memory aperture which offered direct access into the Xeon Phi's entire RAM. – Gordan Bobić Apr 26 '20 at 23:43
  • I was only nitpicking the use of the "(Larrabee)" name. The way it actually works makes sense. I edited your answer to just link wikipedia for Xeon Phi in case people don't know what it is. – Peter Cordes Apr 26 '20 at 23:53
  • *> it needs a 64-bit capable BIOS in the host* That part, at least, isn't much of an issue these days: laptops that ship with UEFI and Windows will come with 64-bit UEFI firmware as a matter of course, as Windows does not support booting 64-bit Windows from 32-bit UEFI firmware. Whether that necessarily implies support for >4 GB apertures, though, I don't know. – Bob Apr 27 '20 at 04:07
  • 11
    Note that DDR5 has ~52GiB/s BW per module (say 200 GiB/s for a 4-channel), PCIe 6.0 has 256GiB/s BW for 16x. So PCIe is not going to be the bottleneck, the RAM itself will. The GDDR is faster and optimised for multiple readers, so it's obviously faster but the RAM can still be used as a slower "cache" (like Optane DC memory works) and a simple prefetch can *totally* hide the latency (since GPU works with *stream* tasks, this is easy). BTW the GPU **can** access the RAM, every PCI(e) device can be a master unless disabled. That's why things like IOMMU and GART exist. – Margaret Bloom Apr 27 '20 at 06:52
  • 3
    Also, the "aperture" was an AGP/32-bit thing, with a 64-bit address space, there's no problem mapping even a 1TiB block. Not all the cards allow mapping their entire memory though. – Margaret Bloom Apr 27 '20 at 06:52
  • 1
    @ Bob UEFI firmware does not imply the ability to map apertures above 4GB. @ Margaret there may be bandwidth, but latency will be much higher. Very few cards are set up to expose a large IOMEM area. The only devices I am aware of that expose their entire memory that way are the Xeon Phis. – Gordan Bobić Apr 27 '20 at 08:13
  • 2
    i see one expert with 1st hand knowledge and years of experience. and i see one post using a screenshot that doesn't prove anything. i will trust the guy with years of experience. – Trevor Boyd Smith Apr 27 '20 at 18:01
  • 1
    I disagree with this answer. – Overmind Apr 28 '20 at 11:32
  • 2
    @MargaretBloom No, the PCIe connection is a bottleneck. The bandwidth may be high in the example you gave (which is not available in current mainstream CPUs) but it has incredibly high latency compared to directly accessing the memory and is very low bandwidth compared to VRAM. It may be possible to hide the latency with high throughput devices like GPUs but the latency is much higher (at least 20x-30x higher) than VRAM or regular RAM. – Toothbrush Apr 28 '20 at 13:15
  • @Toothbrush Yes, of course, local memories will always be faster than distant ones. However, I don't see latency as a dominant factor to the point of calling PCIe the bottleneck. If the GPU can find enough independent loads it can occupy the full bandwidth of the RAM and after the first roundtrip time, latency won't matter anymore (as long as the traffic is sustained). Being massively parallel, it shouldn't be hard to find such loads. The problem is PCIe+RAM is not optimized for the internal GPU architecture, a local, specifically crafted, memory will always be better. – Margaret Bloom Apr 28 '20 at 13:42
  • 1
    Yes, but PCIe is designed for packet transfers, not memory transfers. With higher speeds like those in PCIe 5.0/6.0, this is not such a problem. However, with current hardware, PCIe 3.0 (15.75 GB/s per direction minus channel overheads) or PCIe 4.0 (31.51 GB/s per direction minus channel overheads) is limiting even with 2 channels of DDR4 (19.2–25.6 GB/s per channel). – Toothbrush Apr 28 '20 at 13:57
  • 1
    I think the point that is missed is being able to accomplish a task with massive amounts of data. Even if RAM is slower, it’s better than disk paging/thrashing, or reaching max GDDR capacity. – vol7ron Apr 28 '20 at 16:39
  • This answer is not correct. Both sides of the fence, VRAM and RAM, are mapped to each other. They can't directly access each other, rather they have pools which are mapped to each other. VRAM is completely abstracted and, from the perspective of the system, completely virtualised. Developers use an API, which is implemented by an ICD (driver) that transparently manages memory mapping. – lfgtm May 12 '20 at 12:27
  • A lot of issues related to game RAM vs. VRAM usage have changed with Smart Access Memory, which is currently supported by AMD Zen 3 CPUs (like Ryzen 5 5600X and Ryzen 7 5800X), and AMD 6000 series GPUs (like the AMD Radeon RX 6800), and will be supported within the next few weeks by the Nvidia RTX 3000 series GPUs, and later on by the 11th-gen Intel CPUs, but Intel's version of the technology, and the name used on even some AMD motherboards, is Resizable BAR. This provides more VRAM access to the CPU, but things may eventually work the opposite way, where the GPU can access more of the RAM. – Rok Jan 18 '21 at 14:07
  • @MargaretBloom Why do GPUs have an internal GART aperture + GTT instead of using internal virtual addresses and only having a GTT? Just a quick Google search for dmesg GTT seems to show a lot of GPU drivers configuring an aperture range in the internal physical address space – Lewis Kelsey Jul 27 '22 at 03:14
  • @LewisKelsey AFAIK the "aperture" is the region of memory configured in the (old) north bridge to make it aware of what range to translate with the GART. Nowadays the GPU will do the translation itself (is that what you mean by internal virtual addresses?) and I think it still needs to know what address range is subject to the GPU<->CPU remapping with the GART. I'm not well versed in modern GPU architectures, I don't know if GPUs perform virtual-to-physical translation for their internal memory; anyway VA<->PA mapping is different from GPU<->CPU, so maybe an aperture must still be configured – Margaret Bloom Jul 27 '22 at 10:34
  • There is a new thing called "Resizable BARs" now, which allows a base address register to start anywhere in the entire 64 bit physical address space, and can be sized up to the entire 64 bit address space. It breaks the 16MB PCI BAR limitation. That allows you to just map the entire video memory into CPU address space, if your CPU is new enough and your GPU is new enough, and the CPU likes your GPU manufacturer. – doug65536 Mar 05 '23 at 06:22
12

Yes. This is the "shared" memory between the CPU and GPU. There is always going to be a small amount required as buffers to transfer data to and from the GPU, but it can also be used as a slower "backing" to the graphics card, in much the same way as a pagefile is a slower backing store to your main memory.

You can find shared memory in use in the built-in Windows Task Manager by going to the Performance tab and clicking on your GPU.

(Screenshot: Task Manager → Performance → GPU, showing dedicated and shared GPU memory)

Shared memory will be slower than your GPU memory, though probably faster than your disk. Shared memory will be your CPU memory, which may operate at up to 30GB/s on a reasonably new machine, but your GPU memory is probably able to do 256GB/s or more. You will also be limited by the link between your GPU and CPU, the PCIe bridge. That may be your limiting factor, and you will need to know whether you have a Gen3 or Gen4 PCIe link and how many lanes (usually "x16") it is using to work out the total theoretical bandwidth between CPU and GPU memory.
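
If you would rather query that link programmatically than dig it out of a spec sheet, here is a minimal sketch using Nvidia's NVML library (assuming the NVML header and library that ship with the driver/CUDA toolkit are available; link with -lnvidia-ml, error checking omitted):

#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU in the system

    unsigned int gen = 0, width = 0;
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);

    // Rough per-lane, per-direction rates after encoding overhead:
    // Gen3 ~0.985 GB/s, Gen4 ~1.969 GB/s
    printf("PCIe link: Gen%u x%u\n", gen, width);

    nvmlShutdown();
    return 0;
}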

Mokubai
  • 89,133
  • 25
  • 207
  • 233
  • 6
    I think you will find you got that backwards. That is the buffer area managed by the CPU for staging data in and out of the GPU memory. It is managed by the code running on the CPU and not visible to the code running in the GPU. – Gordan Bobić Apr 26 '20 at 09:19
  • Not visible, but if the code for GPU overallocates then it will be used. It is, as I mentioned, effectively like the page file. Used as a backup to be able to expand the memory. That you (probably) have 18GB of shared memory available to a 6GB card tells you that the software can use more than locally available on the graphics card. – Mokubai Apr 26 '20 at 09:25
  • 2
    The key point is that it is CPU rather than GPU managed. GPU doesn't have access to it, it's the CPU that uses it as a staging area for pushing data to and from the GPU. – Gordan Bobić Apr 26 '20 at 09:29
  • 1
    @SuperCiocia it depends on your use case, how you access it and broadly what *exactly* you mean by can the GPU use it. The GPU itself cannot access CPU RAM, but it can be made to look like it is able to use the CPU RAM by the software running on the CPU. Effectively your CPU can take program memory that would be on your GPU and replace it with the contents of memory in local RAM. In that way you have more (albeit slower) memory than what is just on the GPU. – Mokubai Apr 26 '20 at 21:13
  • What I mean is: I am running a python code with the tensorflow-gpu library. It is faster because of the 1000+ cores of the GPU but it’s limited by the gpu memory. I was asking if the gpu can “use” the normal RAM and use it as an extension of its own. Is it clear? – SuperCiocia Apr 26 '20 at 21:38
  • 2
    *The GPU itself cannot access CPU RAM* - Are you sure that's true? A PCIe device can do transactions that reads or writes 64 bytes from any physical address. Or are you making a distinction between the actual GPU processing chip on the graphics card having a memory bus directly connected to the DRAM on the graphics card, vs. having to make PCIe transactions for addresses that are CPU physical memory. (However the GPU's own internal address-space is configured.) – Peter Cordes Apr 26 '20 at 23:18
  • 4
    @SuperCiocia: if the GPU drivers let you do that it might be possible in theory, but performance would fall off a cliff from going over the PCIe bus to get to system DRAM instead of using the onboard DRAM over a very fast / wide GDDR5 bus. So there's good reason for them to not make this possible because it's generally not going to be useful. Managing which data is in GPU memory when is something we're probably stuck with if we want good performance. That's one of the major reasons graphics cards have their own RAM in the first place. – Peter Cordes Apr 26 '20 at 23:21
  • 1
    @SuperCiocia Every decent library will be able to swap in and out the necessary block of memory while your code is running on other blocks. GPU programs usually have very high locality, so this is easy. I don't know how TensorFlow works but I would assume it'd be able to handle theoretically infinite workloads by loading swapping data from disk -> ram -> vram automatically and in such a way to hide all the latency of the swapping. – Margaret Bloom Apr 27 '20 at 06:56
  • @PeterCordes I should have qualified "The GPU itself cannot access CPU RAM" as "without some kind of CPU interaction". Even the old PCI Busmaster DMA interface required some "yeah sure, use that block of memory over there" and memory protection from the CPU/MMU. Otherwise anything running on the GPU could escape its cage and steal passwords at will. Perhaps I should have said that "the GPU does not have direct *physical* access to RAM" – Mokubai Apr 27 '20 at 07:12
  • 1
    @Mokubai: I think you're talking about the [IOMMU](https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit). That means the CPU can set up the mappings once so PCIe devices can access memory without any x86 instructions running on any CPU core. (The physical DRAM controllers are built-in to the CPU so of course it has to go through "the CPU", but just the system agent and so on, not any of the cores.) I think in older systems without an IOMMU, bus-master devices *did* have raw access to physical address space and https://en.wikipedia.org/wiki/DMA_attack was a thing. – Peter Cordes Apr 27 '20 at 07:19
11

As far as I know, you can share the host's RAM as long as it is page-locked (pinned) memory. In that case, data transfer will be much faster because you don't need to explicitly transfer data; you just need to make sure that you synchronize your work (with cudaDeviceSynchronize, for instance, if using CUDA).

Now, for this question:

I am just wondering, if the total memory (all the variables and all the arrays) in my kernel hits 6 GB of the GPU RAM, can I somehow use the CPU’s one?

I don't know if there is a way to "extend" the GPU memory. I don't think the GPU can use pinned memory that is bigger than its own, but I am not certain. What I think you could do in this case is to work in batches. Can your work be distributed so that you only work on 6 GB at a time, save the result, and work on another 6 GB? In that case, working in batches might be a solution.

For example, you could implement a simple batching scheme like this (the kernel and the populate/save helpers are just placeholders for your own code):

#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation
__global__ void kernel(float *data) {
    data[threadIdx.x] *= 2.0f;
}

// Placeholder helpers standing in for whatever fills and stores each batch in host RAM
void populate_data(float *p) { for (int i = 0; i < 128; ++i) p[i] = (float)i; }
void save_data(const float *p) { /* copy the results elsewhere in RAM */ }

int main() {

    float *hst_ptr = nullptr;
    float *dev_ptr = nullptr;
    size_t ns = 128;  // 128 elements in this example
    size_t data_size = ns * sizeof(*hst_ptr);

    // Pinned (page-locked) host allocation, mapped into the device address space
    cudaHostAlloc((void**)&hst_ptr, data_size, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&dev_ptr, hst_ptr, 0);

    // say that we want to work on 4 batches of 128 elements
    for (size_t cnt = 0; cnt < 4; ++cnt) {
        populate_data(hst_ptr);      // read from another array in RAM
        kernel<<<1, ns>>>(dev_ptr);  // the kernel reads/writes the pinned RAM over PCIe
        cudaDeviceSynchronize();     // wait for the kernel before touching hst_ptr again
        save_data(hst_ptr);          // write to another array in RAM
    }

    cudaFreeHost(hst_ptr);
    return 0;
}
  • 2
    This answer is wrong. Using pinned memory you do need to explicitly transfer the data. It's faster because you are telling the OS that it cannot move the memory around for optimization reasons, that it is "pinned" in place; therefore memory transfers are much faster, as they do not need to query the CPU/OS to know if the memory is still in the same place. You may be thinking of managed memory in CUDA, which you do not need to explicitly move, but that does not make it faster. And yes, you can pin the entire CPU memory, regardless of its size. In your code, you show managed, not pinned memory – Ander Biguri Apr 27 '20 at 12:09
  • @AnderBiguri pinned memory is accessible from the device as it is locked in place, so data does not need to be transferred explicitly, but work must be synchronized. Regarding the managed memory, are you talking about [Unified Memory](https://devblogs.nvidia.com/unified-memory-cuda-beginners/)? In that case, memory does not need to be pinned, and it is allocated with `cudaMallocManaged` instead. The usage is similar, though. Could you explain a little bit more? – Armando Herrera Apr 27 '20 at 18:59
  • @AnderBiguri I am not sure if the OS will let you pin the *entire* cpu memory, but a huge chunk of it, maybe. What I said I was not sure about is whether the GPU can access all the memory pinned by the host if it is bigger than its own, not whether the host could pin all its memory. Maybe I was not too clear – Armando Herrera Apr 27 '20 at 19:01
  • I work with this technology and I have pinned 250 GB of RAM just this week, out of 256 GB. Apologies, I made a mistake and mixed up the Managed and Pinned memory call functions, and I thought you were using `cudaMallocManaged` in your code. If instead of using `cudaHostGetDevicePointer` you use `cudaMemcpyAsync` safely, you can use all the CPU RAM. Apologies as I misunderstood some things in your code for managed/unified memory – Ander Biguri Apr 27 '20 at 21:34
  • 1
    @AnderBiguri I didn't know we could use `cudaMemcpyAsync` to do that. I'll have to give it a try, thanks for letting me know :) – Armando Herrera Apr 28 '20 at 14:39
  • So what is correct now, does pinned memory need to be copied or not? Regarding the given link, it reads to me like it needs to: `[...]copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below.` `Data transfers using host pinned memory use the same cudaMemcpy() syntax as transfers with pageable memory.` https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/ – Thorsten Schöning Jun 08 '22 at 07:48
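
For reference, here is a minimal sketch of the explicit-copy approach discussed in the comments above: a pinned host buffer staged into a separate VRAM buffer with cudaMemcpyAsync on a stream (the kernel is a placeholder, error checking omitted):

#include <cuda_runtime.h>

// Placeholder kernel: doubles every element of a batch
__global__ void kernel(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const size_t ns = 1 << 20;                // one batch = 1M floats
    const size_t bytes = ns * sizeof(float);

    float *hst = nullptr, *dev = nullptr;
    cudaHostAlloc((void**)&hst, bytes, cudaHostAllocDefault);  // pinned system RAM
    cudaMalloc((void**)&dev, bytes);                           // staging buffer in VRAM

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int batch = 0; batch < 4; ++batch) {
        // ... fill hst[] with the next batch here ...
        cudaMemcpyAsync(dev, hst, bytes, cudaMemcpyHostToDevice, stream);
        kernel<<<(ns + 255) / 256, 256, 0, stream>>>(dev, ns);
        cudaMemcpyAsync(hst, dev, bytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // results are now back in pinned RAM
        // ... consume hst[] before reusing it for the next batch ...
    }

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(hst);
    return 0;
}
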
4

Any GPU can use system RAM when running out of its own VRAM.

In a similar manner to running out of RAM on a system and paging all excess data to storage units (SSD/HDD), modern GPUs can and will pull textures or other data from system RAM. Texture data can be used from system RAM over the PCIe bus to make up for the lack of the faster VRAM.

Since system RAM is a few times slower than VRAM and has much higher latency, running out of VRAM translates into a performance loss, and performance will also be limited by the PCIe bandwidth.

So it's not a matter of whether it is possible or not, it's a matter of performance when doing it.

Also note that many integrated GPUs use system RAM and do not even have their own.

In the case of GPUs, the main factor in their performance is the software. Well-designed software will use the GPU near its FLOPS limits, while badly designed software will not. Computing and hashing software usually falls into the first category. The same goes for allocating VRAM.

Overmind
  • 9,924
  • 4
  • 25
  • 38
  • 5
    this depends on the GPU and the driver. Many GPU architectures simply can't do certain operations when reading from system memory. E.g. sampling textures. OS libraries may mitigate that by swapping in and out buffers/textures, but a blanket "All GPUs can use system memory" is missing nuance and context. – PeterT Apr 28 '20 at 11:59
  • The software represents the context, that I why I detailed that part also. Of course you can't do all operations but that does not mean the RAM will not be used. – Overmind Apr 28 '20 at 12:01
  • If it's a matter of software, then maybe you should say "Any GPU *can* use system RAM through software". – clemisch Apr 28 '20 at 12:02
  • Agreed, updated. – Overmind Apr 28 '20 at 12:03
0

This question is currently a top search result when using keywords like: Can games use RAM instead of VRAM?

So, I thought it was worth adding that a lot of issues related to game RAM vs. VRAM usage have changed with the Smart Access Memory technology. It is currently supported by AMD Zen 3 CPUs (like the Ryzen 5 5600X and Ryzen 7 5800X) and AMD 6000 series GPUs (like the AMD Radeon RX 6800), will be supported within the next few weeks by the Nvidia RTX 3000 series GPUs, and later on by 11th-gen Intel CPUs. Intel's version of the technology, and the name used even on some AMD motherboards, is Resizable BAR.

The technology essentially provides more VRAM access to the CPU, but it remains to be seen if things will eventually work the opposite way, too, where the GPU can access more of the RAM.

Rok
  • 237
  • 1
  • 3
  • 9