Thread Parallelism and Locks

In addition to network parallelism, mental ray also supports shared memory parallelism through threads. Network parallelism is a form of distributed memory parallelism in which processes cooperate by exchanging messages; messages carry both data and synchronization. With shared memory, data can be exchanged easily because a process only needs to access the common memory, but a different mechanism is required for synchronization. This is usually done by locking: one process signals that it is waiting to access data, and another signals that it has finished working with the data, so that any other process may now access it.

By default, threads are used on shared memory multiprocessor machines. Threads, sometimes also called lightweight processes, behave like processes running on a common shared memory.

Since memory is shared between threads, two threads can write to the same memory at the same time, or one thread can write while another reads the same memory. Both cases can lead to surprising, unwanted results, so certain precautions have to be observed when using threads. Care has to be taken with any shared writable memory, such as heap allocations and global or static data, because any thread may potentially modify it. To prevent corrupting data (or reading corrupted data), locking must be used unless it is otherwise guaranteed that concurrent accesses cannot occur. The stack, however, is always safe: every thread has its own stack that is not shared with any other thread.

In addition to ensuring that no data is written while another thread accesses it, it is important to use only thread-safe libraries and calls. Calls to nonreentrant functions should be protected by locking. A function is called reentrant if it can be executed by multiple threads at the same time without adverse effects. (Reentrancy and thread safety are related, but the terms stem from different historical contexts, and reentrancy also implies the ability to recurse safely.) Details and examples are explained below.
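
For example, the standard C function strtok is not reentrant because it keeps the parse position in hidden static state, so concurrent callers corrupt each other. A shader could either serialize all calls with a lock or, preferably, use the reentrant POSIX variant strtok_r. A minimal sketch, assuming parse_lock was created with mi_init_lock and handle_token is a hypothetical helper:

#include <string.h>
#include "shader.h"

extern void handle_token(char *tok);     /* hypothetical helper */

static miLock parse_lock;                /* assumed created with mi_init_lock */

void parse_locked(char *line)            /* option 1: serialize strtok */
{
    char *tok;
    mi_lock(parse_lock);
    for (tok = strtok(line, " "); tok; tok = strtok(0, " "))
        handle_token(tok);
    mi_unlock(parse_lock);
}

void parse_reentrant(char *line)         /* option 2: no lock needed, all  */
{                                        /* state lives in the caller's ctx */
    char *tok, *ctx;
    for (tok = strtok_r(line, " ", &ctx); tok; tok = strtok_r(0, " ", &ctx))
        handle_token(tok);
}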

For example, static data on a shared memory multiprocessor can be modified by more than one processor at a time. Consider this test:

static miBoolean is_init = miFALSE;   /* shared by all threads */
...
if (!is_init) {                       /* two threads may both see miFALSE */
    is_init = miTRUE;
    initialize();                     /* ...and then both call initialize() */
}

This does not guarantee that initialize is called only once. The reason is that all threads share the is_init flag, so two threads may examine the flag simultaneously. It is possible that both will find that it has not been set and enter the if body. Next, both will set the flag to miTRUE, and then both will call the initialize function. This situation is called a race condition. The example is contrived because initialization and termination should be done with init and exit shaders as described in the next section, but this problem can occur with any global, static, or heap variable. Even incrementing global or static variables with "++" is not safe; the time window that leads to errors may be small, but that makes such mistakes all the more difficult to find. In general, all threads on a host share all data except local auto variables on the stack.
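
To see why even a single increment races, consider the three machine-level steps it typically expands to (a sketch; the exact instruction sequence is compiler dependent), and the locked version that makes it safe:

static int counter = 0;       /* shared by all threads */
static miLock counter_lock;   /* assumed created with mi_init_lock in _init */

/* counter++ expands to roughly load/add/store; two threads can interleave:
 *   thread A: load  counter  (reads 0)
 *   thread B: load  counter  (reads 0)
 *   thread A: store counter = 1
 *   thread B: store counter = 1      -- one increment is lost
 */
mi_lock(counter_lock);
counter++;                    /* now the three steps execute indivisibly */
mi_unlock(counter_lock);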

The behavior described above could also occur if more than one thread is used on a single processor, but by default mental ray does not create more threads than there are processors.

There are two methods for guarding against race conditions. One is to guarantee that only one thread executes certain code at a time; code surrounded by lock and unlock operations in this way is called a critical section. Code inside a critical section may access global or static data, or call any function that does so, as long as every such access is protected by the same lock. The lock used in the following example is assumed to have been created and initialized with a call to mi_init_lock before it is used here (see below for how locks are initialized). Here is an example of how a critical section may be used:

static miLock lock;   /* must be static or global; initialized with mi_init_lock */

mi_lock(lock);
if (!is_init) {       /* only one thread at a time can examine the flag */
    is_init = miTRUE;
    initialize();
}
mi_unlock(lock);

The other method is to use separate variables for each thread, so that no writable data is shared at all:

In versions prior to mental ray 3.x it was possible to use the mi_par_nthreads function to allocate an array that could be indexed with the thread number. In mental ray 3.x this is no longer possible because the number of threads has become dynamic and may change at any time, regardless of the number of threads set with the -threads command-line option; there is no longer a fixed limit on the number of threads in mental ray.

See the section on Persistent Shader Data Storage for more details on lock-free shader data storage.

Allocation is done in the initialization function of the shader or shader instance (the function that has the same name as the shader with _init appended). No locking is required because it is called only once. The termination function (same name with _exit appended) must release the data.
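
A minimal sketch of this pattern, assuming a shader named myshader, a hypothetical per-instance Cache structure, and mi_query's miQ_FUNC_USERPTR mode to reach the pointer slot reserved for the shader (details may differ by version):

typedef struct {float table[256];} Cache;    /* hypothetical per-instance data */

DLLEXPORT void myshader_init(
    miState     *state,
    void        *param,            /* 0: shader init; nonzero: instance init */
    miBoolean   *inst_init_req)
{
    Cache **cache;
    if (!param) {                  /* called once per shader */
        *inst_init_req = miTRUE;   /* request per-instance init calls */
        return;
    }
    /* called once per shader instance: no locking required */
    mi_query(miQ_FUNC_USERPTR, state, miNULLTAG, &cache);
    *cache = mi_mem_allocate(sizeof(Cache));
}

DLLEXPORT void myshader_exit(
    miState     *state,
    void        *param)
{
    Cache **cache;
    if (param) {                   /* instance exit: release the data */
        mi_query(miQ_FUNC_USERPTR, state, miNULLTAG, &cache);
        mi_mem_release(*cache);
    }
}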

Locks reduce multithreading efficiency and should be used only when absolutely necessary, or in shader initialization functions, which are called only rarely. The probability of one thread blocking because another has locked a section of code grows very quickly with the number of threads, and a blocked thread is not available to do useful work. Efficiency describes the degree of parallelism: if n threads increase the speed by a factor of m, the efficiency is m/n. If two threads have an efficiency of 0.95, then 32 threads have an efficiency of only 0.95^32 ≈ 0.19, so over 80% of all CPU cycles are wasted! Efficiency drops surprisingly quickly, so careful attention to locks is required. Note that the memory allocation and release functions (mi_mem_allocate et al.) contain a lock.

mental ray provides two locks for general use. The first is state→global_lock, a lock shared by all threads and all shaders; no two critical sections protected by this lock can execute simultaneously on this host. The second is the shader lock, a pointer to which can be obtained with the miQ_FUNC_LOCK mode of mi_query; it is local to the current shader and shared by all of its instances:

miLock *lock;
mi_query(miQ_FUNC_LOCK, state, miNULLTAG, &lock);   /* fetch the shader's lock */
mi_lock(*lock);
...
mi_unlock(*lock);

The lock is tied to the shader, not to a particular call with particular shader parameters. Every shader in mental ray, built-in or dynamically linked, has exactly one such lock. mental ray internally uses this lock and the global lock to guarantee that the init and exit functions of a shader do not execute concurrently; therefore, these two locks must not be used inside init and exit functions.
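
For example, a shader that must call into a library that is not thread-safe could wrap the call in the global lock (a sketch; write_to_library is a hypothetical non-thread-safe call):

mi_lock(state->global_lock);    /* serializes against all shaders on this host */
write_to_library(&result);      /* hypothetical non-thread-safe library call */
mi_unlock(state->global_lock);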

The relevant functions provided by the parallelism modules are:

mi_init_lock
miBoolean mi_init_lock(
    miLock * const  lock)

Before a lock can be used by one of the other locking functions, it must be initialized with this function. Note that the lock variable must be static or global. Shaders will normally use this function in their _init function. Shaders should not initialize (or delete) state→global_lock or the local shader lock; they are pre-initialized by mental ray. The function returns miFALSE if the operating system failed to create the lock.

mi_delete_lock
void mi_delete_lock(
    miLock * const  lock)

Destroy a lock. This should be done when the lock is no longer needed. The code should first lock and immediately unlock the lock to make sure that no other thread is in, or waiting for, a critical section protected by it. Shaders will normally use this function in their _exit function. Do not delete the predefined locks.
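
Putting the two functions together, a user-defined lock typically follows this lifecycle (a sketch, assuming a shader named lockdemo):

static miLock my_lock;                    /* must be static or global */

DLLEXPORT void lockdemo_init(
    miState     *state,
    void        *param,                   /* 0 on shader (not instance) init */
    miBoolean   *inst_init_req)
{
    if (!param)
        mi_init_lock(&my_lock);           /* create the lock exactly once */
}

DLLEXPORT void lockdemo_exit(
    miState     *state,
    void        *param)
{
    if (!param) {
        mi_lock(my_lock);                 /* drain threads still using it */
        mi_unlock(my_lock);
        mi_delete_lock(&my_lock);         /* then destroy it */
    }
}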

mi_lock
void mi_lock(
    const miLock    lock)

Check if any other code holds the lock. If so, block; otherwise set the lock and proceed. This is done in a parallel-safe way, so only one critical section protected by a given lock can execute at a time. Note that locking the same lock twice in a row, without unlocking it in between, will block the thread forever, effectively freezing mental ray, because the second lock can never succeed.
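
The classic mistake looks like this (a sketch; these locks do not nest, so the second call can never succeed):

mi_lock(*lock);
...
mi_lock(*lock);     /* WRONG: blocks this thread forever */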

mi_unlock
void mi_unlock(
    const miLock    lock)

Release a lock. If another thread was blocked attempting to set the lock, it can proceed now. Locks and unlocks must always be paired, and the code between locking and unlocking must be as short and as fast as possible to avoid defeating parallelism. There is no fairness guarantee that ensures that the thread that has been blocked for the longest time is unblocked first.

mi_par_localvpu
miVpu mi_par_localvpu(void)
int   miTHREAD(miVpu vpu)
int   miHOST(miVpu vpu)
miVpu miVPU(int host, int thread)

Deprecated The term VPU stands for Virtual Processing Unit. All threads on the network have a unique VPU number. mi_par_localvpu returns the VPU number of the VPU this thread is running on. VPUs are a concatenation of the host number and the thread number, both numbered from 0 to the number of hosts or threads, respectively, minus 1. (Future versions of mental ray may use noncontiguous host numbers, but not noncontiguous thread numbers.)

The miTHREAD macro extracts a thread number from a VPU, and the miHOST macro extracts the host number from a VPU. Thread 0 is called the master thread; host 0 is called the master host. Thread 0 on host 0 is normally running the translator that controls the entire operation. The miVPU macro puts a host and thread number together to form a VPU number. The mi_par_localvpu function returns the VPU of the current thread on the local host.

In a shader the fastest way of finding the current thread number is the state→thread variable. It is the only thread identification still available; mental ray no longer has mental ray 2.x's fixed notion of VPUs and threads.
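
For example, a shader can tag a diagnostic message with the thread that produced it (a sketch):

mi_info("shading in thread %d", state->thread);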

mi_par_nthreads
int mi_par_nthreads(void)

Deprecated Returns the number of threads on the local host, normally 1 on a single-processor system. mental ray no longer has the notion of a fixed number of threads and therefore no longer supports this function. For backwards compatibility the function still exists, but it always returns 65. This is obviously not something a shader should rely on, but it may keep some older shaders limping along until they are ported. Do not use!

mi_par_aborted
int mi_par_aborted(void)

Return a nonzero value if mental ray has been aborted; in that case the shader should stop what it is doing, clean up, and return. This is mainly of interest in output shaders because they can run for a long time. It allows the user to press an abort button, which causes calls to mi_par_aborted to return nonzero, and have the shader return as soon as possible. For example, the shader might call this function in its scanline loop (though not for every pixel, to avoid slowing it down) and skip the remaining lines. The shader must still clean up, for example by releasing memory that it has allocated.
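
A sketch of such a loop in an output shader, assuming a hypothetical process_scanline helper (the abort test runs only every 16 lines to keep its cost negligible):

int y;
for (y = 0; y < state->camera->y_resolution; y++) {
    if ((y & 15) == 0 && mi_par_aborted())
        break;                   /* user abort: skip the remaining lines */
    process_scanline(y);         /* hypothetical per-scanline work */
}
/* clean up allocated resources whether or not the loop was aborted */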

mi_job_memory_limit
long mi_job_memory_limit(
    long            limit,
    long            vlimit)

Deprecated Set the size of mental ray's memory cache to limit bytes, and the virtual address space cache limit for memory-mapped textures to vlimit bytes. The two limits are separate and independent. mental ray's memory manager attempts to stay within these limits by evicting data that can be restored later if necessary. A good vlimit is a quarter of the address space of the machine (the default is 1 gigabyte). A good limit is one half of the machine's physical memory (the default is 512 megabytes in 32-bit versions of mental ray 3.1 and later). A value of 0 means "unlimited", and a value of -1 means "don't change". If the caches are too small, mental ray will begin to thrash or fail to maintain the limits; this is evident from the garbage collection messages printed if the verbosity level ( -verbose command-line option) is at least 4.

mental ray no longer supports this function. It was replaced with mi_mem_memory_limit.

mi_mem_memory_limit
miUlong mi_mem_memory_limit(
    miUlong         limit)

Limit the total heap size used by mental ray allocations to limit bytes. Unlike the older mi_job_memory_limit, this limits not only the size of the geometry cache but all memory allocations performed by mental ray. It can no longer happen that some large memory user crowds out the geometry cache; everything is now subject to flushing and reuse.

Operating system memory usage is not included, however: the program itself, any shared libraries (DSO or DLL), thread stacks, and reserved kernel memory. Likewise, if mental ray is linked into another application, memory allocations performed by that application are not included. In practice this leaves about 1.3 GB on 32-bit systems, except Windows Professional Server (not regular Windows), which allows 3 GB, and Linux, which allows almost 4 GB. The memory limit set with this function should stay comfortably below this limit, by a hundred megabytes or more, depending on what else is going on in the system. It should also stay well below the amount of physical RAM minus OS usage, even on 64-bit systems, to avoid kernel swapping.
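
For example, an application that links mental ray might cap the heap well below the platform limits discussed above (a sketch; the chosen value is arbitrary):

/* limit all mental ray heap allocations to 1.5 GB */
mi_mem_memory_limit((miUlong)1536 * 1024 * 1024);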

Copyright © 1986, 2015 NVIDIA ARC GmbH. All rights reserved.