--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
NVIDIA CUDA
Linux Release Notes
Version 3.1
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

On some Linux releases, due to a GRUB bug in the handling of upper
memory and a default vmalloc region that is too small on 32-bit
systems, it may be necessary to pass these settings to the bootloader:

vmalloc=256MB, uppermem=524288

Example GRUB configuration:

title Red Hat Desktop (2.6.9-42.ELsmp)
        root (hd0,0)
        uppermem 524288
        kernel /vmlinuz-2.6.9-42.ELsmp ro root=LABEL=/1 rhgb quiet vmalloc=256MB pci=nommconf
        initrd /initrd-2.6.9-42.ELsmp.img

--------------------------------------------------------------------------------
Migrating to 3.1
--------------------------------------------------------------------------------

  o  Prior to the 3.1 release, nvcc treated __device__ functions as implicitly
     static. This behavior has changed with the 3.1 release. As a result, the
     host linker will report a multiply-defined-symbol error if:
     1) two identical __device__ functions are defined in two different
        compilation units, for example when function definitions are included
        through the #include <> mechanism; or
     2) a __device__ function and an identical host function are defined in
        two different compilation units.
     In both cases, declaring the __device__ function as static will make the
     compilation succeed.
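
For example, marking a shared __device__ function static restores the pre-3.1
behavior by giving it internal linkage. A minimal sketch (the header and
function names are hypothetical):

```cuda
// common.cuh -- included from multiple .cu files.
// Without 'static', CUDA 3.1 gives this function external linkage, so two
// compilation units including this header would trigger a
// multiply-defined-symbol error at host link time.
static __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}
```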

--------------------------------------------------------------------------------
New Features
--------------------------------------------------------------------------------

  Hardware Support
  o  See http://www.nvidia.com/object/cuda_learn_products.html

  o  On Fermi hardware, CUDA 3.1 supports up to 16 concurrent kernels.

New Toolkit Features
--------------------------

  o Device emulation has been removed.

  o cublasSpsv, cublasDpsv, cublasCpsv, cublasZpsv, and cublasSbsv,
    cublasDbsv, cublasCbsv, cublasZbsv have been enhanced to remove all
    previous size limitations on the input vector.

  o Improved interoperability between the CUDA Driver API and the CUDA
    Runtime API.  Includes support for sharing pointers, events, streams,
    arrays and graphics interop resources between the CUDA Driver API and the
    CUDA Runtime API.  Introduces CUDA Runtime API compatibility with the CUDA
    Driver context migration API (cuCtxPushCurrent and cuCtxPopCurrent).

  o Added the ability to call printf() from kernels.  This feature is supported 
    only on the Fermi architecture.
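
A minimal sketch of device-side printf() (requires compiling for Fermi,
e.g. nvcc -arch=sm_20; error checking omitted):

```cuda
#include <cstdio>

// Each thread writes one line to the device-side printf FIFO.
__global__ void hello_kernel(void)
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main(void)
{
    hello_kernel<<<1, 4>>>();
    // Synchronizing flushes the printf FIFO to the host's stdout.
    cudaThreadSynchronize();
    return 0;
}
```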

  o Added support for recursion in device functions.  This feature is supported
    only on the Fermi architecture.  Note that the default stack size limit is
    1K per thread, so kernels that recurse too deeply can run out of stack
    space.  Use cuCtxSetLimit() to change the default stack size.
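
A sketch of device-side recursion, with the per-thread stack raised through
the driver API (the kernel is illustrative; assumes a context created with
cuCtxCreate()):

```cuda
#include <cuda.h>

// Recursive device function (Fermi only; compile with -arch=sm_20).
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

__global__ void fact_kernel(int *out, int n)
{
    *out = factorial(n);
}

// Host side: raise the default 1K per-thread stack before launching
// kernels that recurse deeply.
void raise_stack_limit(void)
{
    // Assumes cuInit()/cuCtxCreate() have already established a context.
    cuCtxSetLimit(CU_LIMIT_STACK_SIZE, 4096);  // bytes per thread
}
```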

  o Added support for function pointers.  This feature is supported only on the
    Fermi architecture.  Function pointers can only be used inside a single 
    kernel; they cannot be passed to another kernel.
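
A sketch of device function pointers, taken and called entirely within one
kernel (Fermi only; compile with -arch=sm_20):

```cuda
__device__ float add1(float x) { return x + 1.0f; }
__device__ float dbl(float x)  { return x * 2.0f; }

typedef float (*unary_fn)(float);

__global__ void apply(float *data, int use_dbl)
{
    // The pointer is selected and dereferenced inside this kernel;
    // it is never passed to another kernel.
    unary_fn f = use_dbl ? dbl : add1;
    data[threadIdx.x] = f(data[threadIdx.x]);
}
```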

  o Specific GPUs can be made invisible with the CUDA_VISIBLE_DEVICES environment
    variable. Visible devices should be included as a comma-separated list in 
    terms of the system-wide list of devices. For example, to use only devices 0
    and 2 from the system-wide list of devices, set CUDA_VISIBLE_DEVICES equal to
    "0,2" before launching the application. The application will then enumerate
    these devices as device 0 and device 1.
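
For example, restricting an application to system-wide devices 0 and 2
(the application name is a placeholder):

```shell
# The application launched afterwards sees only devices 0 and 2,
# enumerated as device 0 and device 1.
export CUDA_VISIBLE_DEVICES="0,2"
echo "$CUDA_VISIBLE_DEVICES"
# ./my_cuda_app   <- any CUDA application launched here is restricted
```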

New API Features
-------------------------

  o In CUFFT 3.1, R2C and C2R transforms for power-of-2 sizes now see a
    speedup similar to that of their C2C equivalents. However, CUFFT's
    internal data layout differs from the one used by FFTW; by default CUFFT
    matches FFTW's data format, at some performance penalty. To enable faster
    transforms, use the cufftSetCompatibilityMode() API to disable
    FFTW-compatible behavior and enable the faster native mode.
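
A minimal sketch of opting in to the native layout (the transform size is
illustrative; error checking omitted):

```cuda
#include <cufft.h>

int main(void)
{
    cufftHandle plan;
    // 1D real-to-complex transform of length 1024.
    cufftPlan1d(&plan, 1024, CUFFT_R2C, 1);
    // Disable the default FFTW-compatible padding in favor of CUFFT's
    // faster native data layout.
    cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
    /* ... execute transforms with cufftExecR2C() ... */
    cufftDestroy(plan);
    return 0;
}
```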

  o CUBLAS now supports CUDA streams via the cublasSetKernelStream() API.
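
A sketch of routing CUBLAS kernels into a stream (legacy CUBLAS API;
error checking omitted):

```cuda
#include <cublas.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cublasInit();
    // Subsequent CUBLAS kernel launches execute in 'stream' instead of
    // the NULL stream.
    cublasSetKernelStream(stream);
    /* ... cublasSgemm(...) calls here run in 'stream' ... */
    cublasShutdown();

    cudaStreamDestroy(stream);
    return 0;
}
```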

  o Unformatted surface load/store (i.e. the ability to write to textures).  
    This feature is supported only on the Fermi architecture.

  o New functions cuCtxSetLimit() and cuCtxGetLimit() have been added to
    control GPU thread stack size and the size of the printf() FIFO queue.

  o Device-to-device transfers in a non-NULL stream with asynchronous
    cudaMemcpy calls may overlap with kernels. Runtime documentation has
    been updated to reflect this.

  o New device attributes report the PCI bus and device identifiers of a 
    particular GPU for better integration with system management tools.
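
A sketch of querying these identifiers with the driver API (the attribute
names CU_DEVICE_ATTRIBUTE_PCI_BUS_ID and CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID
are assumed from this release; error checking omitted):

```cuda
#include <cstdio>
#include <cuda.h>

int main(void)
{
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
        CUdevice dev;
        cuDeviceGet(&dev, i);
        int bus = 0, slot = 0;
        // These identifiers match what system tools such as lspci report.
        cuDeviceGetAttribute(&bus,  CU_DEVICE_ATTRIBUTE_PCI_BUS_ID,    dev);
        cuDeviceGetAttribute(&slot, CU_DEVICE_ATTRIBUTE_PCI_DEVICE_ID, dev);
        printf("device %d: PCI bus %02x, device %02x\n", i, bus, slot);
    }
    return 0;
}
```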

New Performance Improvements
----------------------------

  o Double-precision and C2R/R2C performance of CUFFT has been
    improved significantly for many transform sizes since the CUFFT 3.0
    release.

  o Double precision divide and reciprocal on the Fermi architecture have been
    optimized.
  
  o The performance of selected transcendental functions from the log, pow,
    erf, and gamma families has been improved.


--------------------------------------------------------------------------------
Bug Fixes
--------------------------------------------------------------------------------

  o The CUBLAS SGEMM, CGEMM, and small-matrix DGEMM performance
    regressions introduced in v3.0 have been fixed in v3.1.

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

o GPU enumeration order on multi-GPU systems is non-deterministic and
  may change with this or future releases. Users should make sure to
  enumerate all CUDA-capable GPUs in the system and select the most
  appropriate one(s) to use.

o Individual GPU program launches are limited to a run time
  of less than 5 seconds on a GPU with a display attached.
  Exceeding this time limit causes a launch failure reported
  through the CUDA driver or the CUDA runtime. GPUs without
  a display attached are not subject to the 5 second run time
  restriction. For this reason it is recommended that CUDA be
  run on a GPU that is NOT attached to an X display.

o In order to run CUDA applications, the CUDA module must be
  loaded and the entries in /dev created.  This may be achieved
  by initializing X Windows, or by creating a script to load the
  kernel module and create the entries.

  An example script (to be run at boot time):

  #!/bin/bash

  /sbin/modprobe nvidia

  if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
      mknod -m 666 /dev/nvidia$i c 195 $i
    done

    mknod -m 666 /dev/nvidiactl c 195 255
  else
    exit 1
  fi

o When compiling with GCC, special care must be taken for structs that
  contain 64-bit integers.  This is because GCC aligns long longs to a
  4-byte boundary by default, while NVCC aligns them to an 8-byte
  boundary.  Thus, when using GCC to compile a file containing such a
  struct or union, users must pass the -malign-double option to GCC.
  When using NVCC, this option is automatically passed to GCC.

o It is a known issue that cudaThreadExit() may not be called implicitly on
  host thread exit. Until this issue is resolved, developers should call
  cudaThreadExit() explicitly.

o For maximum performance when using multiple byte sizes to access the
  same data, coalesce adjacent loads and stores when possible rather
  than using a union or individual byte accesses. Accessing the data via
  a union may result in the compiler reserving extra memory for the object,
  and accessing the data as individual bytes may result in non-coalesced
  accesses. This will be improved in a future compiler release.

o OpenGL interoperability
  - OpenGL cannot access a buffer that is currently *mapped* to CUDA. If the
    buffer is registered but not mapped, OpenGL can perform any requested
    operations on it.
  - Deleting a buffer while it is mapped for CUDA results in undefined behavior.
  - Attempting to map or unmap while a different context is bound than was
    current during the buffer register operation will generally result in a
    program error and should thus be avoided.
  - Interoperability will use a software path on SLI configurations.
  - Interoperability will use a software path if monitors are attached to
    multiple GPUs and a single desktop spans more than one GPU
    (i.e. X11 Xinerama).

o OpenCL program binary formats may change in this or future releases. Users
  should create programs from source and should not rely on compatibility of
  generated binaries between different versions of the driver.

o For the beta release, printf() may silently fail if the format
  string and arguments combined within a single printf() call use more
  than 200 bytes (each argument always requires 8 bytes).  This will be
  fixed in the final 3.1 release, and at that point a single printf()
  call will work even when the format string and arguments require more
  than 200 bytes.

o CUFFT issues
  - The stability of the large-prime FFT transform (signals with a
    length that is prime and >64k samples) is extremely variable, giving
    single-precision accuracy in the range 0.005 to 0.025. In general,
    smaller signals experience greater accuracy.

  - If the (batch size * transform size * datatype size) exceeds
    512MB in a 1D transform, CUFFT kernels fail to launch.

  - Performance of CUFFT on the GT200 architecture has been reduced by
    9.4% for a transform size of 128, in single precision only.

  - Performance of single precision 360x360 2D CUFFT has been reduced by 10%.

  - C2C transforms of length 4, and C2R and R2C transforms of length 8, will
    produce incorrect results when the batch size is not a multiple of 64
    on Tesla or of 128 on Fermi.
   
o cuda-gdb 
  - Please refer to the "What's New in Version 3.1" and "Known Issues"
    sections in the CUDA_GDB_v3.1.pdf User Manual.

o 32/64-bit Device code mixing
  - While it is currently possible to simultaneously load both 32-bit and
    64-bit modules in a single context with the driver API, this feature may
    be removed in future CUDA releases.
   
--------------------------------------------------------------------------------
Open64 Sources
--------------------------------------------------------------------------------

The Open64 source files are controlled under terms of the GPL license.
Current and previously released versions are available via anonymous ftp
at download.nvidia.com in the CUDAOpen64 directory.

--------------------------------------------------------------------------------
cuda-gdb Sources
--------------------------------------------------------------------------------

The cuda-gdb source files are controlled under terms of the GPL license.
Source code for current and previously released versions is available via
anonymous ftp at download.nvidia.com in the CUDAOpen64 directory.


--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  06/2010 - Version 3.1
  04/2010 - Version 3.1 Beta
  02/2010 - Version 3.0
  10/2009 - Version 3.0 Beta
  07/2009 - Version 2.3
  06/2009 - Version 2.3 Beta
  05/2009 - Version 2.2
  03/2009 - Version 2.2 Beta
  11/2008 - Version 2.1 Beta
  06/2008 - Version 2.0
  11/2007 - Version 1.1
  06/2007 - Version 1.0
  06/2007 - Version 0.9
  02/2007 - Version 0.8 - Initial public Beta


--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

  For more information and help with CUDA, please visit
  http://www.nvidia.com/cuda


--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

  NVIDIA extends thanks to Professor Mike Giles of Oxford University for
  suggesting possible optimizations to the erfinv() and erfinvf() functions,
  which inspired the eventual optimizations for these functions in this release.
