From 739cd89a4440f66ed39a16aa968e079826c3d44f Mon Sep 17 00:00:00 2001
From: Florian Goldenberg <florian.goldenberg@tuwien.ac.at>
Date: Fri, 10 Jan 2025 23:59:32 +0100
Subject: [PATCH] Added parallel jobs, some minor corrections

---
 docs/running_jobs/parjobs.md    | 338 ++++++++++++++++++++++++++++++++
 docs/software/index.md          |   4 -
 docs/software/packages/index.md |   4 +-
 docs/support/support.md         |   4 +-
 mkdocs.yml                      |   1 +
 5 files changed, 344 insertions(+), 7 deletions(-)
 create mode 100644 docs/running_jobs/parjobs.md

diff --git a/docs/running_jobs/parjobs.md b/docs/running_jobs/parjobs.md
new file mode 100644
index 0000000..60c0ab9
--- /dev/null
+++ b/docs/running_jobs/parjobs.md
@@ -0,0 +1,338 @@
+# Parallel jobs and pinning
+
+Various parallelisation tools, such as OpenMP, Open MPI, or Intel MPI,
+can be employed for pinning (assigning processes and threads to
+cores and nodes) to enhance the speed and efficiency of parallelised
+programs.
+
+## Need for processor affinity and/or pinning
+
+To improve job performance, users can adjust processor affinity and/or
+pinning. The default cluster settings are suitable for most jobs, but in
+specific cases consider the following:
+
+1.  <span style="color:blue">Minimising communication paths:</span>
+    communication between cores of the same socket is fastest; it slows
+    down in this order: between sockets, then between nodes.
+2.  <span style="color:blue">Data locality effects:</span> the cores of one
+    node do not have uniform memory access (NUMA).
+
+To optimise program parallelisation, i.e. the allocation of
+multiple processes and threads to nodes and cores for enhanced
+performance, it is essential to understand the cluster being used and its
+configuration. This includes details such as the maximum number
+of processes/threads allowed on a node, which is constrained by the number of
+cores available. Additionally, it is crucial to grasp the distinction
+between threads and processes: threads are generally faster due to
+decoupled resource control, but are therefore limited to a single
+node (utilising shared memory).
+
+### Compute nodes and cores
+
+#### VSC 4
+
+Physical cores for processes/threads of Socket 0 are numbered from 0 to
+23. Physical cores for processes/threads of Socket 1 are numbered from
+24 to 47. Virtual cores for processes/threads are numbered from 48 to
+95.
+
+#### VSC 5 (Cascade Lake)
+
+Physical cores for processes/threads of Socket 0 are numbered from 0 to
+47. Physical cores for processes/threads of Socket 1 are numbered from
+48 to 95. Virtual cores for processes/threads are numbered from 96 to 191.
+
+#### VSC 5 (Zen)
+
+Physical cores for processes/threads of Socket 0 are numbered from 0 to
+63. Physical cores for processes/threads of Socket 1 are numbered from
+64 to 127. Virtual cores for processes/threads are numbered from 128 to 255.
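+
+To check the numbering on a particular node yourself, the hardware
+topology can be queried directly on the node, for example (a minimal
+sketch; the exact output depends on the node type):
+
+    # Show sockets, cores per socket and the NUMA layout of the current node
+    lscpu | grep -E 'Socket|Core|NUMA'
+
+    # Alternatively, list the NUMA nodes and the CPU ids belonging to them
+    numactl --hardware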
+
+**Environment variables:** For MPI, OpenMP, and hybrid jobs, the
+affinity-related environment variables, such as proclist (OpenMP) or
+I_MPI_PIN_PROCESSOR_LIST (Intel MPI), must be set according to the
+cluster configuration and the number of processes, threads, and nodes
+required by your parallelised application.
+
+## Types of parallel jobs
+
+### 1. Pure OpenMP jobs
+
+**OpenMP threads** are pinned via the <span style="color:red">affinity</span> environment variables of the
+compiler runtime (e.g. KMP_AFFINITY for Intel, GOMP_CPU_AFFINITY for GCC).
+The default pin processor list is given by <span style="color:red">0,1,...,n</span> (n is the last core of a compute node).
+
+#### Compiler examples supporting OpenMP
+
+**ICC Example**
+
+    module load intel-oneapi-compilers/2023.1.0-gcc9.5.0-j52vcxx
+    icc -fopenmp -o myprogram myprogram.c
+
+**GCC Example**
+
+    module load --auto gcc/12.2.0-gcc-9.5.0-aegzcbj 
+    gcc -fopenmp -o myprogram myprogram.c
+
+Note the flag -fopenmp, which is necessary to instruct the compiler to
+enable OpenMP functionality.
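+
+The compiler examples above and the job scripts below refer to
+myprogram.c, which is not included on this page. As a stand-in for
+testing the pinning settings, a minimal OpenMP example along the
+following lines can be used (a sketch only):
+
+    #define _GNU_SOURCE
+    #include <sched.h>   /* sched_getcpu(), a GNU extension */
+    #include <stdio.h>
+    #include <omp.h>
+
+    int main(void)
+    {
+        /* Each thread reports its id and the core it is currently running on. */
+        #pragma omp parallel
+        {
+            printf("thread %d of %d running on core %d\n",
+                   omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
+        }
+        return 0;
+    }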
+
+**Example Job Script for ICC**
+
+    #!/bin/bash
+    #SBATCH -J pureOMP
+    #SBATCH -N 1           
+
+    export OMP_NUM_THREADS=4
+    export KMP_AFFINITY="verbose,granularity=fine,proclist=[0,4,8,12]"
+
+    ./myprogram
+
+**Example Job Script for GCC**
+
+    #!/bin/bash
+    #SBATCH -J pureOMP
+    #SBATCH -N 1           
+
+    export OMP_NUM_THREADS=4
+    export GOMP_CPU_AFFINITY="8-11"
+    
+    ./myprogram
+
+**OMP_PROC_BIND and OMP_PLACES**
+
+    # Example: Places threads on cores in a round-robin fashion
+    export OMP_PLACES="{0:1},{1:1},{2:1},{3:1}"
+
+    # Specify whether threads may be moved between CPUs using OMP_PROC_BIND
+    # "true" indicates that threads should be bound to the specified places
+    export OMP_PROC_BIND=true
+
+OMP_PLACES specifies the placement of threads. In this example,
+each thread is assigned to a specific core in a round-robin fashion.
+OMP_PROC_BIND is set to "true" to indicate that threads should be bound
+to the specified places. The rest of your batch script remains the
+same. Note that you might need to adjust the OMP_PLACES configuration
+based on your specific hardware architecture and the desired thread
+placement strategy.
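+
+Instead of listing cores explicitly, recent OpenMP runtimes also accept
+abstract place names; a roughly equivalent, more portable variant
+(assuming four threads, as above) is:
+
+    # One place per physical core; keep threads close to the master thread
+    export OMP_PLACES=cores
+    export OMP_PROC_BIND=close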
+
+Make sure to check the OpenMP documentation and your system's specifics
+to fine-tune these parameters for optimal performance. Additionally,
+monitor the performance of your parallelised code to ensure that the
+chosen thread placement strategy meets your performance goals.
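+
+To verify where the threads actually end up, the OpenMP runtime can be
+asked to report its settings and the resulting binding at program
+start-up (supported by sufficiently recent GCC and Intel runtimes):
+
+    # Print the OpenMP environment, including place and binding settings
+    export OMP_DISPLAY_ENV=verbose
+
+    # OpenMP 5.0: additionally report the affinity of each thread
+    export OMP_DISPLAY_AFFINITY=true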
+
+### 2. Pure MPI jobs
+
+**MPI processes**: In a distributed computing environment, processes
+often need to communicate with each other across multiple cores and
+**nodes**. This communication is facilitated by the Message Passing
+Interface (MPI), a standardised and widely used communication
+protocol in high-performance computing. Unlike threads, processes are
+not decoupled from resource control. There are several MPI
+implementations, including Open MPI, Intel MPI, and MPICH, each offering
+different options for process pinning, i.e. the assignment of
+processes to specific processor cores.
+
+To choose the optimal MPI implementation for your parallelised
+application, follow these steps:
+
+- Understand Your Application's Requirements: Consider scalability,
+  compatibility, and any unique features your application needs.
+- Explore Available MPI Implementations: Investigate popular MPI
+  implementations like OpenMPI, Intel MPI, and MPICH. Explore their
+  features, advantages, and limitations through their official
+  documentation.
+- Check Compatibility: Ensure the selected MPI implementation is
+  compatible with the system architecture and meets any specific
+  requirements. Seek guidance from system administrators or relevant
+  documentation.
+- Experiment with Basic Commands: After selecting an MPI implementation,
+  experiment with basic commands like mpirun, mpiexec, and srun.
+- Seek Assistance: Don't hesitate to seek help if you have questions or
+  face challenges.
+- Additional Resources: Explore MPI tutorials at: [VSC training
+  events](https://vsc.ac.at/research/vsc-research-center/vsc-school-seminar/)
+
+The default pin processor list is given by
+<span style="color:red">0,1,...,n</span> (n is the last core of a compute node).
+
+### Examples
+
+#### Compatibility and Compilers
+
+Various MPI compiler wrappers exist, catering to different
+programming languages such as C, C++, and Fortran, for the common MPI
+implementations Intel MPI, Open MPI, and MPICH:
+
+**OpenMPI**
+
+- C: mpicc
+- C++: mpic++ or mpiCC
+- Fortran: mpifort or mpif77 for Fortran 77, mpif90 for Fortran 90
+
+**Intel MPI**
+
+- C: mpiicc
+- C++: mpiicpc
+- Fortran: mpiifort
+
+**MPICH**
+
+- C: mpicc
+- C++: mpic++
+- Fortran: mpifort
+
+Use the 'module avail' command to investigate available MPI versions by
+specifying your preferred MPI module, such as 'module avail openmpi'.
+Similarly, you can check for available compiler versions compatible with
+MPI by following 'module avail' with your preferred compiler,
+which provides a comprehensive overview of the available options.
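+
+For example, the following commands list the installed builds of each
+MPI implementation mentioned above:
+
+    # List the installed builds of each MPI implementation
+    module avail openmpi
+    module avail intel-mpi
+    module avail mpich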
+
+Following are a few Slurm script examples written for C applications
+with various compiler versions. These examples provide a glimpse into
+writing batch scripts and serve as a practical guide for creating your
+own. Note that the environment variables differ between MPI
+implementations (Open MPI, Intel MPI, and MPICH), and the Slurm scripts
+also vary between srun and mpiexec. Adjust your Slurm scripts
+depending on whether you are using srun or mpiexec (mpirun) for process
+launching.
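+
+The C source files referenced below (openmpi.c, intelmpi.c, mpich.c) are
+not included on this page; a minimal MPI program along the following
+lines can serve as a stand-in for checking process placement (a sketch
+only):
+
+    #define _GNU_SOURCE
+    #include <sched.h>   /* sched_getcpu(), a GNU extension */
+    #include <stdio.h>
+    #include <mpi.h>
+
+    int main(int argc, char **argv)
+    {
+        int rank, size, len;
+        char node[MPI_MAX_PROCESSOR_NAME];
+
+        MPI_Init(&argc, &argv);
+        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+        MPI_Comm_size(MPI_COMM_WORLD, &size);
+        MPI_Get_processor_name(node, &len);
+
+        /* Each rank reports the node and the core it is running on. */
+        printf("rank %d of %d on %s, core %d\n", rank, size, node, sched_getcpu());
+
+        MPI_Finalize();
+        return 0;
+    }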
+
+#### OPENMPI
+
+**srun**
+
+    #!/bin/bash
+    #
+    #SBATCH -N 2
+    #SBATCH --ntasks-per-node 4
+    #SBATCH --ntasks-per-core 1
+
+    NUMBER_OF_MPI_PROCESSES=8
+
+    module purge
+    module load openmpi/4.1.4-gcc-8.5.0-p6nh7mw 
+
+    mpicc  -o openmpi openmpi.c
+    srun -n $NUMBER_OF_MPI_PROCESSES --mpi=pmi2 --cpu_bind=map_cpu:0,4,8,12 ./openmpi
+
+Note: The *--mpi=pmi2* flag is a command-line argument commonly used
+when launching MPI (Message Passing Interface) applications with srun. It
+specifies the process-management interface to be used. Here, pmi2 refers
+to the Process Management Interface version 2 (PMI-2), which provides an
+interface for managing processes in parallel applications. PMI-2 is
+commonly used in conjunction with resource managers such as SLURM
+(Simple Linux Utility for Resource Management) to run MPI applications
+on a cluster.
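+
+Which PMI types the installed Slurm version supports can be listed
+directly:
+
+    srun --mpi=list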
+
+**mpiexec**
+
+    #!/bin/bash
+    #
+    #SBATCH -N 2
+    #SBATCH --ntasks-per-node 4
+    #SBATCH --ntasks-per-core 1
+
+    NUMBER_OF_MPI_PROCESSES=8
+    export OMPI_MCA_hwloc_base_binding_policy=core
+    export OMPI_MCA_hwloc_base_cpu_set=0,6,16,64
+
+    module purge
+    module load openmpi/4.1.2-gcc-9.5.0-hieglt7
+
+    mpicc  -o openmpi openmpi.c
+    mpiexec -n $NUMBER_OF_MPI_PROCESSES ./openmpi
+
+#### INTELMPI
+
+**srun**
+
+    #!/bin/bash
+    #
+    #SBATCH -M vsc5
+    #SBATCH -N 2
+    #SBATCH --ntasks-per-node 4
+    #SBATCH --ntasks-per-core 1
+
+    export I_MPI_DEBUG=4
+    NUMBER_OF_MPI_PROCESSES=8
+    export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64
+
+    module purge
+    module load intel/19.1.3
+    module load intel-mpi/2021.5.0
+
+    mpiicc  -o intelmpi intelmpi.c
+    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./intelmpi
+
+**mpiexec**
+
+    #!/bin/bash
+    #
+    #SBATCH -N 2
+    #SBATCH --ntasks-per-node 4
+    #SBATCH --ntasks-per-core 1
+
+
+    export I_MPI_DEBUG=4
+    NUMBER_OF_MPI_PROCESSES=8
+    export I_MPI_PIN_PROCESSOR_LIST=0,6,16,64
+
+    module purge
+    module load intel/19.1.3
+    module load intel-mpi/2021.5.0
+
+    mpiicc  -o intelmpi intelmpi.c
+    mpiexec -n $NUMBER_OF_MPI_PROCESSES ./intelmpi
+
+#### MPICH
+
+**srun**
+
+    #!/bin/bash
+    #
+    #SBATCH -N 2
+    #SBATCH --ntasks-per-node 4
+    #SBATCH --ntasks-per-core 1
+
+    NUMBER_OF_MPI_PROCESSES=8
+
+    module purge
+    module load --auto mpich/4.0.2-gcc-12.2.0-vdvlylu
+
+    mpicc  -o mpich mpich.c
+    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=map_cpu:0,4,8,12 ./mpich
+
+Note: The flag --auto loads all module dependencies automatically.
+
+### 3. Hybrid jobs
+
+MPI (Message Passing Interface) is used for communication between
+processes across multiple nodes. Within a node it can be advantageous to
+run OpenMP threads instead, since they share memory and thus reduce
+explicit data exchange. The combination of both approaches (hybrid jobs)
+can therefore result in enhanced performance: threads are assigned to
+cores within one node, and communication between nodes is handled by MPI
+processes. This is achieved through CPU binding. Compared with
+exclusively using MPI processes, both across and within nodes, the
+hybrid use of MPI and OpenMP typically yields improved performance. As
+an illustration, the following configuration uses 3 nodes with one MPI
+process per node and 3 OpenMP threads per process:
+
+    #!/bin/bash
+    #
+    #SBATCH -J mapCPU
+    #SBATCH -N 3
+    #SBATCH -n 3
+    #SBATCH --ntasks-per-node=1
+    #SBATCH --cpus-per-task=3
+    #SBATCH --time=00:60:00
+
+    export I_MPI_DEBUG=1
+    NUMBER_OF_MPI_PROCESSES=3
+    export OMP_NUM_THREADS=3
+
+    module load intel/19.1.3
+    module load intel-mpi/2021.5.0
+    mpiicc -qopenmp -o myprogram myprogram.c
+
+    srun -n $NUMBER_OF_MPI_PROCESSES --cpu_bind=mask_cpu:0xf,0xf0,0xf00 ./myprogram
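+
+The hexadecimal masks passed to --cpu_bind=mask_cpu select the cores a
+task may use: 0xf corresponds to cores 0-3, 0xf0 to cores 4-7 and 0xf00
+to cores 8-11; the masks are assigned to the tasks on each node in
+order. The compiled myprogram.c is not part of this page; a minimal
+hybrid MPI/OpenMP source along the following lines can be used as a
+stand-in (a sketch only):
+
+    #define _GNU_SOURCE
+    #include <sched.h>   /* sched_getcpu(), a GNU extension */
+    #include <stdio.h>
+    #include <mpi.h>
+    #include <omp.h>
+
+    int main(int argc, char **argv)
+    {
+        int rank;
+
+        MPI_Init(&argc, &argv);
+        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+        /* Each OpenMP thread of each MPI rank reports the core it runs on. */
+        #pragma omp parallel
+        {
+            printf("rank %d, thread %d of %d, core %d\n",
+                   rank, omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
+        }
+
+        MPI_Finalize();
+        return 0;
+    }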
diff --git a/docs/software/index.md b/docs/software/index.md
index 4eed24b..469aec3 100644
--- a/docs/software/index.md
+++ b/docs/software/index.md
@@ -1,9 +1,5 @@
 # Overview
 
-[vasp]: ./packages/vasp.md
-
-
----
 On this page, you will find information about pre-installed software on VSC as well
 as guidance on ways to install additional software yourself.
 
diff --git a/docs/software/packages/index.md b/docs/software/packages/index.md
index a256c71..78c7eee 100644
--- a/docs/software/packages/index.md
+++ b/docs/software/packages/index.md
@@ -1,4 +1,6 @@
 #Installed software list
 
 [vasp]: ./vasp.md
-- [VASP] 
\ No newline at end of file
+- [VASP] 
+[ansys fluent]: Fluent/fluent.md
+- [Ansys Fluent]
\ No newline at end of file
diff --git a/docs/support/support.md b/docs/support/support.md
index 2cc4dea..a6e1df8 100644
--- a/docs/support/support.md
+++ b/docs/support/support.md
@@ -57,8 +57,8 @@ You may use English or German to write your request, but please note that we hav
 
 Please do not send support questions directly to VSC staff's personal email addresses. Emails sent to personal addresses might not get read promptly or get lost. Emails to the support address are automatically registered in our support system and available to all the support staff.
 
-#VSC user community
+## VSC user community
 
 We have established a way for our users to talk to each other and help or advise other users without havint to involve the VSC team. This is done via the open, federated and institution-independent chat tool Matrix.
 Our VSC-user space can be joined [here](https://matrix.tuwien.ac.at/#/room/#vsc:tuwien.ac.at)
-If you do not have an account yet, one can be created for free at [matrix.org](https://matrix.org/try-matrix/)
\ No newline at end of file
+If you do not have an account yet, one can be created for free at [matrix.org](https://matrix.org/try-matrix/)
diff --git a/mkdocs.yml b/mkdocs.yml
index a32c2ac..1525691 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -27,6 +27,7 @@ nav:
           - VSC-5 queues: running_jobs/vsc5_queues.md
       - Accounting: running_jobs/job_accounting.md
       - Monitoring jobs: running_jobs/monitoring_jobs.md
+      - Parallel jobs: running_jobs/parjobs.md
       - GPU jobs: running_jobs/gpus.md
       - Interactive jobs: running_jobs/interactive_jobs.md
       - GUI jobs: running_jobs/gui_jobs.md
-- 
GitLab