Commit c4539d58 authored by Pfister, Martin

Add LUMI results for mistral7b-gptq

parent 355ad215
@@ -35,7 +35,7 @@ Finetune and evaluate [Mistral 7B Instruct v0.3](https://huggingface.co/mistrala

| VSC5 (Nvidia A40) | 5.1 samples/s | 11.0 samples/s | 10.3 GB | 18.4 GB |
| VSC5 (Nvidia A100) | 8.8 samples/s | 18.8 samples/s | 9.7 GB | 18.5 GB |
| Leonardo (Nvidia A100) | 10.0 samples/s | 21.0 samples/s | 9.8 GB | 18.5 GB |
| LUMI (AMD MI250X) | 5.2 samples/s | 5.5 samples/s | 9.0 GB | 17.0 GB |

### [mistral7b-bnb](mistral7b-bnb) multi GPU, multi node training
@@ -70,17 +70,23 @@ Finetune and evaluate [Mistral 7B Instruct v0.3](https://huggingface.co/mistrala

Finetune and evaluate [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) with 4-bit [GPTQ quantisation](https://arxiv.org/abs/2210.17323) on the [MedMCQA](https://medmcqa.github.io) dataset on multiple GPUs on a single node using the [distributed data parallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) approach.

| System | GPUs | Nodes | Training speed | Training speed per GPU | Training GPU memory (max.) |
| - | - | - | - | - | - |
| VSC5 (Nvidia A40) | 1 | 1 | 5.1 samples/s | 5.1 samples/s | 10.3 GB |
| | 2 | 1 | 8.1 samples/s | 4.5 samples/s | 11.6 GB |
| VSC5 (Nvidia A100) | 1 | 1 | 8.8 samples/s | 8.8 samples/s | 9.7 GB |
| | 2 | 1 | 12.7 samples/s | 6.4 samples/s | 11.0 GB |
| Leonardo (Nvidia A100) | 1 | 1 | 10.0 samples/s | 10.0 samples/s | 9.8 GB |
| | 2 | 1 | 16.2 samples/s | 8.1 samples/s | 11.9 GB |
| | 4 | 1 | 30.6 samples/s | 7.6 samples/s | 12.8 GB |
| LUMI (AMD MI250X) | 1 | 1 | 5.2 samples/s | 5.2 samples/s | 9.0 GB |
| | 2 | 1 | 9.2 samples/s | 4.6 samples/s (88%) | 9.8 GB |
| | 4 | 1 | 17.4 samples/s | 4.3 samples/s (83%) | 11.3 GB |
| | 8 | 1 | 32.7 samples/s | 4.1 samples/s (79%) | 12.1 GB |
| | 8 | 8 | 14.4 samples/s | 1.8 samples/s (35%) | 11.8 GB |
| | 16 | 2 | 45.5 samples/s | 2.8 samples/s (54%) | 11.6 GB |
| | 32 | 4 | 83.6 samples/s | 2.6 samples/s (50%) | 11.4 GB |
| | 64 | 8 | 149.3 samples/s | 2.3 samples/s (44%) | 12.4 GB |
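
The percentages in the LUMI rows are consistent with scaling efficiency relative to the single-GPU run (for example, 4.6 / 5.2 ≈ 88% on 2 GPUs). The numbers themselves come from the repository's `mistral7b_train.py`, which is not shown in this commit. Purely as an illustration of the approach described above (4-bit GPTQ base model, LoRA adapters, Hugging Face `Trainer` launched under `torchrun`, which supplies the DistributedDataParallel wrapping), a minimal sketch could look as follows; the checkpoint path, dataset id, LoRA hyperparameters and prompt format are assumptions, not the project's actual settings.

```python
# Minimal sketch, NOT the repository's mistral7b_train.py: LoRA finetuning of a
# 4-bit GPTQ-quantised Mistral 7B Instruct v0.3 on MedMCQA with the Hugging Face
# Trainer. Assumes transformers, peft, datasets, accelerate and the GPTQ backend
# (optimum + auto-gptq) are installed. When started via `torchrun --nproc_per_node N`,
# the Trainer wraps the model in DistributedDataParallel automatically.
import os
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, GPTQConfig,
                          Trainer, TrainingArguments)

BASE_MODEL = "path/to/Mistral-7B-Instruct-v0.3-GPTQ"  # placeholder for the 4-bit GPTQ checkpoint
local_rank = int(os.environ.get("LOCAL_RANK", 0))     # set by torchrun

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    # exllama kernels do not support backpropagation, so disable them for training
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
    device_map={"": local_rank},  # one full model replica per process (data parallel)
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))
model.print_trainable_parameters()

def to_features(example):
    # Turn a MedMCQA record into one instruction-style training string
    # (assumed prompt format; the real template may differ).
    text = (f"Question: {example['question']}\n"
            f"A) {example['opa']} B) {example['opb']} "
            f"C) {example['opc']} D) {example['opd']}\n"
            f"Answer: {'ABCD'[example['cop']]}")
    return tokenizer(text, truncation=True, max_length=512)

train_set = load_dataset("openlifescienceai/medmcqa", split="train").map(to_features)

trainer = Trainer(
    model=model,
    train_dataset=train_set,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8,
                           max_steps=100, learning_rate=1e-4, logging_steps=50),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("out")  # save the LoRA adapter for later evaluation
```

Under these assumptions such a script would be launched the same way as in the logs below, e.g. `srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py` for the eight-GPU single-node run.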
## Usage
+ date
Fri Aug 9 16:55:08 EEST 2024
+ hostname
nid005037
++ git rev-parse --show-toplevel
+ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
+ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
@@ -12,7 +12,7 @@ nid005034
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 46.0c 127.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif python mistral7b_train.py
@@ -28,37 +28,36 @@ trainable params: 41,943,040 || all params: 310,644,736 || trainable%: 13.5019
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
100%|██████████| 100/100 [02:33<00:00, 1.54s/it]
{'loss': 1.8183, 'grad_norm': 1.8688207864761353, 'learning_rate': 4.7e-05, 'epoch': 0.0}
{'loss': 0.9452, 'grad_norm': 1.3900845050811768, 'learning_rate': 9.7e-05, 'epoch': 0.0}
{'train_runtime': 153.8564, 'train_samples_per_second': 5.2, 'train_steps_per_second': 0.65, 'train_loss': 1.3817491149902343, 'epoch': 0.0}
Run time: 153.86 seconds
1 GPUs used.
Training speed: 5.2 samples/s (=5.2 samples/s/GPU)
Memory occupied on GPUs: 9.0 GB.
real 3m46.519s
user 0m0.018s
sys 0m0.015s
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif python mistral7b_test.py
srun: Job 7878123 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 7878123
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
CUDA extension not installed.
CUDA extension not installed.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
Map: 100%|██████████| 4183/4183 [00:00<00:00, 4287.53 examples/s]
0%| | 0/66 [00:00<?, ?it/s]/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
100%|██████████| 66/66 [12:45<00:00, 11.59s/it]
44.42% (1858 out of 4183) answers correct.
Run time: 765.11 seconds
Samples/second: 5.47
Memory occupied on GPUs: 17.0 GB.
real 13m12.596s
user 0m0.019s
sys 0m0.015s
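
The second half of the log above comes from `mistral7b_test.py`, which scores the finetuned model on the 4,183-question MedMCQA validation split (44.42% correct in this run). Again only as a hedged sketch rather than the repository's actual script, and reusing the assumptions from the training sketch above (placeholder checkpoint path, adapter saved to `out`, assumed prompt format), such an accuracy check could look like this:

```python
# Minimal sketch, NOT the repository's mistral7b_test.py: greedy one-token answers
# on the MedMCQA validation split, scored against the ground-truth option.
import torch
from datasets import load_dataset
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

BASE_MODEL = "path/to/Mistral-7B-Instruct-v0.3-GPTQ"  # placeholder, as in the training sketch
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=GPTQConfig(bits=4), device_map="auto")
model = PeftModel.from_pretrained(model, "out")  # LoRA adapter from the training run
model.eval()

val_set = load_dataset("openlifescienceai/medmcqa", split="validation")

correct = 0
for ex in val_set:
    prompt = (f"Question: {ex['question']}\n"
              f"A) {ex['opa']} B) {ex['opb']} C) {ex['opc']} D) {ex['opd']}\n"
              "Answer:")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    # decode only the newly generated token and compare it to the correct option letter
    prediction = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]).strip()
    correct += prediction.startswith("ABCD"[ex["cop"]])

print(f"{100 * correct / len(val_set):.2f}% ({correct} out of {len(val_set)}) answers correct.")
```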
+ date
Fri Aug 9 16:53:30 EEST 2024
+ hostname
nid005050
++ git rev-parse --show-toplevel
+ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
+ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 47.0c 90.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
1 49.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif torchrun --nproc_per_node 2 mistral7b_train.py
[2024-08-09 16:53:46,267] torch.distributed.run: [WARNING]
[2024-08-09 16:53:46,267] torch.distributed.run: [WARNING] *****************************************
[2024-08-09 16:53:46,267] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-09 16:53:46,267] torch.distributed.run: [WARNING] *****************************************
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
trainable params: 41,943,040 || all params: 310,644,736 || trainable%: 13.5019
max_steps is given, it will override any value given in num_train_epochs
0%| | 0/100 [00:00<?, ?it/s]/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
39%|███▉ {'loss': 1.8514, 'grad_norm': 1.9098231792449951, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.0}
86%|████████▌ | 86/100 [02:29<00:24, 1.76s{'loss': 0.909, 'grad_norm': 0.9930923581123352, 'learning_rate': 9.6e-05, 'epoch': 0.01}
100%|██{'train_runtime': 173.0809, 'train_samples_per_second': 9.244, 'train_steps_per_second': 0.578, 'train_loss': 1.3802322006225587, 'epoch': 0.01}
100%|██████████| 100/100 [02:53<00:00, 1.73s/it]
Run time: 173.08 seconds
2 GPUs used.
Training speed: 9.2 samples/s (=4.6 samples/s/GPU)
Memory occupied on GPUs: 9.8 + 6.0 GB.
real 5m10.680s
user 0m0.017s
sys 0m0.017s
+ date
Tue Aug 13 10:27:28 EEST 2024
+ hostname
nid005032
++ git rev-parse --show-toplevel
+ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
+ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 50.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
1 41.0c 86.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
2 43.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
3 43.0c 90.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif torchrun --nproc_per_node 4 mistral7b_train.py
[2024-08-13 10:27:41,396] torch.distributed.run: [WARNING]
[2024-08-13 10:27:41,396] torch.distributed.run: [WARNING] *****************************************
[2024-08-13 10:27:41,396] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-13 10:27:41,396] torch.distributed.run: [WARNING] *****************************************
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
trainable params: 41,943,040 || all params: 310,644,736 || trainable%: 13.5019
max_steps is given, it will override any value given in num_train_epochs
0%| | 0/100 [00:00<?, ?it/s]/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
39%|███▉ {'loss': 1.8326, 'grad_norm': 1.8880447149276733, 'learning_rate': 4.7e-05, 'epoch': 0.01}
86%|████████▌ | 86/100 [02:39<00:26, 1.90s{'loss': 0.9009, 'grad_norm': 0.8143380880355835, 'learning_rate': 9.7e-05, 'epoch': 0.02}
100%|██{'train_runtime': 184.0957, 'train_samples_per_second': 17.382, 'train_steps_per_second': 0.543, 'train_loss': 1.3667191314697265, 'epoch': 0.02}
100%|██████████| 100/100 [03:03<00:00, 1.84s/it]
Run time: 184.10 seconds
4 GPUs used.
Training speed: 17.4 samples/s (=4.3 samples/s/GPU)
Memory occupied on GPUs: 11.3 + 10.2 + 10.8 + 9.8 GB.
real 4m48.227s
user 0m0.017s
sys 0m0.016s
+ date
Fri Aug 9 16:50:00 EEST 2024
+ hostname
nid005145
++ git rev-parse --show-toplevel
+ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
+ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ export SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,/pfs,/scratch,/projappl,/project,/flash,/appl,
+ SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,/pfs,/scratch,/projappl,/project,/flash,/appl,
+ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 47.0c 90.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
1 48.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
2 37.0c 88.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
3 50.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
4 44.0c 87.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
5 45.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
6 41.0c 84.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
7 43.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames 'nid[005145-005152]'
+ export MASTER_ADDR=nid005145
+ MASTER_ADDR=nid005145
+ export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
+ NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
+ export NCCL_NET_GDR_LEVEL=PHB
+ NCCL_NET_GDR_LEVEL=PHB
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif torchrun --nnodes=8 --nproc_per_node=1 --rdzv_id=7934548 --rdzv_endpoint=nid005145:24998 --rdzv_backend=c10d mistral7b_train.py
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
Map: 100%|██████████| 182822/182822 [00:27<00:00, 6563.36 examples/s]
Map: 100%|██████████| 182822/182822 [00:27<00:00, 6562.03 examples/s]
Map: 100%|██████████| 182822/182822 [00:27<00:00, 6545.65 examples/s]
Map: 100%|██████████| 182822/182822 [00:27<00:00, 6590.39 examples/s]
Map: 100%|██████████| 182822/182822 [00:27<00:00, 6636.14 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6212.45 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6211.78 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6240.70 examples/s]
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
trainable params: 41,943,040 || all params: 310,644,736 || trainable%: 13.5019
Map: 100%|██████████| 182822/182822 [00:09<00:00, 19730.54 examples/s]
Map: 100%|██████████| 182822/182822 [00:09<00:00, 19766.39 examples/s]
Map: 100%|██████████| 182822/182822 [00:09<00:00, 19725.38 examples/s]
Map: 100%|██████████| 182822/182822 [00:09<00:00, 19787.15 examples/s]
Map: 100%|██████████| 182822/182822 [00:15<00:00, 11562.67 examples/s]
Map: 100%|██████████| 182822/182822 [00:16<00:00, 11322.34 examples/s]
Map: 100%|██████████| 182822/182822 [00:16<00:00, 11249.85 examples/s]
Map: 100%|██████████| 182822/182822 [00:15<00:00, 11608.60 examples/s]
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
0%| | 0/100 [00:00<?, ?it/s]/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
39%|███▉ {'loss': 1.8314, 'grad_norm': 2.0096566677093506, 'learning_rate': 4.7e-05, 'epoch': 0.02}
86%|████████▌ | 86/100 [06:20<01:00, 4.29s{'loss': 0.8769, 'grad_norm': 0.5495427250862122, 'learning_rate': 9.7e-05, 'epoch': 0.04}
100%|██{'train_runtime': 444.2417, 'train_samples_per_second': 14.407, 'train_steps_per_second': 0.225, 'train_loss': 1.354150924682617, 'epoch': 0.04}
100%|██████████| 100/100 [07:21<00:00, 4.42s/it]
Run time: 444.24 seconds
8 GPUs used.
Training speed: 14.4 samples/s (=1.8 samples/s/GPU)
Memory occupied on GPUs: 11.3 GB.
Memory occupied on GPUs: 10.0 GB.
Memory occupied on GPUs: 11.0 GB.
Memory occupied on GPUs: 11.3 GB.
Memory occupied on GPUs: 10.9 GB.
Memory occupied on GPUs: 11.4 GB.
Memory occupied on GPUs: 11.8 GB.
Memory occupied on GPUs: 9.2 GB.
real 9m21.027s
user 0m0.017s
sys 0m0.068s
+ date
Fri Aug 9 16:49:57 EEST 2024
+ hostname
nid005630
++ git rev-parse --show-toplevel
+ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
+ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
+ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 48.0c 94.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
1 49.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
2 42.0c 87.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
3 41.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
4 40.0c 86.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
5 48.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
6 37.0c 87.0W 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
7 46.0c N/A 800Mhz 1600Mhz 0% manual 0.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
+ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif torchrun --nproc_per_node 8 mistral7b_train.py
[2024-08-09 16:50:11,407] torch.distributed.run: [WARNING]
[2024-08-09 16:50:11,407] torch.distributed.run: [WARNING] *****************************************
[2024-08-09 16:50:11,407] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-08-09 16:50:11,407] torch.distributed.run: [WARNING] *****************************************
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
CUDA extension not installed.
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.57 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6294.48 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.52 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.91 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.55 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.78 examples/s]
Map: 100%|██████████| 182822/182822 [00:29<00:00, 6292.48 examples/s]
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['use_cuda_fp16', 'use_exllama', 'max_input_length', 'exllama_config', 'disable_exllama']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
warnings.warn(warning_msg)
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
trainable params: 41,943,040 || all params: 310,644,736 || trainable%: 13.5019
max_steps is given, it will override any value given in num_train_epochs
  0%|          | 0/100 [00:00<?, ?it/s]
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:264.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:647: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:320.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
(both warnings are repeated by the other worker processes; duplicates omitted)
{'loss': 1.8351, 'grad_norm': 1.872937560081482, 'learning_rate': 4.600000000000001e-05, 'epoch': 0.02}
{'loss': 0.8854, 'grad_norm': 0.573049783706665, 'learning_rate': 9.6e-05, 'epoch': 0.04}
{'train_runtime': 195.5189, 'train_samples_per_second': 32.733, 'train_steps_per_second': 0.511, 'train_loss': 1.3602521514892578, 'epoch': 0.04}
100%|██████████| 100/100 [03:15<00:00, 1.95s/it]
Run time: 195.52 seconds
8 GPUs used.
Training speed: 32.7 samples/s (=4.1 samples/s/GPU)
Memory occupied on GPUs: 11.6 + 10.2 + 11.2 + 9.3 + 10.4 + 11.7 + 12.1 + 11.3 GB.
real 6m19.382s
user 0m0.018s
sys 0m0.044s
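For context on how figures like these come about: the aggregate speed is just samples processed divided by runtime (32.733 samples/s × 195.5 s ≈ 6,400 samples over 100 optimizer steps, i.e. a global batch of 64 spread over 8 GPUs), and per-GPU peak memory can be queried from PyTorch. The helper below is a minimal, hypothetical sketch, not code taken from mistral7b_train.py; the repository may instead read the occupied memory via rocm-smi/nvidia-smi, so treat the memory figure as an approximation.

```python
import os
import torch
import torch.distributed as dist

def report_throughput_and_memory(num_samples: int, runtime_s: float) -> None:
    """Summarise training speed and peak GPU memory after a run.

    Hypothetical helper (not part of mistral7b_train.py). Assumes the job was
    launched with torchrun, which sets WORLD_SIZE, and that the ROCm build of
    PyTorch exposes the usual torch.cuda API (it does on LUMI).
    """
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    samples_per_s = num_samples / runtime_s
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3  # peak of the local device

    if dist.is_initialized():
        peaks = [None] * world_size
        dist.all_gather_object(peaks, peak_gb)  # collect every rank's peak on all ranks
        if dist.get_rank() == 0:
            print(f"Training speed: {samples_per_s:.1f} samples/s "
                  f"(={samples_per_s / world_size:.1f} samples/s/GPU)")
            print("Memory occupied on GPUs: "
                  + " + ".join(f"{p:.1f}" for p in peaks) + " GB.")
    else:
        print(f"Training speed: {samples_per_s:.1f} samples/s")
        print(f"Memory occupied on GPU: {peak_gb:.1f} GB.")
```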
Updated single-GPU job script (the diff adds `#SBATCH --nodes=1` and reorders the resource directives; the remainder of the file is collapsed in the diff view):

#!/bin/bash
#SBATCH --partition=small-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=7 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
...
#!/bin/bash
#SBATCH --partition=small-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=2 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=14 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
time srun singularity exec $CONTAINER torchrun --nproc_per_node 2 mistral7b_train.py
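The `torchrun --nproc_per_node 2` launch above assumes that mistral7b_train.py joins the process group torchrun sets up. The snippet below is a minimal sketch of that pattern under the assumption that plain PyTorch DDP is used; when the Hugging Face Trainer drives the training loop, it detects the torchrun environment variables and performs the equivalent wrapping itself, so these explicit calls may not appear in the actual script.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun starts one process per GPU and exports RANK, LOCAL_RANK and
# WORLD_SIZE for each of them; the script only has to join the process group.
dist.init_process_group(backend="nccl")  # RCCL on ROCm is addressed via the "nccl" backend
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model so the sketch stays self-contained; in the real script this
# would be the GPTQ-quantised Mistral model with its LoRA adapters.
model = torch.nn.Linear(16, 16).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# From here on, every rank trains on its own shard of the data and DDP
# all-reduces the gradients during the backward pass.
```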
#!/bin/bash
#SBATCH --partition=standard-g
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=56 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
export SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Set environment variables for communication between nodes:
export MASTER_PORT=24998
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
# Tell RCCL to use only Slingshot interfaces and GPU RDMA
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
# time srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py
time srun singularity exec $CONTAINER torchrun \
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=8 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
--rdzv_backend=c10d \
mistral7b_train.py
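With several nodes, each per-node torchrun instance contacts the c10d rendezvous at MASTER_ADDR:MASTER_PORT and together they form a single job (2 nodes × 8 GPUs = 16 ranks here), while NCCL_SOCKET_IFNAME and NCCL_NET_GDR_LEVEL steer RCCL onto the Slingshot interfaces with GPU RDMA. A quick, hypothetical way (not part of mistral7b_train.py) to verify that the rendezvous assembled all ranks before spending GPU hours:

```python
import os
import socket
import torch.distributed as dist

# Run early in the training script: every rank reports its placement, then all
# ranks synchronise so a missing worker shows up as a hang instead of a crash later.
dist.init_process_group(backend="nccl")
print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
      f"(local_rank {os.environ.get('LOCAL_RANK')}) on {socket.gethostname()}",
      flush=True)
dist.barrier()  # every rank must reach this point before anyone continues
if dist.get_rank() == 0:
    print("All ranks connected.", flush=True)
```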
#!/bin/bash
#SBATCH --partition=small-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=28 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
time srun singularity exec $CONTAINER torchrun --nproc_per_node 4 mistral7b_train.py
#!/bin/bash
#SBATCH --partition=standard-g
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=56 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
export SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Set environment variables for communication between nodes:
export MASTER_PORT=24998
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
# Tell RCCL to use only Slingshot interfaces and GPU RDMA
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
# time srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py
time srun singularity exec $CONTAINER torchrun \
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=8 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
--rdzv_backend=c10d \
mistral7b_train.py
#!/bin/bash
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=56 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
time srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py
#!/bin/bash
#SBATCH --partition=standard-g
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=7 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
export SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Set environment variables for communication between nodes:
export MASTER_PORT=24998
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
# Tell RCCL to use only Slingshot interfaces and GPU RDMA
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
# time srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py
time srun singularity exec $CONTAINER torchrun \
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=1 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
--rdzv_backend=c10d \
mistral7b_train.py
#!/bin/bash
#SBATCH --partition=standard-g
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8 # 1-8, but recommended to use multiples of 2, as each MI250X contains 2 compute dies
#SBATCH --mem-per-gpu=60G
#SBATCH --cpus-per-task=56 # 7 * number of GPUs
#SBATCH --time=1:00:00
# Include commands in output:
set -x
# Print current time and date:
date
# Print host name:
hostname
# Find container in top level directory of git repository:
CONTAINER=$(git rev-parse --show-toplevel)/lumi_container.sif
# Tell singularity to bind all relevant paths to container:
export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,$SINGULARITY_BIND
export SINGULARITY_BIND=/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,$SINGULARITY_BIND
# List available GPUs:
rocm-smi
# Set environment variables for communication between nodes:
export MASTER_PORT=24998
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
# Tell RCCL to use only Slingshot interfaces and GPU RDMA
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3
export NCCL_NET_GDR_LEVEL=PHB
# Run AI scripts:
# time srun singularity exec $CONTAINER python mistral7b_train.py
# time srun singularity exec $CONTAINER python mistral7b_test.py
# time srun singularity exec $CONTAINER torchrun --nproc_per_node 8 mistral7b_train.py
time srun singularity exec $CONTAINER torchrun \
--nnodes=$SLURM_JOB_NUM_NODES \
--nproc_per_node=8 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
--rdzv_backend=c10d \
mistral7b_train.py