Commit 4c8e7d3f authored by Pfister, Martin

Add LEONARDO FSDP output files.

parent 7c2667b5
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 17:26:02 CEST 2024
+ hostname
lrdn3279.leonardo.local
+ nvidia-smi
Fri Aug 30 17:26:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 42C P0 61W / 464W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ conda run -n finetuning --no-capture-output python llama3.1-70b_train.py
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 30/30 [05:53<00:00, 11.80s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
100%|██████████| 100/100 [10:42<00:00, 6.43s/it]
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/peft/utils/other.py:619: UserWarning: Unable to fetch remote file due to the following error (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /meta-llama/Meta-Llama-3.1-70B-Instruct/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x146f620ac510>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 7c6ef0fe-4985-43dd-9e1e-c63edaadc650)') - silently ignoring the lookup for the file config.json in meta-llama/Meta-Llama-3.1-70B-Instruct.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/peft/utils/save_and_load.py:218: UserWarning: Could not find a config file in meta-llama/Meta-Llama-3.1-70B-Instruct - will assume that the vocabulary was not modified.
warnings.warn(
{'loss': 1.9947, 'grad_norm': 0.8367132544517517, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7486, 'grad_norm': 0.6377346515655518, 'learning_rate': 5e-05, 'epoch': 0.0}
{'train_runtime': 642.6727, 'train_samples_per_second': 1.245, 'train_steps_per_second': 0.156, 'train_loss': 1.3716077423095703, 'epoch': 0.0}
Run time: 642.67 seconds
1 GPUs used.
Training speed: 1.2 samples/s (=1.2 samples/s/GPU)
Memory occupied on GPUs: 53.9 GB.
real 17m34.095s
user 6m18.376s
sys 6m6.919s
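
The batch script behind this single-GPU run is not part of the committed output files. As a point of reference, here is a minimal sketch in the style of the trace above; the #SBATCH directives (job name, partition, account, time limit) are placeholders and assumptions, not values taken from the log.

#!/bin/bash
#SBATCH --job-name=llama31-70b-finetune   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1                      # a single A100 64GB, matching the nvidia-smi output above
#SBATCH --time=00:30:00                   # placeholder
#SBATCH --partition=<partition>           # placeholder
#SBATCH --account=<account>               # placeholder

set -x                                    # produces the '+ ...' trace lines seen in the log
date
hostname
nvidia-smi
# The real/user/sys summary at the end of this log suggests the training command was timed.
time conda run -n finetuning --no-capture-output python llama3.1-70b_train.py
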
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:52:56 CEST 2024
+ hostname
lrdn3183.leonardo.local
+ nvidia-smi
Fri Aug 30 16:52:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 43C P0 60W / 465W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 42C P0 60W / 468W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ scontrol show hostnames lrdn3183
++ head -n 1
+ export MASTER_ADDR=lrdn3183
+ MASTER_ADDR=lrdn3183
+ echo 'Using 2 GPUs on 1 nodes.'
Using 2 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 2 --num_cpu_threads_per_process 8 --main_process_ip lrdn3183 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.45s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.45s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1235, 'grad_norm': 1.6637215614318848, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7388, 'grad_norm': 0.32860875129699707, 'learning_rate': 5e-05, 'epoch': 0.01}
{'train_runtime': 536.0503, 'train_samples_per_second': 2.985, 'train_steps_per_second': 0.187, 'train_loss': 1.4311787033081054, 'epoch': 0.01}
100%|██████████| 100/100 [08:56<00:00, 5.36s/it]
Run time: 536.05 seconds
2 GPUs used.
Training speed: 3.0 samples/s (=1.5 samples/s/GPU)
Memory occupied on GPUs: 39.2 + 39.2 GB.
slurmstepd: error: *** STEP 7330690.0 ON lrdn3183 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** JOB 7330690 ON lrdn3183 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:19:37 CEST 2024
+ hostname
lrdn2709.leonardo.local
+ nvidia-smi
Fri Aug 30 16:19:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 43C P0 64W / 465W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 43C P0 63W / 469W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 43C P0 61W / 457W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 43C P0 63W / 462W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames 'lrdn[2709,2718]'
+ export MASTER_ADDR=lrdn2709
+ MASTER_ADDR=lrdn2709
+ echo 'Using 8 GPUs on 2 nodes.'
Using 8 GPUs on 2 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 2 --num_processes 8 --num_cpu_threads_per_process 8 --main_process_ip lrdn2709 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.35s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.35s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
max_steps is given, it will override any value given in num_train_epochs
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1236, 'grad_norm': 2.036155939102173, 'learning_rate': 2.5e-05, 'epoch': 0.02}
{'loss': 0.732, 'grad_norm': 0.3916214108467102, 'learning_rate': 5e-05, 'epoch': 0.04}
{'train_runtime': 654.3867, 'train_samples_per_second': 9.78, 'train_steps_per_second': 0.153, 'train_loss': 1.4278192138671875, 'epoch': 0.04}
Memory occupied on GPUs: 24.4 + 24.4 + 24.4 + 24.4 GB.
100%|██████████| 100/100 [10:54<00:00, 6.54s/it]
Run time: 654.39 seconds
8 GPUs used.
Training speed: 9.8 samples/s (=1.2 samples/s/GPU)
Memory occupied on GPUs: 24.4 + 24.4 + 24.4 + 24.4 GB.
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.1137435436249 seconds
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
slurmstepd: error: *** JOB 7330673 ON lrdn2709 CANCELLED AT 2024-08-30T16:52:50 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7330673.0 ON lrdn2709 CANCELLED AT 2024-08-30T16:52:50 DUE TO TIME LIMIT ***
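
The multi-node job script is likewise not included in this commit. The sketch below reconstructs it from the shell trace above (rendezvous address taken from the first hostname via scontrol, one accelerate launcher per node, machine rank supplied by SLURM_PROCID); the #SBATCH directives and the figure of 4 GPUs per node are assumptions, not values copied from the log.

#!/bin/bash
#SBATCH --nodes=2                         # 2 nodes x 4 GPUs = the 8-GPU run above
#SBATCH --ntasks-per-node=1               # one accelerate launcher per node
#SBATCH --cpus-per-task=32                # placeholder
#SBATCH --gres=gpu:4                      # assumption: 4 A100s per node, as shown by nvidia-smi
#SBATCH --time=00:30:00                   # placeholder

set -x
date
hostname
nvidia-smi

export MASTER_PORT=24998
# The first hostname in the allocation serves as the rendezvous host for torch.distributed.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

GPUS_PER_NODE=4
NUM_PROCESSES=$(( SLURM_JOB_NUM_NODES * GPUS_PER_NODE ))
echo "Using $NUM_PROCESSES GPUs on $SLURM_JOB_NUM_NODES nodes."

# One srun task per node; accelerate launch then starts one training process per GPU.
# $SLURM_PROCID is escaped so that every node substitutes its own rank at run time.
srun bash -c "conda run -n finetuning --no-capture-output accelerate launch --num_machines $SLURM_JOB_NUM_NODES --num_processes $NUM_PROCESSES --num_cpu_threads_per_process 8 --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT --machine_rank \$SLURM_PROCID --config_file \"fsdp_config.yml\" llama3.1-70b_train.py"
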
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:52:56 CEST 2024
+ hostname
lrdn3301.leonardo.local
+ nvidia-smi
Fri Aug 30 16:52:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 42C P0 61W / 467W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 43C P0 60W / 458W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 42C P0 61W / 454W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 42C P0 63W / 464W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames lrdn3301
+ export MASTER_ADDR=lrdn3301
+ MASTER_ADDR=lrdn3301
+ echo 'Using 4 GPUs on 1 nodes.'
Using 4 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 4 --num_cpu_threads_per_process 8 --main_process_ip lrdn3301 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1259, 'grad_norm': 12.881935119628906, 'learning_rate': 2.5e-05, 'epoch': 0.01}
{'loss': 0.7453, 'grad_norm': 0.21734948456287384, 'learning_rate': 5e-05, 'epoch': 0.02}
{'train_runtime': 557.075, 'train_samples_per_second': 5.744, 'train_steps_per_second': 0.18, 'train_loss': 1.435568504333496, 'epoch': 0.02}
100%|██████████| 100/100 [09:17<00:00, 5.57s/it]
Run time: 557.08 seconds
4 GPUs used.
Training speed: 5.7 samples/s (=1.4 samples/s/GPU)
Memory occupied on GPUs: 30.1 + 29.2 + 29.2 + 30.1 GB.
slurmstepd: error: *** JOB 7330688 ON lrdn3301 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7330688.0 ON lrdn3301 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 18:52:36 CEST 2024
+ hostname
lrdn0667.leonardo.local
+ nvidia-smi
Fri Aug 30 18:52:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 42C P0 61W / 467W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 42C P0 59W / 456W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 43C P0 61W / 461W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 43C P0 60W / 456W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames 'lrdn[0667,0768,0793,0997]'
+ export MASTER_ADDR=lrdn0667
+ MASTER_ADDR=lrdn0667
+ echo 'Using 16 GPUs on 4 nodes.'
Using 16 GPUs on 4 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 4 --num_processes 16 --num_cpu_threads_per_process 8 --main_process_ip lrdn0667 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Loading checkpoint shards: 100%|██████████| 30/30 [12:55<00:00, 25.86s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:11<00:00, 26.39s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:51<00:00, 25.70s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:35<00:00, 25.20s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:10<00:00, 26.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:51<00:00, 25.72s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:00<00:00, 26.01s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:30<00:00, 27.01s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:46<00:00, 27.55s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:56<00:00, 25.87s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:29<00:00, 26.99s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:27<00:00, 26.92s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:16<00:00, 28.56s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:10<00:00, 28.36s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:50<00:00, 27.69s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:16<00:00, 28.56s/it]
max_steps is given, it will override any value given in num_train_epochs
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1152, 'grad_norm': 0.8707819581031799, 'learning_rate': 2.5e-05, 'epoch': 0.03}
{'loss': 0.7285, 'grad_norm': 0.4297076463699341, 'learning_rate': 5e-05, 'epoch': 0.07}
{'train_runtime': 701.4784, 'train_samples_per_second': 18.247, 'train_steps_per_second': 0.143, 'train_loss': 1.4218804931640625, 'epoch': 0.07}
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
100%|██████████| 100/100 [11:41<00:00, 7.01s/it]
Run time: 701.48 seconds
16 GPUs used.
Training speed: 18.2 samples/s (=1.1 samples/s/GPU)
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.139240026474 seconds
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.15521478652954 seconds
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.1139500141144 seconds
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 17:26:03 CEST 2024
+ hostname
lrdn0760.leonardo.local
+ nvidia-smi
Fri Aug 30 17:26:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 44C P0 67W / 484W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ scontrol show hostnames lrdn0760
++ head -n 1
+ export MASTER_ADDR=lrdn0760
+ MASTER_ADDR=lrdn0760
+ echo 'Using 1 GPUs on 1 nodes.'
Using 1 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 1 --num_cpu_threads_per_process 8 --main_process_ip lrdn0760 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [05:57<00:00, 11.91s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.186, 'grad_norm': 1.1888428926467896, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7615, 'grad_norm': 0.5940068364143372, 'learning_rate': 5e-05, 'epoch': 0.0}
{'train_runtime': 439.7792, 'train_samples_per_second': 1.819, 'train_steps_per_second': 0.227, 'train_loss': 1.4737699508666993, 'epoch': 0.0}
100%|██████████| 100/100 [07:19<00:00, 4.40s/it]
Run time: 439.78 seconds
1 GPUs used.
Training speed: 1.8 samples/s (=1.8 samples/s/GPU)
Memory occupied on GPUs: 51.6 GB.