Commit 4c8e7d3f authored by Pfister, Martin

Add LEONARDO FSDP output files.

parent 7c2667b5
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 17:26:02 CEST 2024
+ hostname
lrdn3279.leonardo.local
+ nvidia-smi
Fri Aug 30 17:26:02 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 42C P0 61W / 464W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ conda run -n finetuning --no-capture-output python llama3.1-70b_train.py
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 30/30 [05:53<00:00, 11.80s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
100%|██████████| 100/100 [10:42<00:00, 6.43s/it]
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/peft/utils/other.py:619: UserWarning: Unable to fetch remote file due to the following error (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /meta-llama/Meta-Llama-3.1-70B-Instruct/resolve/main/config.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x146f620ac510>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 7c6ef0fe-4985-43dd-9e1e-c63edaadc650)') - silently ignoring the lookup for the file config.json in meta-llama/Meta-Llama-3.1-70B-Instruct.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/peft/utils/save_and_load.py:218: UserWarning: Could not find a config file in meta-llama/Meta-Llama-3.1-70B-Instruct - will assume that the vocabulary was not modified.
warnings.warn(
{'loss': 1.9947, 'grad_norm': 0.8367132544517517, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7486, 'grad_norm': 0.6377346515655518, 'learning_rate': 5e-05, 'epoch': 0.0}
{'train_runtime': 642.6727, 'train_samples_per_second': 1.245, 'train_steps_per_second': 0.156, 'train_loss': 1.3716077423095703, 'epoch': 0.0}
Run time: 642.67 seconds
1 GPUs used.
Training speed: 1.2 samples/s (=1.2 samples/s/GPU)
Memory occupied on GPUs: 53.9 GB.
real 17m34.095s
user 6m18.376s
sys 6m6.919s
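
The batch script behind this single-GPU run is not part of the committed output files. As a point of reference, here is a minimal sketch in the style of the trace above; the #SBATCH directives (job name, partition, account, time limit) are placeholders and assumptions, not values taken from the log.

#!/bin/bash
#SBATCH --job-name=llama31-70b-finetune   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1                      # a single A100 64GB, matching the nvidia-smi output above
#SBATCH --time=00:30:00                   # placeholder
#SBATCH --partition=<partition>           # placeholder
#SBATCH --account=<account>               # placeholder

set -x                                    # produces the '+ ...' trace lines seen in the log
date
hostname
nvidia-smi
# The real/user/sys summary at the end of this log suggests the training command was timed.
time conda run -n finetuning --no-capture-output python llama3.1-70b_train.py
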
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:52:56 CEST 2024
+ hostname
lrdn3183.leonardo.local
+ nvidia-smi
Fri Aug 30 16:52:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 43C P0 60W / 465W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 42C P0 60W / 468W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ scontrol show hostnames lrdn3183
++ head -n 1
+ export MASTER_ADDR=lrdn3183
+ MASTER_ADDR=lrdn3183
+ echo 'Using 2 GPUs on 1 nodes.'
Using 2 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 2 --num_cpu_threads_per_process 8 --main_process_ip lrdn3183 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.45s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.45s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1235, 'grad_norm': 1.6637215614318848, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7388, 'grad_norm': 0.32860875129699707, 'learning_rate': 5e-05, 'epoch': 0.01}
{'train_runtime': 536.0503, 'train_samples_per_second': 2.985, 'train_steps_per_second': 0.187, 'train_loss': 1.4311787033081054, 'epoch': 0.01}
100%|██████████| 100/100 [08:56<00:00, 5.36s/it]
Run time: 536.05 seconds
2 GPUs used.
Training speed: 3.0 samples/s (=1.5 samples/s/GPU)
Memory occupied on GPUs: 39.2 + 39.2 GB.
slurmstepd: error: *** STEP 7330690.0 ON lrdn3183 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** JOB 7330690 ON lrdn3183 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:19:37 CEST 2024
+ hostname
lrdn2709.leonardo.local
+ nvidia-smi
Fri Aug 30 16:19:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 43C P0 64W / 465W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 43C P0 63W / 469W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 43C P0 61W / 457W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 43C P0 63W / 462W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames 'lrdn[2709,2718]'
+ export MASTER_ADDR=lrdn2709
+ MASTER_ADDR=lrdn2709
+ echo 'Using 8 GPUs on 2 nodes.'
Using 8 GPUs on 2 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 2 --num_processes 8 --num_cpu_threads_per_process 8 --main_process_ip lrdn2709 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.35s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.35s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [10:40<00:00, 21.34s/it]
max_steps is given, it will override any value given in num_train_epochs
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1236, 'grad_norm': 2.036155939102173, 'learning_rate': 2.5e-05, 'epoch': 0.02}
{'loss': 0.732, 'grad_norm': 0.3916214108467102, 'learning_rate': 5e-05, 'epoch': 0.04}
{'train_runtime': 654.3867, 'train_samples_per_second': 9.78, 'train_steps_per_second': 0.153, 'train_loss': 1.4278192138671875, 'epoch': 0.04}
Memory occupied on GPUs: 24.4 + 24.4 + 24.4 + 24.4 GB.
100%|██████████| 100/100 [10:54<00:00, 6.54s/it]
Run time: 654.39 seconds
8 GPUs used.
Training speed: 9.8 samples/s (=1.2 samples/s/GPU)
Memory occupied on GPUs: 24.4 + 24.4 + 24.4 + 24.4 GB.
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.1137435436249 seconds
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 16:46:56,209] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
slurmstepd: error: *** JOB 7330673 ON lrdn2709 CANCELLED AT 2024-08-30T16:52:50 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7330673.0 ON lrdn2709 CANCELLED AT 2024-08-30T16:52:50 DUE TO TIME LIMIT ***
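
The multi-node job script is likewise not included in this commit. The sketch below reconstructs it from the shell trace above (rendezvous address taken from the first hostname via scontrol, one accelerate launcher per node, machine rank supplied by SLURM_PROCID); the #SBATCH directives and the figure of 4 GPUs per node are assumptions, not values copied from the log.

#!/bin/bash
#SBATCH --nodes=2                         # 2 nodes x 4 GPUs = the 8-GPU run above
#SBATCH --ntasks-per-node=1               # one accelerate launcher per node
#SBATCH --cpus-per-task=32                # placeholder
#SBATCH --gres=gpu:4                      # assumption: 4 A100s per node, as shown by nvidia-smi
#SBATCH --time=00:30:00                   # placeholder

set -x
date
hostname
nvidia-smi

export MASTER_PORT=24998
# The first hostname in the allocation serves as the rendezvous host for torch.distributed.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

GPUS_PER_NODE=4
NUM_PROCESSES=$(( SLURM_JOB_NUM_NODES * GPUS_PER_NODE ))
echo "Using $NUM_PROCESSES GPUs on $SLURM_JOB_NUM_NODES nodes."

# One srun task per node; accelerate launch then starts one training process per GPU.
# $SLURM_PROCID is escaped so that every node substitutes its own rank at run time.
srun bash -c "conda run -n finetuning --no-capture-output accelerate launch --num_machines $SLURM_JOB_NUM_NODES --num_processes $NUM_PROCESSES --num_cpu_threads_per_process 8 --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT --machine_rank \$SLURM_PROCID --config_file \"fsdp_config.yml\" llama3.1-70b_train.py"
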
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 16:52:56 CEST 2024
+ hostname
lrdn3301.leonardo.local
+ nvidia-smi
Fri Aug 30 16:52:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 42C P0 61W / 467W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 43C P0 60W / 458W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 42C P0 61W / 454W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 42C P0 63W / 464W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames lrdn3301
+ export MASTER_ADDR=lrdn3301
+ MASTER_ADDR=lrdn3301
+ echo 'Using 4 GPUs on 1 nodes.'
Using 4 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 4 --num_cpu_threads_per_process 8 --main_process_ip lrdn3301 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [09:13<00:00, 18.46s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1259, 'grad_norm': 12.881935119628906, 'learning_rate': 2.5e-05, 'epoch': 0.01}
{'loss': 0.7453, 'grad_norm': 0.21734948456287384, 'learning_rate': 5e-05, 'epoch': 0.02}
{'train_runtime': 557.075, 'train_samples_per_second': 5.744, 'train_steps_per_second': 0.18, 'train_loss': 1.435568504333496, 'epoch': 0.02}
100%|██████████| 100/100 [09:17<00:00, 5.57s/it]
Run time: 557.08 seconds
4 GPUs used.
Training speed: 5.7 samples/s (=1.4 samples/s/GPU)
Memory occupied on GPUs: 30.1 + 29.2 + 29.2 + 30.1 GB.
slurmstepd: error: *** JOB 7330688 ON lrdn3301 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
srun: Job step aborted: Waiting up to 92 seconds for job step to finish.
slurmstepd: error: *** STEP 7330688.0 ON lrdn3301 CANCELLED AT 2024-08-30T17:26:00 DUE TO TIME LIMIT ***
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 18:52:36 CEST 2024
+ hostname
lrdn0667.leonardo.local
+ nvidia-smi
Fri Aug 30 18:52:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 42C P0 61W / 467W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM-64GB On | 00000000:56:00.0 Off | 0 |
| N/A 42C P0 59W / 456W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM-64GB On | 00000000:8F:00.0 Off | 0 |
| N/A 43C P0 61W / 461W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM-64GB On | 00000000:C8:00.0 Off | 0 |
| N/A 43C P0 60W / 456W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ head -n 1
++ scontrol show hostnames 'lrdn[0667,0768,0793,0997]'
+ export MASTER_ADDR=lrdn0667
+ MASTER_ADDR=lrdn0667
+ echo 'Using 16 GPUs on 4 nodes.'
Using 16 GPUs on 4 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 4 --num_processes 16 --num_cpu_threads_per_process 8 --main_process_ip lrdn0667 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Using the latest cached version of the dataset since medmcqa couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /leonardo/home/userexternal/mpfister/.cache/huggingface/datasets/medmcqa/default/0.0.0/91c6572c454088bf71b679ad90aa8dffcd0d5868 (last modified on Thu Aug 29 19:38:14 2024).
Loading checkpoint shards: 100%|██████████| 30/30 [12:55<00:00, 25.86s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:11<00:00, 26.39s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:51<00:00, 25.70s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:35<00:00, 25.20s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:10<00:00, 26.34s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:51<00:00, 25.72s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:00<00:00, 26.01s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:30<00:00, 27.01s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:46<00:00, 27.55s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [12:56<00:00, 25.87s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:29<00:00, 26.99s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:27<00:00, 26.92s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:16<00:00, 28.56s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:10<00:00, 28.36s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [13:50<00:00, 27.69s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [14:16<00:00, 28.56s/it]
max_steps is given, it will override any value given in num_train_epochs
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
max_steps is given, it will override any value given in num_train_epochs
max_steps is given, it will override any value given in num_train_epochs
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.1152, 'grad_norm': 0.8707819581031799, 'learning_rate': 2.5e-05, 'epoch': 0.03}
{'loss': 0.7285, 'grad_norm': 0.4297076463699341, 'learning_rate': 5e-05, 'epoch': 0.07}
{'train_runtime': 701.4784, 'train_samples_per_second': 18.247, 'train_steps_per_second': 0.143, 'train_loss': 1.4218804931640625, 'epoch': 0.07}
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
100%|██████████| 100/100 [11:41<00:00, 7.01s/it]
Run time: 701.48 seconds
16 GPUs used.
Training speed: 18.2 samples/s (=1.1 samples/s/GPU)
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
Memory occupied on GPUs: 21.6 + 21.6 + 21.6 + 21.6 GB.
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.139240026474 seconds
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,714] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.15521478652954 seconds
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,809] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] Error waiting on exit barrier. Elapsed: 300.1139500141144 seconds
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] Traceback (most recent call last):
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 929, in _exit_barrier
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] store_util.barrier(
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 78, in barrier
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] agent_data = get_all(store, rank, key_prefix, world_size)
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] File "/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] data = store.get(f"{prefix}{idx}")
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-08-30 19:26:06,821] torch.distributed.elastic.agent.server.api: [ERROR] torch.distributed.DistStoreError: Socket Timeout
login05-ext.leonardo.cineca.it
Unloading profile/base
ERROR: Module evaluation aborted
+ date
Fri Aug 30 17:26:03 CEST 2024
+ hostname
lrdn0760.leonardo.local
+ nvidia-smi
Fri Aug 30 17:26:03 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM-64GB On | 00000000:1D:00.0 Off | 0 |
| N/A 44C P0 67W / 484W| 0MiB / 65536MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
+ export MASTER_PORT=24998
+ MASTER_PORT=24998
++ scontrol show hostnames lrdn0760
++ head -n 1
+ export MASTER_ADDR=lrdn0760
+ MASTER_ADDR=lrdn0760
+ echo 'Using 1 GPUs on 1 nodes.'
Using 1 GPUs on 1 nodes.
+ srun bash -c 'conda run -n finetuning --no-capture-output accelerate launch --num_machines 1 --num_processes 1 --num_cpu_threads_per_process 8 --main_process_ip lrdn0760 --main_process_port 24998 --machine_rank $SLURM_PROCID --config_file "fsdp_config.yml" llama3.1-70b_train.py'
Loading checkpoint shards: 100%|██████████| 30/30 [05:57<00:00, 11.91s/it]
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs
trainable params: 207,093,760 || all params: 70,760,800,256 || trainable%: 0.2927
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1558: UserWarning: Upcasted low precision parameters in Linear because mixed precision turned on in FSDP. Affects: weight.
warnings.warn(
/leonardo/home/userexternal/mpfister/.conda/envs/finetuning/lib/python3.11/site-packages/accelerate/accelerator.py:1564: UserWarning: FSDP upcast of low precision parameters may affect the precision of model checkpoints.
warnings.warn(
{'loss': 2.186, 'grad_norm': 1.1888428926467896, 'learning_rate': 2.5e-05, 'epoch': 0.0}
{'loss': 0.7615, 'grad_norm': 0.5940068364143372, 'learning_rate': 5e-05, 'epoch': 0.0}
{'train_runtime': 439.7792, 'train_samples_per_second': 1.819, 'train_steps_per_second': 0.227, 'train_loss': 1.4737699508666993, 'epoch': 0.0}
100%|██████████| 100/100 [07:19<00:00, 4.40s/it]
Run time: 439.78 seconds
1 GPUs used.
Training speed: 1.8 samples/s (=1.8 samples/s/GPU)
Memory occupied on GPUs: 51.6 GB.