diff --git a/README.md b/README.md
index a0ad5a2bc51baac6681d9a5c8d55f28ecaf041f5..c65757b74f86a9df84b3c1fd80a0153993ba9f08 100644
--- a/README.md
+++ b/README.md
@@ -133,7 +133,7 @@ CONDA_OVERRIDE_CUDA=12.0 conda env create -n finetuning -f conda_env_pinned_vers
On [LUMI](https://lumi-supercomputer.eu), it is [strongly discouraged](https://docs.lumi-supercomputer.eu/software/installing/python/) to install Python packages directly into the filesystem, because the large number of small files does not play well with the cluster filesystem. Instead, the documentation recommends to install Singularity/Apptainer containers with the Python environment. There is a tool called [cotainr](https://docs.lumi-supercomputer.eu/software/containers/singularity/#building-containers-using-the-cotainr-tool) that helps to do this.
-As LUMI has AMD GPUs instead of the more common Nvidia GPUs, libraries that make use of ROCm instead of CUDA need to be installed. Since the [ROCm-enabled version of bitsandbytes](https://github.com/ROCm/bitsandbytes) is currently only available as source code, compiled files are included in this repository. If one wants to compile bitsandbytes themselves, instructions can be found in the folder [lumi_compile_bnb](lumi_compile_bnb).
+Because LUMI has AMD GPUs instead of the more common Nvidia GPUs, libraries that use ROCm instead of CUDA need to be installed. The [ROCm-enabled version of bitsandbytes](https://github.com/ROCm/bitsandbytes) was only available as source code when this project was started, so compiled files are included in this repository. To compile bitsandbytes yourself, follow the instructions in the folder [lumi_compile_bnb](lumi_compile_bnb).
To generate the Singularity container with the conda environment for LUMI, execute the following commands:
@@ -143,6 +143,14 @@ module load LUMI/23.03 cotainr
srun --pty --time=00:30:00 --mem=128G --cpus-per-task=32 --partition=debug cotainr build lumi_container.sif --system=lumi-g --conda-env=conda_env_lumi_pinned_versions.yml --accept-licenses
```
+**Update November 2024**: A pre-release version of bitsandbytes for AMD GPUs is now available as a pre-compiled wheel. This version requires ROCm>=6.1, which has only been usable on LUMI since a newer ROCm driver was installed during the September 2024 service break. PyTorch 2.2, which this project initially used, is no longer available pre-compiled for ROCm 6.1. To use the bitsandbytes ROCm pre-release together with ROCm 6.2 and PyTorch 2.5.1, use the `conda` environment specification (without pinned versions) in [conda_env_lumi_torch251.yml](conda_env_lumi_torch251.yml). To build the corresponding container on LUMI, execute the following commands:
+
+``` bash
+module purge
+module load LUMI cotainr
+srun --pty --time=00:30:00 --mem=192G --cpus-per-task=32 --partition=debug cotainr build lumi_container.sif --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-6.2.2.sif --conda-env=conda_env_lumi_torch251.yml --accept-licenses
+```
+
### Huggingface login
Before the scripts can run, it is necessary to accept the conditions for using [Mistral 7B Instruct v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) on the huggingface website. Additionally, it is necessary to log in to huggingface on the supercomputer so that the scripts can access the pretrained model. To do so, execute on VSC and Leonardo:
diff --git a/conda_env_lumi_torch251.yml b/conda_env_lumi_torch251.yml
new file mode 100644
index 0000000000000000000000000000000000000000..f478a58d610c05d5317fb75eb6824762c484b1b2
--- /dev/null
+++ b/conda_env_lumi_torch251.yml
@@ -0,0 +1,32 @@
+channels:
+ - conda-forge
+ - nodefaults
+dependencies:
+ - python=3.11
+ - pip
+ - pip:
+   - ipython
+   - numpy
+   - scipy
+   - accelerate
+   - peft
+   - transformers
+   - datasets
+   - py7zr
+   - optimum
+   - gekko
+   - wandb
+   - jellyfish # to calculate levenshtein distance
+   - trl
+   # - bitsandbytes # quantization (option #1)
+   # For AMD GPUs, use the following wheel instead of bitsandbytes from pypi repo:
+   - https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl
+   - auto-gptq # quantization (option #2)
+   # - nvidia-ml-py3 # to read GPU memory usage on Nvidia GPUs
+   - pyrsmi # to read GPU memory usage on AMD GPUs
+   - torch==2.5.1
+   # - unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git
+   # - --extra-index-url https://download.pytorch.org/whl/rocm5.6 # ROCm 5.6
+   - --extra-index-url https://download.pytorch.org/whl/rocm6.2 # ROCm 6.2
+   # - --extra-index-url https://download.pytorch.org/whl/cu118 # CUDA 11.8
+   # - --extra-index-url https://download.pytorch.org/whl/cu121 # CUDA 12.1
diff --git a/mistral7b-bnb/output_lumi_20241111_torch251.txt b/mistral7b-bnb/output_lumi_20241111_torch251.txt
new file mode 100644
index 0000000000000000000000000000000000000000..b008f404462cd05299980bf357e0a0f2752f6927
--- /dev/null
+++ b/mistral7b-bnb/output_lumi_20241111_torch251.txt
@@ -0,0 +1,66 @@
++ date
+Mon Nov 11 16:57:40 EET 2024
++ hostname
+nid005054
++ export HF_HOME=/pfs/lustrep1/scratch/project_465001276/mpfister/huggingface
++ HF_HOME=/pfs/lustrep1/scratch/project_465001276/mpfister/huggingface
++ export HF_TOKEN_PATH=/users/pfisterm/.huggingface_token
++ HF_TOKEN_PATH=/users/pfisterm/.huggingface_token
+++ git rev-parse --show-toplevel
++ CONTAINER=/pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif
++ export SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
++ SINGULARITY_BIND=/pfs,/scratch,/projappl,/project,/flash,/appl,
++ rocm-smi
+
+
+======================================== ROCm System Management Interface ========================================
+================================================== Concise Info ==================================================
+Device [Model : Revision] Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
+ Name (20 chars) (Edge) (Avg) (Mem, Compute)
+==================================================================================================================
+0 [0x0b0c : 0x00] 45.0°C 160.0W N/A, N/A 800Mhz 1600Mhz 0% manual 500.0W 0% 0%
+ AMD INSTINCT MI200 (
+==================================================================================================================
+============================================== End of ROCm SMI Log ===============================================
++ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif python mistral7b_train.py
+Map: 100%|██████████| 182822/182822 [00:37<00:00, 4843.04 examples/s]
+/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/bitsandbytes/backends/cpu_xpu_common.py:28: UserWarning: g++ not found, torch.compile disabled for CPU/XPU.
+ warnings.warn("g++ not found, torch.compile disabled for CPU/XPU.")
+Loading checkpoint shards: 100%|██████████| 3/3 [02:51<00:00, 57.25s/it]
+/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py:100: FutureWarning: Deprecated argument(s) used in '__init__': dataset_text_field, max_seq_length. Will not be supported from version '0.13.0'.
+
+Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.
+ warnings.warn(message, FutureWarning)
+/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:300: UserWarning: You passed a `max_seq_length` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
+ warnings.warn(
+/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:328: UserWarning: You passed a `dataset_text_field` argument to the SFTTrainer, the value you passed will override the one in the `SFTConfig`.
+ warnings.warn(
+Map: 100%|██████████| 182822/182822 [00:10<00:00, 17823.96 examples/s]
+max_steps is given, it will override any value given in num_train_epochs
+trainable params: 41,943,040 || all params: 7,289,966,592 || trainable%: 0.5754
+100%|██████████| 100/100 [02:36<00:00, 1.57s/it]
+{'loss': 3.0269, 'grad_norm': 0.0, 'learning_rate': 2.9e-05, 'epoch': 0.0}
+{'loss': 2.3382, 'grad_norm': 0.0, 'learning_rate': 7.7e-05, 'epoch': 0.0}
+{'train_runtime': 156.9882, 'train_samples_per_second': 5.096, 'train_steps_per_second': 0.637, 'train_loss': 2.682542419433594, 'epoch': 0.0}
+Run time: 156.99 seconds
+1 GPUs used.
+Training speed: 5.1 samples/s (=5.1 samples/s/GPU)
+Memory occupied on GPUs: 10.6 GB.
+
+real 7m24.304s
+user 0m0.016s
+sys 0m0.027s
++ srun singularity exec /pfs/lustrep1/scratch/project_465001276/mpfister/llm-finetuning/lumi_container.sif python mistral7b_test.py
+Loading checkpoint shards: 100%|██████████| 3/3 [00:09<00:00, 3.21s/it]
+/opt/conda/envs/conda_container_env/lib/python3.11/site-packages/bitsandbytes/backends/cpu_xpu_common.py:28: UserWarning: g++ not found, torch.compile disabled for CPU/XPU.
+ warnings.warn("g++ not found, torch.compile disabled for CPU/XPU.")
+Map: 100%|██████████| 4183/4183 [00:00<00:00, 4872.57 examples/s]
+100%|██████████| 66/66 [10:36<00:00, 9.64s/it]
+22.88% (957 out of 4183) answers correct.
+Run time: 636.51 seconds
+Samples/second: 6.6
+Memory occupied on GPUs: 24.6 GB.
+
+real 11m13.774s
+user 0m0.018s
+sys 0m0.016s