+---
+datasets:
+- OpenAssistant/oasst1
+pipeline_tag: text-generation
+---
+# Falcon-40b-chat-oasst1
+Falcon-40b-chat-oasst1 is a chatbot-like model for dialogue generation. It was built by fine-tuning [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) on the [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) dataset.
+This model was fine-tuned in 4-bit using 🤗 [peft](https://github.com/huggingface/peft) adapters, [transformers](https://github.com/huggingface/transformers), and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).
+- Training relied on Low-Rank Adaptation ([LoRA](https://arxiv.org/pdf/2106.09685.pdf)): instead of fine-tuning the entire model, only small adapter weights are trained and then loaded on top of the frozen base model.
+- Training took approximately 10 hours and was executed on a workstation with a single NVIDIA A100-SXM 40GB GPU (via Google Colab).
+- See attached [Notebook](https://huggingface.co/dfurman/falcon-40b-chat-oasst1/blob/main/finetune_falcon40b_oasst1_with_bnb_peft.ipynb) for the code (and hyperparams) used to train the model.
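To give a feel for why training only adapters is so much cheaper, the sketch below counts trainable parameters for a single weight matrix. In LoRA, a frozen `d_out x d_in` matrix is augmented with two trainable factors of shapes `d_out x r` and `r x d_in`. The dimensions and the rank used here are illustrative assumptions, not values taken from the training notebook.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_matrix_params, lora_adapter_params) for one weight matrix."""
    full = d_out * d_in            # parameters updated by full fine-tuning
    lora = rank * (d_out + d_in)   # parameters in the two low-rank factors
    return full, lora


# Hypothetical example: an 8192 x 8192 projection with rank-16 adapters.
full, lora = lora_param_counts(8192, 8192, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumed sizes the adapter holds well under 1% of the parameters of the matrix it modifies, which is what makes single-GPU fine-tuning feasible.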
+## Model Summary
+- **Model Type:** Causal decoder-only
+- **Language(s) (NLP):** English (primarily)
+- **Base Model:** [Falcon-40B](https://huggingface.co/tiiuae/falcon-40b) (License: [TII Falcon LLM License](https://huggingface.co/tiiuae/falcon-40b#license), commercial use permitted)
+- **Dataset:** [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) (License: [Apache 2.0](https://huggingface.co/datasets/OpenAssistant/oasst1/blob/main/LICENSE), commercial use permitted)
+### Model Date
+May 30, 2023
+## Quick Start
+To prompt the chat model, use the following format:
+```
+<human>: [Instruction]
+<bot>:
+```
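The format above can be produced with a small helper. This is a minimal sketch for single-turn prompts; the function name `build_prompt` is hypothetical and not part of the model repository.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a single user instruction in the <human>/<bot> chat format."""
    # The prompt ends with "<bot>:" so that the model continues as the bot.
    return f"<human>: {instruction.strip()}\n<bot>:"


print(build_prompt("Create a list of things to do in San Francisco."))
```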
+### Example Dialogue
+**Prompter**:
+```
+<human>: My name is Daniel. Write a short email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB.
+<bot>:
+```
+**Falcon-40b-chat-oasst1**:
+>Coming
+**Prompter**:
+```
+<human>: Create a list of things to do in San Francisco.
+<bot>:
+```
+**Falcon-40b-chat-oasst1**:
+>Coming
+### Direct Use
+This model has been fine-tuned on conversation trees from [OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) and should only be used on data of a similar nature.
+### Out-of-Scope Use
+Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
+## Bias, Risks, and Limitations
+This model is trained mostly on English data and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
+### Recommendations
+We recommend that users of this model develop guardrails and take appropriate precautions for any production use.
+## How to Get Started with the Model
+### Setup
+```python
+# Install packages (the leading "!" is notebook syntax; drop it when running in a shell)
+!pip install -q -U bitsandbytes loralib einops
+!pip install -q -U git+https://github.com/huggingface/transformers.git
+!pip install -q -U git+https://github.com/huggingface/peft.git
+!pip install -q -U git+https://github.com/huggingface/accelerate.git
+# Import packages
+import torch
+from peft import PeftModel, PeftConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+```
+### GPU Inference in 4-bit
+This requires a GPU with at least 27 GB of memory.
+```python
+# Load the adapter config; it records which base model the adapter was trained on
+peft_model_id = "dfurman/falcon-40b-chat-oasst1"
+config = PeftConfig.from_pretrained(peft_model_id)
+# Quantize the base model to 4-bit NF4 with double quantization; compute in bfloat16
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16
+)
+# Load the quantized base model onto GPU 0
+model = AutoModelForCausalLM.from_pretrained(
+    config.base_model_name_or_path,
+    return_dict=True,
+    quantization_config=bnb_config,
+    device_map={"":0},
+    use_auth_token=True,
+    trust_remote_code=True,
+)
+# Load the tokenizer and reuse the EOS token for padding
+tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+tokenizer.pad_token = tokenizer.eos_token
+# Attach the fine-tuned LoRA adapter weights to the base model
+model = PeftModel.from_pretrained(model, peft_model_id)
+```
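As a rough sanity check on the memory requirement stated above, the sketch below estimates the size of the weights alone at different precisions. It assumes a nominal 40 billion parameters (taken from the model name, not an exact count) and ignores quantization constants, any layers kept in higher precision, activations, and the generation cache, which plausibly account for the gap between this estimate and the stated figure.

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 40e9  # assumption: nominal parameter count implied by the "40B" name
print("4-bit weights:   ", weight_size_gb(n_params, 4), "GB")
print("bfloat16 weights:", weight_size_gb(n_params, 16), "GB")
```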
+```python
+# Run the model
+prompt = """<human>: My name is Daniel. Write a long email to my closest friends inviting them to come to my home on Friday for a dinner party, I will make the food but tell them to BYOB.
+<bot>:"""
+batch = tokenizer(
+    prompt,
+    padding=True,
+    truncation=True,
+    return_tensors="pt",
+)
+batch = batch.to("cuda:0")
+with torch.cuda.amp.autocast():
+    output_tokens = model.generate(
+        input_ids=batch.input_ids,
+        attention_mask=batch.attention_mask,
+        max_new_tokens=200,
+        do_sample=True,  # sampling must be enabled for temperature and top_p to take effect
+        temperature=0.7,
+        top_p=0.7,
+        num_return_sequences=1,
+        pad_token_id=tokenizer.eos_token_id,
+        eos_token_id=tokenizer.eos_token_id,
+    )
+# Inspect outputs (the decoded text includes the prompt)
+print("\n\n", tokenizer.decode(output_tokens[0], skip_special_tokens=True))
+```
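The decoded text contains the prompt followed by the model's continuation, and the model may keep going and begin a new `<human>:` turn on its own. The sketch below isolates the first bot reply from such a string. The helper name `extract_reply` is hypothetical, and it assumes a single-turn prompt in the `<human>:` / `<bot>:` format described in this card.

```python
def extract_reply(decoded: str) -> str:
    """Return the text of the first bot turn in a decoded generation."""
    # Keep everything after the first "<bot>:" marker (the one from the prompt).
    _, marker, rest = decoded.partition("<bot>:")
    if not marker:
        rest = decoded  # no marker found: fall back to the whole string
    # Cut off any follow-up turn the model may have started by itself.
    return rest.split("<human>:", 1)[0].strip()


sample = "<human>: Say hello.\n<bot>: Hello there!\n<human>: Thanks"
print(extract_reply(sample))
```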
+## Reproducibility
+- See attached [Notebook](https://huggingface.co/dfurman/falcon-40b-chat-oasst1/blob/main/finetune_falcon40b_oasst1_with_bnb_peft.ipynb) for the code (and hyperparams) used to train the model.
+### CUDA Info
+- CUDA Version: 12.0
+- GPU Name: NVIDIA A100-SXM
+- Max Memory: {0: "37GB"}
+- Device Map: {"": 0}
+### Package Versions Employed
+- `torch`==2.0.1+cu118
+- `transformers`==4.30.0.dev0
+- `peft`==0.4.0.dev0
+- `accelerate`==0.19.0
+- `bitsandbytes`==0.39.0
+- `einops`==0.6.1