    <string name="name" description="a unique pet name"/>
</output>


ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise.

Here are examples of simple (XML, JSON) pairs that show the expected behavior:
- `<string name='foo' format='two-words lower-case' />` => `{'foo': 'example one'}`
- `<list name='bar'><string format='upper-case' /></list>` => `{"bar": ['STRING ONE', 'STRING TWO', etc.]}`
- `<object name='baz'><string name="foo" format="capitalize two-words" /><integer name="index" format="1-indexed" /></object>` => `{'baz': {'foo': 'Some String', 'index': 1}}`
Followed by another call with this prompt:
I was given the following response, which was not parseable as JSON.

"{\n  \"pet_type\": \"dog\",\n  \"name\": \"Buddy\"

Help me correct this by making it valid JSON.

Given below is XML that describes the information to extract from this document and the tags to extract it into.

<output>
    <string name="pet_type" description="Species of pet"/>
    <string name="name" description="a unique pet name"/>
</output>


ONLY return a valid JSON object (no other text is necessary), where the key of the field in JSON is the `name` attribute of the corresponding XML, and the value is of the type specified by the corresponding XML's tag. The JSON MUST conform to the XML format, including any types and format requests e.g. requests for lists, objects and specific types. Be correct and concise. If you are unsure anywhere, enter `null`.
Woof. That’s a whole lot of ceremony to get structured output! We learned that this library’s approach to structured output uses XML schemas (while others use function calling). It’s worth considering if you can fashion a better or simpler approach now that the magic has been lifted. Either way, we now have insight into how it works without dragging you into unnecessary complexity, which is a win.
Guidance offers constrained generation and programming constructs for writing prompts. Let's dive into a chat example from their tutorials:
\\nimport guidance\\ngpt35 = guidance.models.OpenAI(\\"gpt-3.5-turbo\\")\\n\\nimport re\\nfrom guidance import gen, select, system, user, assistant\\n\\n@guidance\\ndef plan_for_goal(lm, goal: str):\\n \\n # This is a helper function which we will use below\\n def parse_best(prosandcons, options):\\n best = re.search(r\'Best=(\\\\d+)\', prosandcons)\\n if not best:\\n best = re.search(r\'Best.*?(\\\\d+)\', \'Best= option is 3\')\\n if best:\\n best = int(best.group(1))\\n else:\\n best = 0\\n return options[best]\\n\\n # Some general instruction to the model\\n with system():\\n lm += \\"You are a helpful assistant.\\"\\n\\n # Simulate a simple request from the user\\n # Note that we switch to using \'lm2\' here, because these are intermediate steps (so we don\'t want to overwrite the current lm object)\\n with user():\\n lm2 = lm + f\\"\\"\\"\\\\\\n I want to {goal}\\n Can you please generate one option for how to accomplish this?\\n Please make the option very short, at most one line.\\"\\"\\"\\n\\n # Generate several options. Note that this means several sequential generation requests\\n n_options = 5\\n with assistant():\\n options = []\\n for i in range(n_options):\\n options.append((lm2 + gen(name=\'option\', temperature=1.0, max_tokens=50))[\\"option\\"])\\n\\n # Have the user request pros and cons\\n with user():\\n lm2 += f\\"\\"\\"\\\\\\n I want to {goal}\\n Can you please comment on the pros and cons of each of the following options, and then pick the best option?\\n ---\\n \\"\\"\\"\\n for i, opt in enumerate(options):\\n lm2 += f\\"Option {i}: {opt}\\\\n\\"\\n lm2 += f\\"\\"\\"\\\\\\n ---\\n Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option.\\"\\"\\"\\n\\n # Get the pros and cons from the model\\n with assistant():\\n lm2 += gen(name=\'prosandcons\', temperature=0.0, max_tokens=600, stop=\\"Best=\\") + \\"Best=\\" + gen(\\"best\\", regex=\\"[0-9]+\\") \\n\\n # The user now extracts the one selected as the best, and asks for a full plan\\n # We switch back to \'lm\' because this is the final result we want\\n with user():\\n lm += f\\"\\"\\"\\\\\\n I want to {goal}\\n Here is my plan: {options[int(lm2[\\"best\\"])]}\\n Please elaborate on this plan, and tell me how to best accomplish it.\\"\\"\\"\\n\\n # The plan is generated\\n with assistant():\\n lm += gen(name=\'plan\', max_tokens=500)\\n\\n return lm
\\nsystemYou are a helpful assistant.userI want to read more books\\nHere is my plan: Set aside 30 minutes of dedicated reading time each day.\\nPlease elaborate on this plan, and tell me how to best accomplish it.assistantSetting aside 30 minutes of dedicated reading time each day is a great plan to read more books. Here are some tips to help you accomplish this goal:\\n\\n1. Establish a routine: Choose a specific time of day that works best for you, whether it\'s in the morning, during lunch break, or before bed. Consistency is key to forming a habit.\\n\\n2. Create a reading-friendly environment: Find a quiet and comfortable spot where you can focus on your reading without distractions. It could be a cozy corner in your home, a park bench, or a local library.\\n\\n3. Minimize distractions: Put away your phone, turn off the TV, and avoid any other potential interruptions during your dedicated reading time. This will help you stay focused and fully immerse yourself in the book.\\n\\n4. Choose books that interest you: Select books that align with your personal interests, hobbies, or goals. When you\'re genuinely interested in the subject matter, you\'ll be more motivated to read regularly.\\n\\n5. Start with manageable goals: If you\'re new to reading or have a busy schedule, start with a smaller time commitment, such as 15 minutes, and gradually increase it to 30 minutes or more as you become more comfortable.\\n\\n6. Set a timer: Use a timer or a reading app that allows you to track your reading time. This will help you stay accountable and ensure that you dedicate the full 30 minutes to reading.\\n\\n7. Make reading enjoyable: Create a cozy reading atmosphere by lighting a candle, sipping a cup of tea, or playing soft background music. Engaging all your senses can enhance your reading experience.\\n\\n8. Join a book club or reading group: Consider joining a book club or participating in a reading group to connect with fellow book lovers. This can provide additional motivation, discussion opportunities, and book recommendations.\\n\\n9. Keep a reading log: Maintain a record of the books you\'ve read, along with your thoughts and reflections. This can help you track your progress, discover patterns in your reading preferences, and serve as a source of inspiration for future reading.\\n\\n10. Be flexible: While it\'s important to have a dedicated reading time, be flexible and adaptable. Life can sometimes get busy, so if you miss a day, don\'t be discouraged. Simply pick up where you left off and continue with your reading routine.\\n\\nRemember, the goal is to enjoy the process of reading and make it a regular part of your life. Happy reading!
This looks pretty neat! But what is it doing exactly? Running it makes a total of 7 calls to OpenAI, which I have put in this gist. Five of the seven API calls are "internal" thoughts asking the LLM to generate ideas. Even though the temperature is set to 1.0, these "ideas" are mostly redundant. The penultimate call to OpenAI enumerates these "ideas", which I've included below:
I want to read more books
Can you please comment on the pros and cons of each of the following options, and then pick the best option?
---
Option 0: Set aside dedicated time each day for reading.
Option 1: Set aside 30 minutes of dedicated reading time each day.
Option 2: Set aside dedicated time each day for reading.
Option 3: Set aside dedicated time each day for reading.
Option 4: Join a book club.
---
Please discuss each option very briefly (one line for pros, one for cons), and end by saying Best=X, where X is the number of the best option.
I know from experience that you are likely to get better results if you tell the language model to generate ideas in one shot. That way, the LLM can reference previous ideas and achieve more diversity. This is a good example of accidental complexity: it's very tempting to take this design pattern and apply it blindly. This is less of a critique of this particular framework, since the code makes it clear that 5 independent calls will happen. Either way, it's a good idea to check your work by inspecting API calls!
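For example, a single call that asks for all five options at once lets the model see (and avoid repeating) its earlier ideas. A minimal sketch, assuming the OpenAI Python client and the same reading goal:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=1.0,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "I want to read more books. Please give me 5 distinct options "
                       "for how to accomplish this, each at most one line, and make "
                       "sure no two options are duplicates.",
        },
    ],
)
print(response.choices[0].message.content)  # all 5 options come back in one completion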
LangChain is a multi-tool for all things LLM. Lots of people rely on LangChain when getting started with LLMs. The core LangChain library doesn't generally hide prompts from you; however, there are experimental features that do. Let's take a look at one of these features, called SmartLLMChain:
from langchain.prompts import PromptTemplate
from langchain_experimental.smart_llm import SmartLLMChain
from langchain_openai import ChatOpenAI

hard_question = "I have a 12 liter jug and a 6 liter jug.\
I want to measure 6 liters. How do I do it?"
prompt = PromptTemplate.from_template(hard_question)
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# (assumed) instantiate and run the chain with two ideas, matching the output below
chain = SmartLLMChain(llm=llm, prompt=prompt, n_ideas=2, verbose=True)
chain.run({})
Idea 1: 1. Fill the 12 liter jug completely.\\n2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.\\n3. Empty the 6 liter jug.\\n4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.\\n5. You now have 6 liters in the 6 liter jug.\\n\\nIdea 2: 1. Fill the 12 liter jug completely.\\n2. Pour the contents of the 12 liter jug into the 6 liter jug. This will leave you with 6 liters in the 12 liter jug.\\n3. Empty the 6 liter jug.\\n4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.\\n5. You now have 6 liters in the 6 liter jug.\\n\\nImproved Answer:\\n1. Fill the 12 liter jug completely.\\n2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.\\n3. Empty the 6 liter jug.\\n4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.\\n5. You now have 6 liters in the 6 liter jug.\\n\\nFull Answer:\\nTo measure 6 liters using a 12 liter jug and a 6 liter jug, follow these steps:\\n1. Fill the 12 liter jug completely.\\n2. Pour the contents of the 12 liter jug into the 6 liter jug until the 6 liter jug is full. This will leave you with 6 liters in the 12 liter jug and the 6 liter jug completely filled.\\n3. Empty the 6 liter jug.\\n4. Pour the remaining 6 liters from the 12 liter jug into the now empty 6 liter jug.\\n5. You now have 6 liters in the 6 liter jug.
Neat! So what happened exactly? While this API emits logs that show you a lot of information (available on this gist), the API request pattern is interesting:
1. Two separate API calls, one for each "idea".
2. Another API call that incorporates the two ideas as context, with the prompt:

"You are a researcher tasked with investigating the 2 response options provided. List the flaws and faulty logic of each answer options. Let'w work this out in a step by step way to be sure we have all the errors:"

3. A final API call that takes the critique from step 2 and generates an answer.
It's not clear that this approach is optimal. I am not sure it should take 4 separate API calls to accomplish this task. Perhaps the critique and the final answer could be generated in one step? Furthermore, the prompt has a spelling error (Let'w) and also focuses overly on identifying errors - which makes me skeptical that this prompt has been optimized or tested.
Instructor is a framework for structured outputs.
Here is a basic example from the project's README that allows you to extract structured data by using Pydantic to define your schema.
import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.patch(OpenAI())

class UserDetail(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[{"role": "user", "content": "Extract Jason is 25 years old"}])
We can see how this works by inspecting the API call logged to mitmproxy:
\\n{\\n \\"function_call\\": {\\n \\"name\\": \\"UserDetail\\"\\n },\\n \\"functions\\": [\\n {\\n \\"description\\": \\"Correctly extracted `UserDetail` with all the required parameters with correct types\\",\\n \\"name\\": \\"UserDetail\\",\\n \\"parameters\\": {\\n \\"properties\\": {\\n \\"age\\": {\\n \\"title\\": \\"Age\\",\\n \\"type\\": \\"integer\\"\\n },\\n \\"name\\": {\\n \\"title\\": \\"Name\\",\\n \\"type\\": \\"string\\"\\n }\\n },\\n \\"required\\": [\\n \\"age\\",\\n \\"name\\"\\n ],\\n \\"type\\": \\"object\\"\\n }\\n }\\n ],\\n \\"messages\\": [\\n {\\n \\"content\\": \\"Extract Jason is 25 years old\\",\\n \\"role\\": \\"user\\"\\n }\\n ],\\n \\"model\\": \\"gpt-3.5-turbo\\"\\n}
This is great for structured output - it does exactly what I want, and it correctly uses the OpenAI API the way I would use it if I were writing this manually (by defining a function schema). I would consider this specific API a zero-cost abstraction, meaning it does exactly what I expect it to with a minimal surface area.
However, instructor has other APIs that are more aggressive and write prompts for you. For example, consider this validation example. Running through that example should trigger similar questions to the exploration of LangChain's SmartLLMChain above. In this example, you will observe 3 LLM API calls to get the right answer, with the final payload looking like this:
\\n{\\n \\"function_call\\": {\\n \\"name\\": \\"Validator\\"\\n },\\n \\"functions\\": [\\n {\\n \\"description\\": \\"Validate if an attribute is correct and if not,\\\\nreturn a new value with an error message\\",\\n \\"name\\": \\"Validator\\",\\n \\"parameters\\": {\\n \\"properties\\": {\\n \\"fixed_value\\": {\\n \\"anyOf\\": [\\n {\\n \\"type\\": \\"string\\"\\n },\\n {\\n \\"type\\": \\"null\\"\\n }\\n ],\\n \\"default\\": null,\\n \\"description\\": \\"If the attribute is not valid, suggest a new value for the attribute\\",\\n \\"title\\": \\"Fixed Value\\"\\n },\\n \\"is_valid\\": {\\n \\"default\\": true,\\n \\"description\\": \\"Whether the attribute is valid based on the requirements\\",\\n \\"title\\": \\"Is Valid\\",\\n \\"type\\": \\"boolean\\"\\n },\\n \\"reason\\": {\\n \\"anyOf\\": [\\n {\\n \\"type\\": \\"string\\"\\n },\\n {\\n \\"type\\": \\"null\\"\\n }\\n ],\\n \\"default\\": null,\\n \\"description\\": \\"The error message if the attribute is not valid, otherwise None\\",\\n \\"title\\": \\"Reason\\"\\n }\\n },\\n \\"required\\": [],\\n \\"type\\": \\"object\\"\\n }\\n }\\n ],\\n \\"messages\\": [\\n {\\n \\"content\\": \\"You are a world class validation model. Capable to determine if the following value is valid for the statement, if it is not, explain why and suggest a new value.\\",\\n \\"role\\": \\"system\\"\\n },\\n {\\n \\"content\\": \\"Does `According to some perspectives, the meaning of life is to find purpose, happiness, and fulfillment. It may vary depending on individual beliefs, values, and cultural backgrounds.` follow the rules: don\'t say objectionable things\\",\\n \\"role\\": \\"user\\"\\n }\\n ],\\n \\"model\\": \\"gpt-3.5-turbo\\",\\n \\"temperature\\": 0\\n}
Concretely, I’m curious if these steps could be collapsed into two LLM calls instead of three. Furthermore, I wonder if generic validation functions (as supplied in the above payload) are the right way to critique output? I don’t know the answer, but this is an interesting design pattern that is worth poking at.
\\nAs far as LLM frameworks go, I really like this one. The core functionality of defining schemas with Pydantic is very convenient. The code is also very readable and easy to understand. Despite this, I still found it helpful to intercept instructor’s API calls to get another perspective.
There is a way to set a logging level in instructor to see the raw API calls; however, I like using a framework-agnostic approach :)
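For reference, the framework-agnostic setup I mean is simply routing the client's HTTPS traffic through mitmproxy. A minimal sketch (the proxy/certificate environment variables can vary by HTTP client, and my_llm_script.py is a placeholder for whatever code you are inspecting):

# start mitmproxy's web UI; it listens on port 8080 by default
mitmweb

# in another shell: send traffic through the proxy and trust mitmproxy's CA certificate
export HTTPS_PROXY=http://127.0.0.1:8080
export SSL_CERT_FILE=$HOME/.mitmproxy/mitmproxy-ca-cert.pem  # created the first time mitmproxy runs
python my_llm_script.py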
DSPy is a framework that helps you optimize your prompts against any arbitrary metric. There is a fairly steep learning curve to DSPy, partly because it introduces many new technical terms specific to its framework, like compilers and teleprompters. However, we can quickly peel back the complexity by looking at the API calls that it makes!
Let's run the minimal working example:
import time
import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
start_time = time.time()

# Set up the LM
turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=250)
dspy.settings.configure(lm=turbo)

# Load math questions from the GSM8K dataset
gms8k = GSM8K()
trainset, devset = gms8k.train, gms8k.dev
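The next snippet compiles a CoT program that isn't defined above. In DSPy's GSM8K quick-start it looks roughly like this (a sketch built on DSPy's documented ChainOfThought module):

# a minimal chain-of-thought program: one ChainOfThought step from question to answer
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)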
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Set up the optimizer: we want to "bootstrap" (i.e., self-generate) 8-shot examples of our CoT program.
# The optimizer will repeat this 10 times (plus some initial attempts) before selecting its best attempt on the devset.
config = dict(max_bootstrapped_demos=8, max_labeled_demos=8, num_candidate_programs=10, num_threads=4)

# Optimize! Use the `gsm8k_metric` here. In general, the metric is going to tell the optimizer how well it's doing.
teleprompter = BootstrapFewShotWithRandomSearch(metric=gsm8k_metric, **config)
optimized_cot = teleprompter.compile(CoT(), trainset=trainset, valset=devset)
Despite this being the official quick-start/minimal working example, this code took more than 30 minutes to run, and made hundreds of calls to OpenAI! This cost non-trivial time (and money), especially as an entry-point to the library for someone trying to take a look. There was no prior warning that this would happen.
DSPy made hundreds of API calls because it was iteratively sampling examples for a few-shot prompt and selecting the best ones according to the gsm8k_metric on a validation set. I was able to quickly understand this by scanning through the API requests logged to mitmproxy.
DSPy offers an inspect_history method which allows you to see the last n prompts and their completions:
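For example (a sketch, assuming the turbo LM object configured earlier):

turbo.inspect_history(n=3)  # prints the last 3 prompts and their completions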
I was able to verify that these prompts matched the last few API calls being made in mitmproxy. Overall, I would be tempted to keep the prompt and jettison the library. That being said, I am curious to see how this library evolves.
\\nDo I hate LLM libraries? No! I think many of the libraries in this blog post could be helpful if used thoughtfully in the right situations. However, I’ve witnessed too many people fall into the trap of using these libraries without understanding what they are doing.
\\nOne thing I focus on as an independent consultant is to make sure my clients don’t take on accidental complexity. It’s very tempting to adopt additional tools given all the excitement around LLMs. Looking at prompts is one way to mitigate that temptation.
I'm wary of frameworks that distance the human too far from LLMs. By whispering "Fuck you, show me the prompt!" when using these tools, you are empowered to decide for yourself.1
Acknowledgments: Thanks to Jeremy Howard and Ben Clavie for thoughtfully reviewing this post.
You don't have to whisper. Saying it out loud is fine too - let others know!↩︎
Hamel Husain
January 11, 2024
\\nAxolotl is a great project for fine-tuning LLMs. I started contributing to the project, and I found that it was difficult to debug. I wanted to share some tips and tricks I learned along the way, along with configuration files for debugging with VSCode. Moreover, I think being able to debug axolotl empowers developers who encounter bugs or want to understand how the code works. I hope this document helps you get started.
\\nI contributed this blog post’s contents as documentation for the axolotl project. You can find this content in the axolotl repo here.
While debugging, it's helpful to simplify your test scenario as much as possible. Here are some tips for doing so:
All of these tips are incorporated into the example configuration for debugging with VSCode below.
\\nMake sure you are using the latest version of axolotl: This project changes often and bugs get fixed fast. Check your git branch and make sure you have pulled the latest changes from main
.
Eliminate Concurrency: Restrict the number of processes to 1 for both training and data preprocessing:
- Set CUDA_VISIBLE_DEVICES to a single GPU, e.g. export CUDA_VISIBLE_DEVICES=0.
- Set dataset_processes: 1 in your axolotl config or run the training command with --dataset_processes=1.

Use a small dataset: Construct or use a small dataset from the HF Hub. When using a small dataset, you will often have to make sure sample_packing: False and eval_sample_packing: False to avoid errors. If you are in a pinch and don't have time to construct a small dataset but want to use one from the HF Hub, you can shard the data. This will still tokenize the entire dataset, but will only use a fraction of the data for training. For example, to shard the dataset into 20 pieces, add something like the config shown just below to your axolotl config.
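A sketch of the sharding config (I believe these are the relevant axolotl keys, but treat them as an assumption and verify against the current axolotl docs):

dataset_shard_num: 20   # split the dataset into 20 shards (assumed key name)
dataset_shard_idx: 0    # train on only the first shard (assumed key name)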
Use a small model: A good example of a small model is TinyLlama/TinyLlama-1.1B-Chat-v1.0.
Minimize iteration time: Make sure the training loop finishes as fast as possible, with these settings:

- micro_batch_size: 1
- max_steps: 1
- val_set_size: 0
Clear Caches: Axolotl caches certain steps and so does the underlying HuggingFace trainer. You may want to clear some of these caches when debugging:

- Delete the folder set in dataset_prepared_path: in your axolotl config. If you didn't set this value, the default is last_run_prepared.
- Delete the relevant ~/.cache/huggingface/datasets/... folder(s).
folder(s).The below example shows how to configure VSCode to debug data preprocessing of the sharegpt
format. This is the format used when you have the following in your axolotl config:
datasets:
  - path: <path to your sharegpt formatted dataset> # example on HF Hub: philschmid/guanaco-sharegpt-style
    type: sharegpt
If you are already familiar with advanced VSCode debugging, you can skip the below explanation and look at the files .vscode/launch.json and .vscode/tasks.json for an example configuration.
If you prefer to watch a video, rather than read, you can skip to the video tutorial below (but doing both is recommended).
Make sure you have an editable install of Axolotl, which ensures that changes you make to the code are reflected at runtime. Run the following commands from the root of this project:
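The exact extras change over time, so check the axolotl README, but an editable install looks roughly like this:

pip3 install packaging
pip3 install -e '.[flash-attn,deepspeed]'   # the extras are an assumption; use the ones the README lists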
If you are developing on a remote host, you can easily use VSCode to debug remotely. To do so, you will need to follow this Remote - SSH guide. You can also see the video below on Docker and Remote SSH debugging.
\\nThe easiest way to get started is to modify the .vscode/launch.json file in the axolotl GitHub repo. This is just an example configuration, so you may need to modify or copy it to suit your needs.
\\nFor example, to mimic the command cd devtools && CUDA_VISIBLE_DEVICES=0 accelerate launch -m axolotl.cli.train dev_sharegpt.yml
, you would use the below configuration1. Note that we add additional flags that override the axolotl config and incorporate the tips above (see the comments). We also set the working directory to devtools
and set the env
variable HF_HOME
to a temporary folder that is later partially deleted. This is because we want to delete the HF dataset cache before each run in order to ensure that the data preprocessing code is run from scratch.
// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/launch.json\\n{\\n \\"version\\": \\"0.2.0\\",\\n \\"configurations\\": [\\n {\\n \\"name\\": \\"Debug axolotl prompt - sharegpt\\",\\n \\"type\\": \\"python\\",\\n \\"module\\": \\"accelerate.commands.launch\\",\\n \\"request\\": \\"launch\\",\\n \\"args\\": [\\n \\"-m\\", \\"axolotl.cli.train\\", \\"dev_sharegpt.yml\\",\\n // The flags below simplify debugging by overriding the axolotl config \\n // with the debugging tips above. Modify as needed.\\n \\"--dataset_processes=1\\", // limits data preprocessing to one process\\n \\"--max_steps=1\\", // limits training to just one step\\n \\"--batch_size=1\\", // minimizes batch size\\n \\"--micro_batch_size=1\\", // minimizes batch size\\n \\"--val_set_size=0\\", // disables validation\\n \\"--sample_packing=False\\", // disables sample packing which is necessary for small datasets\\n \\"--eval_sample_packing=False\\",// disables sample packing on eval set\\n \\"--dataset_prepared_path=temp_debug/axolotl_outputs/data\\", // send data outputs to a temp folder\\n \\"--output_dir=temp_debug/axolotl_outputs/model\\" // send model outputs to a temp folder\\n ],\\n \\"console\\": \\"integratedTerminal\\", // show output in the integrated terminal\\n \\"cwd\\": \\"${workspaceFolder}/devtools\\", // set working directory to devtools from the root of the project\\n \\"justMyCode\\": true, // step through only axolotl code\\n \\"env\\": {\\"CUDA_VISIBLE_DEVICES\\": \\"0\\", // Since we aren\'t doing distributed training, we need to limit to one GPU\\n \\"HF_HOME\\": \\"${workspaceFolder}/devtools/temp_debug/.hf-cache\\"}, // send HF cache to a temp folder\\n \\"preLaunchTask\\": \\"cleanup-for-dataprep\\", // delete temp folders (see below)\\n }\\n ]\\n}
Additional notes about this configuration:
- justMyCode is set to true such that you step through only the axolotl code. If you want to step into dependencies, set this to false.
- preLaunchTask: cleanup-for-dataprep is defined in .vscode/tasks.json and is used to delete the following folders before debugging, which is essential to ensure that the data pre-processing code is run from scratch:
  - ./devtools/temp_debug/axolotl_outputs
  - ./devtools/temp_debug/.hf-cache/datasets
You may not want to delete these folders. For example, if you are debugging model training instead of data pre-processing, you may NOT want to delete the cache or output folders. You may also need to add additional tasks to the tasks.json file depending on your use case.
Below is the .vscode/tasks.json file that defines the cleanup-for-dataprep task. This task is run before each debugging session when you use the above configuration. Note how there are two tasks that delete the two folders mentioned above. The third task, cleanup-for-dataprep, is a composite task that combines the two. A composite task is necessary because VSCode does not allow you to specify multiple tasks in the preLaunchTask argument of the launch.json file.
// https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/.vscode/tasks.json\\n// this file is used by launch.json\\n{\\n \\"version\\": \\"2.0.0\\",\\n \\"tasks\\": [\\n // this task changes into the devtools directory and deletes the temp_debug/axolotl_outputs folder\\n {\\n \\"label\\": \\"delete-outputs\\",\\n \\"type\\": \\"shell\\",\\n \\"command\\": \\"rm -rf temp_debug/axolotl_outputs\\",\\n \\"options\\":{ \\"cwd\\": \\"${workspaceFolder}/devtools\\"},\\n \\"problemMatcher\\": []\\n },\\n // this task changes into the devtools directory and deletes the `temp_debug/.hf-cache/datasets` folder\\n {\\n \\"label\\": \\"delete-temp-hf-dataset-cache\\",\\n \\"type\\": \\"shell\\",\\n \\"command\\": \\"rm -rf temp_debug/.hf-cache/datasets\\",\\n \\"options\\":{ \\"cwd\\": \\"${workspaceFolder}/devtools\\"},\\n \\"problemMatcher\\": []\\n },\\n // this task combines the two tasks above\\n {\\n \\"label\\": \\"cleanup-for-dataprep\\",\\n \\"dependsOn\\": [\\"delete-outputs\\", \\"delete-temp-hf-dataset-cache\\"],\\n }\\n ]\\n}
Your debugging use case may differ from the example above. The easiest thing to do is to put your own axolotl config in the devtools folder and modify the launch.json file to use your config. You may also want to modify the preLaunchTask to delete different folders or not delete anything at all.
The following video tutorial walks through the above configuration and demonstrates how to debug with VSCode:
\\n\\nUsing official Axolotl Docker images is a great way to debug your code, and is a very popular way to use Axolotl. Attaching VSCode to Docker takes a few more steps.
On the host that is running axolotl (ex: if you are using a remote host), clone the axolotl repo and change your current directory to the root:
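For example:

git clone https://github.com/OpenAccess-AI-Collective/axolotl
cd axolotl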
If you already have axolotl cloned on your host, make sure you have the latest changes and change into the root of the project.
Next, run the desired docker image and mount the current directory. Below is a docker command you can run to do this:2
\\ndocker run --privileged --gpus \'\\"all\\"\' --shm-size 10g --rm -it --name axolotl --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,src=\\"${PWD}\\",target=/workspace/axolotl -v ${HOME}/.cache/huggingface:/root/.cache/huggingface winglian/axolotl:main-py3.10-cu118-2.0.1
To understand which containers are available, see the Docker section of the README and the DockerHub repo. For details of how the Docker containers are built, see axolotl’s Docker CI builds.
You will now be in the container. Next, perform an editable install of Axolotl:
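As before, the exact extras are best taken from the README; a sketch:

pip3 install -e '.[flash-attn,deepspeed]'   # extras are an assumption; match the README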
\\n\\nNext, if you are using a remote host, Remote into this host with VSCode. If you are using a local host, you can skip this step.
\\nNext, select Dev Containers: Attach to Running Container...
using the command palette (CMD + SHIFT + P
) in VSCode. You will be prompted to select a container to attach to. Select the container you just created. You will now be in the container with a working directory that is at the root of the project. Any changes you make to the code will be reflected both in the container and on the host.
Now you are ready to debug as described above (see Debugging with VSCode).
Here is a short video that demonstrates how to attach to a Docker container on a remote host:
The config actually mimics the command CUDA_VISIBLE_DEVICES=0 python -m accelerate.commands.launch -m axolotl.cli.train devtools/sharegpt.yml, but this is the same thing.↩︎
Many of the flags in this docker command are recommended best practices by Nvidia when using nvidia-container-toolkit. You can read more about these flags here.↩︎
Hamel Husain
January 9, 2024
\\nDokku is an open-source Platform as a Service (PaaS) that runs on a single server of your choice. It’s like Heroku, but you own it. It is a great way to get the benefits of Heroku without the costs (Heroku can get quite expensive!). I need to deploy many applications for my LLM consulting work. Having a cost-effective, easy-to-use serverless platform is essential for me.
I run a Dokku server on a $7/month VPS on OVHcloud for non-GPU workloads. These applications include things like nbsanity and data cleaning tools for LLMs.
Some of the features I love about Dokku:
Make sure you install Dokku on your VPS. As I mentioned, I use OVH.
An easy way to deploy applications is with a Docker container.
To deploy a Docker container, I put a Dockerfile in the root of my git repo like this:
Dockerfile
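The original Dockerfile isn't shown here; as a placeholder, a minimal one for a hypothetical Python web app might look like this (entirely illustrative - the file names and entrypoint are assumptions):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# Dokku uses the EXPOSE'd port to route traffic to the app
EXPOSE 5000
# hypothetical entrypoint
CMD ["python", "app.py"]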
On the Dokku host, create the app:
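Assuming the app is called myapp:

dokku apps:create myapp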
\\n\\nLocally, set up access to the Dokku host and name it dokku
in your ~/.ssh/config
file. For example, here is mine:
Host dokku
    HostName <The external IP address of your Dokku host>
    User ubuntu
    IdentityFile /Users/hamel/.ssh/dokku
Locally, add the Dokku host as a remote and push to it:
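A sketch, assuming the app name myapp from above and a main branch:

git remote add dokku dokku@dokku:myapp
git push dokku main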
\\n\\nThat’s it - your app should be running on the Dokku host! Your local logs will print the URL that your application is served on, which by default will be myapp.yourdomain.com
. You can also scale it up/down with the following command:
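For example, to run two web processes (run on the Dokku host, or remotely as shown later):

dokku ps:scale myapp web=2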
We are just scratching the surface. For more details, see the Dokku docs.
\\nGitHub Pages is annoying in that you can’t easily deploy private static sites without paying for an expensive Enterprise account. With Dokku, you can easily deploy a static site from a private GitHub Repo and password-protect it.
We will assume that you have a static site in a git repo in a folder named _site.
On the Dokku host, create an app named mysite and set the NGINX_ROOT environment variable to _site:
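For example:

dokku apps:create mysite
dokku config:set mysite NGINX_ROOT=_site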
Also on the Dokku host, install basic auth and set permissions so the plugin can work properly.
\\n# do setup for the auth plugin that we will use later\\nsudo dokku plugin:install https://github.com/dokku/dokku-http-auth.git\\nsudo chmod +x /home/dokku
Then execute the following commands from the root of your git repo that contains the static site:
\\n1touch .static\\n2echo BUILDPACK_URL=https://github.com/dokku/buildpack-nginx > .env\\n3git remote add dokku dokku@dokku:mysite
dokku
that this is a static site\\ndokku
to use the nginx buildpack for static sites (it will usually automatically detect this, but if you have a project with code and a static site, you need to tell it to use the nginx buildpack so it doesn’t get confused).\\ndokku
host as a remote. For this to work, make sure dokku
is a hostname in your ~/.ssh/config
file as described in the previous section.\\nFinally, deploy your application:
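A sketch, assuming your site lives on the main branch:

git push dokku main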
You can now add auth by running the following command on the Dokku host:
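The exact subcommand depends on the dokku-http-auth plugin version, so treat this as an assumption and check the plugin docs; with recent versions it looks roughly like:

dokku http-auth:enable mysite someuser somepassword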
\\n\\nYou can add multiple usernames/passwords and even filter specific IPs. See the docs.
\\nIt’s often desirable to have HTTPS for your site. Dokku makes this easy with the Let’s Encrypt Plugin, which will even auto-renew for you. I don’t use this, because I’m letting Cloudflare handle this with its proxy.
If you are using Cloudflare this way, activating this plugin will mess things up (don't worry, it's easy to disable). Honestly, I think it's easier to let Cloudflare handle it if you are already doing so.
\\nYou can automatically deploy Dokku apps with GitHub Actions, which is helpful if you don’t want to fiddle with pushing to the Dokku host. Here is an example GitHub Action workflow that does this:
deploy-dokku.yml
name: CI\\non:\\n workflow_dispatch:\\n push:\\n branches: [main]\\n\\nconcurrency: # Cancel previous jobs to avoid deploy locks on dokku\\n group: ${{ github.ref }}\\n cancel-in-progress: true\\n\\njobs:\\n deploy-dokku:\\n runs-on: ubuntu-latest\\n steps:\\n - name: Checkout code\\n uses: actions/checkout@v2\\n with:\\n fetch-depth: 0\\n \\n - name: Install SSH key\\n run: |\\n echo \\"${{ secrets.DOKKU_SSH_PRIVATE_KEY }}\\" > private_key.pem\\n chmod 600 private_key.pem\\n\\n - name: Add remote and push\\n run: |\\n git remote add dokku dokku@rechat.co:llm-eval\\n GIT_SSH_COMMAND=\\"ssh -i private_key.pem -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no\\" git push dokku main -f
These are things I often forget, so I'm writing them down here. For these examples, assume my app is named llm-eval and my host is rechat.co.
You don't have to ssh into the Dokku host just to execute commands. You can execute them remotely via the dokku user like this:
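For example, to tail the app's logs from your laptop (a sketch):

ssh dokku@rechat.co logs llm-eval -t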
This is how you can invalidate the docker cache for a fresh build:
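I believe the relevant command is the repo plugin's purge-cache (an assumption - confirm with dokku help on your host):

ssh dokku@rechat.co repo:purge-cache llm-eval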
\\n\\nSometimes you want to rebuild without pushing. There are many ways to do this, but one way is like this:
\\n\\nI had to dig up these details whenever I wanted to deploy a new app, so I had to write it up anyway. I hope you find it useful, too!
Lots of people experience fiddly behavior when using LLMs. For example:
Unironically I found this to be very helpful when prompting LLMs. Giving them spaces and new lines pic.twitter.com/vVuxcCuDzB

— anton (@abacaj) November 24, 2023
If you aren't careful, these can be very hard to debug. This is because of the subtle ways tokenizers work, which are not always easy to see by looking at the text.
The below example demonstrates how things can get confusing and can drift between training and inference time.
Popular frameworks like axolotl construct prompts by concatenating tokens instead of strings.1 It is reasonable to decode the training data to check what the prompt template is:
\\nFor example, a prompt may be constructed like this:
\\n\\nIt’s common for inference servers to assemble the prompt for you. The below looks like it should be fine, right?
\\n\\nWrong! Notice the difference in the decoding of the prompt vs the training data. This is a subtle problem that can be hard to debug.
\\n\\nDecode your inference data right before your forward pass. For example, you’ll notice the newline is missing if you do this. This is one way to tell that something fishy is going on.
\\n\\nUse the new HuggingFace chat template when possible. This will help avoid these issues (however, I would still check using method #1 to be sure!). Related GitHub Issue comment.
\\nThis is real example of how tokenization drift can bite you.
\\nfrom transformers import AutoTokenizer\\ntokenizer = AutoTokenizer.from_pretrained(\\"NousResearch/Llama-2-7b-chat-hf\\")\\n\\nchat = [\\n {\\"role\\": \\"system\\", \\"content\\": \\"lorem\\"},\\n {\\"role\\": \\"user\\", \\"content\\": \\"abc\\"},\\n {\\"role\\": \\"assistant\\", \\"content\\": \\"ipsum\\"},\\n {\\"role\\": \\"user\\", \\"content\\": \\"123\\"},\\n {\\"role\\": \\"assistant\\", \\"content\\": \\"sit\\"},\\n]\\n\\nids = tokenizer.apply_chat_template(chat)\\nprint(tokenizer.decode(ids))
<s>[INST] <<SYS>>
lorem
<</SYS>>

abc [/INST] ipsum</s><s>[INST] 123 [/INST] sit</s>
Notice that there is no space after the bos token (<s>).

Got the token ids from this test.
axolotl_ids = [1, 518, 25580, 29962, 3532, 14816, 29903, 6778, 13,
               29880, 3668, 13, 29966, 829, 14816, 29903, 6778, 13,
               13, 10736, 518, 29914, 25580, 29962, 23421, 2, 1,
               518, 25580, 29962, 29871, 29896, 29906, 29941, 518,
               29914, 25580, 29962, 7845, 2]
print(tokenizer.decode(axolotl_ids))
<s> [INST] <<SYS>>
lorem
<</SYS>>

abc [/INST] ipsum</s><s> [INST] 123 [/INST] sit</s>
See the second token (518); this is a mismatch with the HF chat template, which has 29961 in that position.
Axolotl assembles prompts in token space rather than string space.
tokenizer.encode('<s>', add_special_tokens=False) + tokenizer.encode('[INST]', add_special_tokens=False)
[1, 518, 25580, 29962]
HF Chat templates interpolate strings instead:
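A sketch of what string-space assembly looks like (contrast with the token-space concatenation above; the exact template string here is illustrative):

# the template is rendered to a single string first, and only then tokenized
text = "<s>[INST] abc [/INST]"
print(tokenizer.encode(text, add_special_tokens=False))
# compare these ids with the token-concatenation ids above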
These are other examples of people being bitten by drift in tokenization between training and inference time:
\\nThis is for good reason, as masking must also be done at the token level.↩︎
October 30, 2024
I think many people should build their own data annotation/curation tools for LLMs. The benefits far outweigh the costs in many situations, especially when using general-purpose front-end frameworks. It's too critical of a task to outsource without careful consideration. Furthermore, you don't want to get constrained by the limitations of a vendor's tool early on.
I recommend using Shiny for Python for reasons discussed here. I wouldn't recommend Streamlit for reasons discussed here.
\\n\\n\\nOne pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the problem. The canonical example here is Andrej Karpathy doing the ImageNet 2000-way classification task himself.
I couldn’t agree with Jason more. I don’t think people look at their data enough. Building your own tools so you can quickly sort through and curate your data is one of the highest-impact activities you can do when working with LLMs. Looking at and curating your own data is critical for both evaluation and fine-tuning.
At the outset, I tried to avoid building something myself. I tried the following vendors who provide tools for data curation/review:
\\nThese tools are at varying levels of maturity. I interacted with the developers on all these products, and they were super responsive, kind and aware of these limitations. I expect that these tools will improve significantly over time.
\\nOne thing that became clear to me while trying these vendors is the importance of being able to hack these tools to fit your specific needs. Every company you work with will have an idiosyncratic tech stack and tools that you might want to integrate into this data annotation tool. This led me to build my own tools using general-purpose frameworks.
Python has really great front-end frameworks that are easy to use, like Gradio, Panel, and Streamlit. There is a new kid on the block, Shiny for Python, which was my favorite after evaluating all of them.
Reasons I liked Shiny the most:
I found that Shiny apps always required much less code and were easier to understand than the other frameworks.
\\n\\n\\nOctober 30, 2024
\\nA previous version of this note suggested that you could run Llama 70b on a single A100. This was incorrect. The Modal container was caching the download of the much smaller 7b model. I have updated the post to reflect this. h/t to Cade Daniel for finding the mistake.
\\nLarge models like Llama-2-70b may not fit in a single GPU. I previously profiled the smaller 7b model against various inference tools. When a model is too big to fit on a single GPU, we can use various techniques to split the model across multiple GPUs.
\\nI used Modal Labs for serverless compute. Modal is very economical and built for machine learning use cases. Unlike other clouds, there are plenty of A100s available. They even give you $30 of free credits, which is more than enough to run the experiments in this note. Thanks to Modal, the scripts I reference in this note are reproducible.
In this note, I'm using modal client version 0.50.2889.

vLLM

vLLM supports tensor parallelism, which you can enable by passing the tensor_parallel_size argument to the LLM constructor.
I modified this example Modal code for Llama v2 13b to run Llama v2 70b on 4 GPUs with tensor parallelism. Below is a simplified diff with the most important changes:
def download_model_to_folder():\\n from huggingface_hub import snapshot_download\\n\\n snapshot_download(\\n- \\"meta-llama/Llama-2-13b-chat-hf\\",\\n+ \\"meta-llama/Llama-2-70b-chat-hf\\",\\n local_dir=\\"/model\\",\\n token=os.environ[\\"HUGGINGFACE_TOKEN\\"],\\n )\\n\\nimage = (\\n Image.from_dockerhub(\\"nvcr.io/nvidia/pytorch:22.12-py3\\")\\n .pip_install(\\"torch==2.0.1\\", index_url=\\"https://download.pytorch.org/whl/cu118\\")\\n+ # Pin vLLM to 8/2/2023\\n+ .pip_install(\\"vllm @ git+https://github.com/vllm-project/vllm.git@79af7e96a0e2fc9f340d1939192122c3ae38ff17\\")\\n- # Pin vLLM to 07/19/2023\\n- .pip_install(\\"vllm @ git+https://github.com/vllm-project/vllm.git@bda41c70ddb124134935a90a0d51304d2ac035e8\\")\\n # Use the barebones hf-transfer package for maximum download speeds. No progress bar, but expect 700MB/s.\\n- .pip_install(\\"hf-transfer~=0.1\\")\\n+ #Force a rebuild to invalidate the cache (you can remove `force_build=True` after the first time)\\n+ .pip_install(\\"hf-transfer~=0.1\\", force_build=True)\\n .run_function(\\n download_model_to_folder,\\n secret=Secret.from_name(\\"huggingface\\"),\\n timeout=60 * 20)\\n)\\n...\\n\\n-@stub.cls(gpu=\\"A100\\", secret=Secret.from_name(\\"huggingface\\"))\\n+# You need a minimum of 4 A100s that are the 40GB version\\n+@stub.cls(gpu=gpu.A100(count=4, memory=40), secret=Secret.from_name(\\"huggingface\\"))\\nclass Model:\\n def __enter__(self):\\n from vllm import LLM\\n\\n # Load the model. Tip: MPT models may require `trust_remote_code=true`.\\n- self.llm = LLM(MODEL_DIR)\\n+ self.llm = LLM(MODEL_DIR, tensor_parallel_size=4)\\n...
See big-inference-vllm.py for the actual script I used.
I found that when I ran the above code and changed the model name, I had to force a rebuild of the image to invalidate the cache. Otherwise, the old version of the model would be used. You can force a rebuild by adding force_build=True to the .pip_install call.
When I initially wrote this note, I was fooled into believing I could load meta-llama/Llama-2-70b-chat-hf on a single A100. It was due to this tricky issue of the container caching the download of the much smaller 7b model. 🤦
After setting the appropriate secrets for HuggingFace and Weights & Biases, you can run this code on Modal with the following command:
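Something like the following (assuming the script name mentioned above and the Modal CLI):

modal run big-inference-vllm.py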
You need at least 4 A100 GPUs to serve Llama v2 70b.
\\nEven though distributed inference is interesting for big models that do not fit on a single GPU, interesting things happen when you serve smaller models this way. Below, I test throughput for Llama v2 7b
on 1, 2, and 4 GPUs. The throughput is measured by passsing these 59 prompts to llm.generate
. llm.generate
is described in the vLLM documentation:
Call llm.generate to generate the outputs. It adds the input prompts to vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput.
Here are the results, averaged over 5 runs for each row:
\\n\\n | \\n | \\n | avg tok/sec | \\n
---|---|---|---|
model | \\nGPU | \\nnum_gpus | \\n\\n |
Llama-2-70b-chat-hf | \\nNVIDIA A100-SXM4-40GB | \\n4 | \\n380.9 | \\n
Llama-2-7b-chat-hf | \\nNVIDIA A10 | \\n1 | \\n458.8 | \\n
2 | \\n497.3 | \\n||
4 | \\n543.6 | \\n||
NVIDIA A100-SXM4-40GB | \\n1 | \\n842.9 | \\n|
2 | \\n699.1 | \\n||
4 | \\n650.7 | \\n
You can see all the individual runs here. In my experiments, the 70b model needed a minimum of 4 A100s to run, so that’s why there is only one row for that model (Modal only has instances with 1, 2, or 4 GPUs).
\\n```{python}
\\nThe tok/sec number you see here is VERY different than the latency benchmark shown on this note. This particular benchmark maximizes throughput by running multiple requests in parallel. The previous latency benchmark measures the time it takes to process a single request.
The Llama v2 70b model is only ~2x slower than its 7b counterpart.

In theory, Pipeline Parallelism ("PP") is slower than Tensor Parallelism, but tools for PP are compatible with a wider range of models from the HuggingFace Hub. By default, HuggingFace accelerate will automatically split the model across multiple GPUs when you pass device_map="auto". (Accelerate offers other kinds of parallelism as well, like integrations with DeepSpeed.)
This blog post and these docs are an excellent place to start. I will explore this and other kinds of parallelism in future notes.
\\n\\n\\nAs of 8/6/2023 2 A10s costs .000612 / sec
on Modal, whereas 1 A100 40GB will cost 0.001036 / sec
. See this pricing chart↩︎
For A10 and A100s you can only get up to 4 GPUs. Furthermore, I ran into an issue with vLLM and llama 70b, where it doesn’t like an odd number of GPUs.↩︎
October 30, 2024
\\nBelow is a summary of my findings:
Use an inference server like TGI (or vLLM) if you want to deploy HuggingFace LLMs in a standard way. TGI has some nice features like telemetry baked in (via OpenTelemetry) and integration with the HF ecosystem like inference endpoints. One thing to note is that as of 7/28/2023, the license for TGI was changed to be more restrictive, which may interfere with certain commercial uses. I am personally not a fan of the license.

This study focuses on various approaches to optimizing latency. Specifically, I want to know which tools are the most effective at optimizing latency for open source LLMs. In order to focus on latency, I hold the following variables constant:
- n = 1 for all prediction requests (holding throughput constant).1
- Nvidia A6000 GPU, unless otherwise noted.
- Output tokens capped at 200.
and using a A6000
GPU (unless noted otherwise), I also made sure I warmed up the model by sending an initial inference request before measuring latency.
\\n | \\n | \\n | avg tok/sec | \\navg time (seconds) | \\navg output token count | \\n
---|---|---|---|---|---|
platform | \\noptions | \\ngpu | \\n\\n | \\n | \\n |
CTranslate2 | \\nfloat16 quantization | \\nA6000 | \\n44.8 | \\n4.5 | \\n200.0 | \\n
int8 quantization | \\nA6000 | \\n62.6 | \\n3.2 | \\n200.0 | \\n|
HF Hosted Inference Endpoint | \\n- | \\nA10G | \\n30.4 | \\n6.6 | \\n202.0 | \\n
HuggingFace Transformers (no server) | \\n- | \\nA6000 | \\n24.6 | \\n7.5 | \\n181.4 | \\n
nf4 4bit quantization bitsandbytes | \\nA6000 | \\n24.3 | \\n7.6 | \\n181.4 | \\n|
TGI | \\n- | \\nA6000 | \\n21.1 | \\n9.5 | \\n200.0 | \\n
quantized w/ GPTQ | \\nA6000 | \\n23.6 | \\n8.8 | \\n200.0 | \\n|
quantized w/ bitsandbytes | \\nA6000 | \\n1.9 | \\n103.0 | \\n200.0 | \\n|
mlc | \\nq4f16 | \\nA6000 | \\n117.1 | \\n1.3 | \\n153.9 | \\n
text-generation-webui | \\nexllama | \\nA6000 | \\n77.0 | \\n1.7 | \\n134.0 | \\n
vllm | \\n- | \\nA100 (on Modal Labs) | \\n41.5 | \\n3.4 | \\n143.1 | \\n
A6000 | \\n46.4 | \\n3.8 | \\n178.0 | \\n
In some cases I did not use an A6000 because the platform didn't have that particular GPU available. You can ignore these rows if you like, but I still think it is valuable information. I had access to an A6000, so I just used what I had.
I noticed that the output of the LLM was quite different (fewer tokens) when using vLLM. I am not sure if I did something wrong here, or if it changes the behavior of the LLM.
Furthermore, the goal was not to be super precise on these benchmarks, but rather to get a general sense of how things work and how they might compare to each other out of the box. Some of the tools above are inference servers which perform logging, tracing, etc. in addition to optimizing models, which affects latency. The idea is to see where there are significant differences between tools. I discussed this more here.
\\nOne capability you need to be successful with open source LLMs is the ability to serve models efficiently. There are two categories of tools for model inference:
\\nInference servers: these help with providing a web server that can provide a REST/grpc or other interface to interact with your model as a service. These inference servers usually have parameters to help you make trade-offs between throughput and latency. Additionally, some inference servers come with additional features like telemetry, model versioning and more. You can learn more about this topic the serving section of these notes. For LLMs, popular inference servers are the Text Generation Inference (TGI) and vLLM.
Model Optimization: These modify your model to make them faster for inference. Examples include quantization, Paged Attention, Exllama and more.
It is common to use both inference servers and model optimization techniques in conjunction. Some inference servers like TGI and vLLM even help you apply optimization techniques.3
\\nOther than benchmarking, an important goal of this study was to understand how to use different platforms & tools.
\\nStart with compiling the model as shown in these docs
After installing MLC, you can compile meta-llama/Llama-2-7b-chat-hf like so:
python3 -m mlc_llm.build \
--hf-path meta-llama/Llama-2-7b-chat-hf \
--target cuda --quantization q4f16_1
The arguments for the compilation are documented here. This puts the model in the ./dist/ folder with the name Llama-2-7b-chat-hf-q4f16_1.
You can use their python client to interact with the compiled model:
\\nfrom mlc_chat import ChatModule, ChatConfig\\ncfg = ChatConfig(max_gen_len=200)\\ncm = ChatModule(model=\\"Llama-2-7b-chat-hf-q4f16_1\\", chat_config=cfg)\\noutput = cm.generate(prompt=prompt)
You can see the full benchmarking code here.
I wasn't able to get meta-llama/Llama-2-7b-hf to run correctly with the supplied python client, so I am using the chat variant (Llama-2-7b-chat-hf) as a proxy. I asked the kind folks who work on the mlc project, and they said the python client is currently designed for chat, such that they have this system prompt that is hard coded for llama models:
conv.system =\\n (\\"[INST] <<SYS>>\\\\n\\\\nYou are a helpful, respectful and honest assistant. \\"\\n \\"Always answer as helpfully as possible, while being safe. \\"\\n \\"Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, \\"\\n \\"or illegal content. \\"\\n \\"Please ensure that your responses are socially unbiased and positive in nature.\\\\n\\\\n\\"\\n \\"If a question does not make any sense, or is not factually coherent, explain why instead \\"\\n \\"of answering something not correct. \\"\\n \\"If you don\'t know the answer to a question, please don\'t share false \\"\\n \\"information.\\\\n<</SYS>>\\\\n\\\\n \\");
If you want to fix this, you must edit mlc-chat-config.json, changing conv_template to LM. These docs say more about the config.json.
The config file is located in ./dist/<model-name>/params/mlc-chat-config.json. For example:
> cat ./dist/Llama-2-7b-hf-q4f16_1/params/mlc-chat-config.json\\n\\n{\\n \\"model_lib\\": \\"Llama-2-7b-hf-q4f16_1\\",\\n \\"local_id\\": \\"Llama-2-7b-hf-q4f16_1\\",\\n \\"conv_template\\": \\"llama-2\\",\\n \\"temperature\\": 0.7,\\n \\"repetition_penalty\\": 1.0,\\n \\"top_p\\": 0.95,\\n \\"mean_gen_len\\": 128,\\n \\"max_gen_len\\": 512,\\n \\"shift_fill_factor\\": 0.3,\\n \\"tokenizer_files\\": [\\n \\"tokenizer.json\\",\\n \\"tokenizer.model\\"\\n ],\\n \\"model_category\\": \\"llama\\",\\n \\"model_name\\": \\"Llama-2-7b-hf\\"\\n}
CTranslate2 is an optimization tool that can make models ridiculously fast. h/t to Anton. The documentation for CTranslate2 contains specific instructions for llama models.
To optimize llama v2, we first need to quantize the model. This can be done like so:
ct2-transformers-converter --model meta-llama/Llama-2-7b-hf --quantization int8 --output_dir llama-2-7b-ct2 --force
meta-llama/Llama-2-7b-hf refers to the HuggingFace repo for this model. The benchmarking code is as follows (can also be found here):
import time\\nimport ctranslate2\\nimport transformers\\nimport sys\\nsys.path.append(\'../common/\')\\nfrom questions import questions\\nimport pandas as pd\\n\\ngenerator = ctranslate2.Generator(\\"llama-2-7b-ct2\\", device=\\"cuda\\")\\ntokenizer = transformers.AutoTokenizer.from_pretrained(\\"meta-llama/Llama-2-7b-hf\\")\\n\\ndef predict(prompt:str):\\n \\"Generate text give a prompt\\"\\n start = time.perf_counter()\\n tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))\\n results = generator.generate_batch([tokens], sampling_topk=1, max_length=200, include_prompt_in_result=False)\\n tokens = results[0].sequences_ids[0]\\n output = tokenizer.decode(tokens)\\n request_time = time.perf_counter() - start\\n return {\'tok_count\': len(tokens),\\n \'time\': request_time,\\n \'question\': prompt,\\n \'answer\': output,\\n \'note\': \'CTranslate2 int8 quantization\'}\\n\\nif __name__ == \'__main__\':\\n counter = 1\\n responses = []\\n\\n for q in questions:\\n if counter >= 2: responses.append(predict(q))\\n counter += 1\\n\\n df = pd.DataFrame(responses)\\n df.to_csv(\'bench-ctranslate-int8.csv\', index=False)
The license for TGI was recently changed away from Apache 2.0 to be more restrictive. Be careful when using TGI in commercial applications.
Text Generation Inference, which is often referred to as "TGI", was easy to use without any optimization. You can run it like this:
\\n“start_server.sh”\\n
#!/bin/bash

if [ -z "$HUGGING_FACE_HUB_TOKEN" ]
then
    echo "HUGGING_FACE_HUB_TOKEN is not set. Please set it before running this script."
    exit 1
fi

model="TheBloke/Llama-2-7B-GPTQ"
volume=$PWD/data

docker run --gpus all \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 \
    --shm-size 5g -p 8081:80 \
    -v $volume:/data ghcr.io/huggingface/text-generation-inference \
    --max-best-of 1 "$@"
We can then run the server with this command:
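A sketch, assuming the base (non-quantized) model; anything after the script name is passed through to the TGI launcher via "$@":

bash start_server.sh --model-id meta-llama/Llama-2-7b-hf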
Quantization was very difficult to get working. There is a --quantize flag which accepts bitsandbytes and gptq. The bitsandbytes approach makes inference much slower, which others have reported.
To make gptq work for llama v2 models requires a bunch of work. You have to install the text-generation-server, which can take a while and is very brittle to get right. I had to step through the Makefile carefully. After that, you have to download the weights with:
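I believe the command is the server CLI's download-weights (treat the exact invocation as an assumption and check the TGI repo):

text-generation-server download-weights meta-llama/Llama-2-7b-hf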
You can run the following command to perform the quantization (the last argument is the destination directory where the weights are stored).
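Again, an assumption based on the text-generation-server CLI at the time:

text-generation-server quantize meta-llama/Llama-2-7b-hf /data/llama-2-7b-gptq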
However, this step is not needed for the most popular models, as someone will likely already have quantized and uploaded them to the Hub.
\nAlternatively, you can use a pre-quantized model that has been uploaded to the Hub. TheBloke/Llama-2-7B-GPTQ is a good example of one. To get this to work, you have to be careful to set the GPTQ_BITS and GPTQ_GROUPSIZE environment variables to match the config. For example, this config necessitates setting GPTQ_BITS=4 and GPTQ_GROUPSIZE=128; these are already set in start_server.sh shown above. This PR will eventually fix that.
To use TheBloke/Llama-2-7B-GPTQ with TGI, I can use the same bash script with the following arguments:
\\n\\nWhen I first drafted this study I got the following response on twitter:
\\n\\n\\n\\n\\nBased on your code (https://t.co/hSYaPTsEaK) it seems like you measure the full HTTP request, which is like comparing trees to an apple.\\n
\\n— Philipp Schmid (@_philschmid) July 29, 2023\\n
Philipp certainly has a point! I am indeed testing both! I’m looking for big differences in tools here, and since some inference servers have optimization tools, and some optimization tools do not have an inference server, I cannot do a true apples-to-apples comparison. However, I think it’s still useful to try different things as advertised to see what is possible, and also to take note of really significant gaps in latency between tools.
\nTherefore, I ran the following tests to perform similar optimizations to TGI, but without the server, to see what happened:
\nI was able to get slightly better performance without the TGI server, as predicted by Philipp, but it did not account for the massive gap between some tools (which is exactly the kind of thing I was looking for).
\\nTo benchmark quantization with bitsandbytes, I followed this blog post and wrote this benchmarking code. I quantized the model by loading it like this:
\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\nmodel_id = \"meta-llama/Llama-2-7b-hf\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\nnf4_config = BitsAndBytesConfig(\n load_in_4bit=True,\n bnb_4bit_quant_type=\"nf4\",\n bnb_4bit_compute_dtype=torch.bfloat16\n)\nmodel_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
Unlike TGI, I was able to get bitsandbytes to work properly here, but just like TGI it didn’t speed anything up for me with respect to inference latency. As reflected in the benchmark table, I got nearly the same results with transformers without any optimizations.
\\nI also quantized the model using AutoGPTQ without an inference server to compare against TGI. The code for that is here.
\nThe results were so bad (~5 tok/sec) that I decided not to put this in the table, because it seemed quite off to me.
\nAman let me know about text-generation-web-ui, and also these instructions for quickly experimenting with ExLlama and ggml. I wasn’t able to get the ggml variant to work properly, unfortunately. If you are really serious about using exllama, I recommend trying to use it without the text generation UI and looking at the exllama repo, specifically at test_benchmark_inference.py. (I didn’t have time for this, but if I were going to use exllama for anything serious, I would go this route.)
From the root of the text-generation-web-ui repo, you can run the following commands to start an inference server optimized with ExLlama:
python3 download-model.py TheBloke/Llama-2-7B-GPTQ\\npython3 server.py --listen --extensions openai --loader exllama_hf --model TheBloke_Llama-2-7B-GPTQ
After the server was started, I used this code to conduct the benchmark.
\nOverall, I didn’t like this particular piece of software much. It’s a bit bloated because it’s trying to do too many things at once (an inference server, web UIs, and other optimizations). That being said, the documentation is good and it is easy to use.
\\nI don’t think there is any particular reason to use this unless you want an end-to-end solution that also comes with a web user-interface (which many people want!).
\\nvLLM only works with CUDA 11.8, which I configured using this approach. After configuring CUDA and installing the right version of PyTorch, you need to install the bleeding edge from git:
\\n\\nA good recipe to use for vLLM can be find on these Modal docs. Surprisingly, I had much lower latency when running on a local A6000
vs. a hosted A100
on Modal Labs. It’s possible that I did something wrong here. Currently, vLLM
is the fastest solution for when you need distributed inference (i.e. when your model doesn’t fit on a single GPU)..
vLLM
offers a server, but I benchmarked the model locally using their tools instead. The code for the benchmarking can be found here:
from vllm import SamplingParams, LLM\\n\\n#from https://modal.com/docs/guide/ex/vllm_inference\\n\\nquestions = [\\n # Coding questions\\n \\"Implement a Python function to compute the Fibonacci numbers.\\",\\n \\"Write a Rust function that performs binary exponentiation.\\",\\n \\"What are the differences between Javascript and Python?\\",\\n # Literature\\n \\"Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.\\",\\n \\"Who does Harry turn into a balloon?\\",\\n \\"Write a tale about a time-traveling historian who\'s determined to witness the most significant events in human history.\\",\\n # Math\\n \\"What is the product of 9 and 8?\\",\\n \\"If a train travels 120 kilometers in 2 hours, what is its average speed?\\",\\n \\"Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.\\",\\n]\\n\\nMODEL_DIR = \\"/home/ubuntu/hamel-drive/vllm-models\\"\\n\\ndef download_model_to_folder():\\n from huggingface_hub import snapshot_download\\n import os\\n\\n snapshot_download(\\n \\"meta-llama/Llama-2-7b-hf\\",\\n local_dir=MODEL_DIR,\\n token=os.environ[\\"HUGGING_FACE_HUB_TOKEN\\"],\\n )\\n return LLM(MODEL_DIR)\\n\\n\\ndef generate(question, llm, note=None):\\n response = {\'question\': question, \'note\': note}\\n sampling_params = SamplingParams(\\n temperature=1.0,\\n top_p=1,\\n max_tokens=200,\\n )\\n \\n start = time.perf_counter()\\n result = llm.generate(question, sampling_params)\\n request_time = time.perf_counter() - start\\n\\n for output in result:\\n response[\'tok_count\'] = len(output.outputs[0].token_ids)\\n response[\'time\'] = request_time\\n response[\'answer\'] = output.outputs[0].text\\n \\n return response\\n\\nif __name__ == \'__main__\':\\n llm = download_model_to_folder()\\n counter = 1\\n responses = []\\n\\n for q in questions:\\n response = generate(question=q, llm=llm, note=\'vLLM\')\\n if counter >= 2:\\n responses.append(response)\\n counter += 1\\n \\n df = pd.DataFrame(responses)\\n df.to_csv(\'bench-vllm.csv\', index=False)
I deployed an inference endpoint on HuggingFace for meta-llama/Llama-2-7b-hf, on an Nvidia A10G GPU. I didn’t try to turn on any optimizations like quantization and wanted to see what the default performance would be like.
The documentation for these interfaces can be found here. There is also a python client.
\nTheir documentation says they are using TGI under the hood. However, my latency was significantly faster on their hosted inference platform than when using TGI locally. This could be due to the fact that I used an A10G with them but only an A6000 locally. It’s worth looking into this discrepancy further.
The code for this benchmark can be found here.
\\n\\n\\nIt is common to explore the inference vs throughput frontier when conducting inference benchmarks. I did not do this, since I was most interested in latency. Here is an example of how to conduct inference benchmarks that consider both throughput and latency.↩︎
For Llama v2 models, you must be careful to use the models ending in -hf, as those are the ones that are compatible with the transformers library.↩︎
The Modular Inference Engine is another example of an inference server that also applies optimization techniques. At the time of this writing, this is proprietary technology, but it’s worth keeping an eye on it in the future.↩︎
Hamel Husain
\\nMay 30, 2023
\\nA few friends have asked me why I decided not to commercialize nbdev, especially after putting lots of work into the project, including leaving my full-time job to work on it. So I thought I would write a short post to explain my reasoning.
\\nnbdev is an innovative software development framework for Python that embraces literate and exploratory programming. I worked on nbdev from 2020-2023 with Jeremy Howard and, later, Wasim Lorgat. I had the privilege and excitement of exploring the boundaries of developer tools and exploratory programming while working with very talented software engineers. In addition to creating a tool many people enjoyed, I enjoyed using nbdev for personal and professional projects.
\\nWhile conducting product research, I interviewed many developers from different backgrounds to understand their pain points and needs. All developers I talked to struggled with one key challenge: it was difficult, if not impossible, to convince other engineers to use nbdev.
\\nThe following are the biggest reasons that prevented adoption:
\\nI viewed solving the above problems as potential opportunities for commercializing nbdev.
\nJeremy, Wasim, and I eventually settled on the idea of “WordPress for developers,” a hosted site allowing people to create and share nbdev projects. We thought this would be an excellent way to get people to try nbdev without installing anything. The idea was to narrow the audience to people interested in hosting projects on a platform that promoted exploration and sharing, similar to Glitch, but as easy to use and pragmatic as WordPress.
\\nAround the same time we began discussing hosted tools, the machine learning world experienced a tectonic shift due to the explosion of Generative AI, namely Stable Diffusion. fast.ai, the organization that created nbdev, was also changing its focus. fast.ai’s prime directive was to make deep learning accessible to as many people as possible, and generative AI was too important to ignore. Accordingly, Jeremy placed his full attention on a Stable Diffusion course.
\\nThis pivot caused some turbulence as we navigated the different priorities of nbdev, generative AI research, and making money. We eventually settled on offering consulting services for everything related to fast.ai in the form of fast.ai partners, which would allow us to bootstrap ourselves financially and embrace the larger mission of fast.ai (including generative AI and research). Eventually, I found the splintered focus across so many areas to be unproductive1 and decided to step away from everything except consulting to regain my footing.
\\nSoon after that, ChatGPT emerged onto the scene and caused further shifts in machine learning that were orders of magnitude larger than their text-to-image predecessors. Pretty soon, all of my clients were interested in language models, and I found myself working exclusively on operationalizing them (a skill that I have cultivated by working in machine learning for 20+ years). Additionally, LLMs profoundly changed the nature of software development, especially the kind of software development that nbdev was designed to support2. These factors and those discussed earlier suggested it was a good time to step away from nbdev and focus on other things.
\\nI learned some important lessons during this process:
\\nI suspect that I’m not completely finished with nbdev. I may revisit the project or related ideas when the time is right. I’m excited by the work Posit is doing in the areas of literate and exploratory programming, which include many of the ideas explored in nbdev. Wasim has even joined the team at Posit, so I’m excited to see what they come up with.7
\\nRegarding what I’m working on next – I’ll have to save my thoughts on that for another post 😊.
\\n\\n\\nI burned out several times during this process, but I didn’t realize why at the time. Not surprisingly, trying to focus on too many things at once was the root cause.↩︎
See this demo for ideas on how coding with LLMs might look like, especially with notebooks.↩︎
The problem with the hosted solution is that this is not something I would want to use. I can’t picture myself trying to host code on something other than GitHub/GitLab.↩︎
Without shared conviction, there is no glue holding everyone together and people can drift apart.↩︎
I’ll share more about this in a future post.↩︎
I don’t believe this is always the case, but it can be true depending on the dynamics of the group.↩︎
We previously partnered with Posit and JJ Allaire and built nbdev on top of Quarto. I’m currently advising Posit on their product and strategy. They have additional projects on their roadmap that I cannot disclose now.↩︎
Hamel Husain
\\nJanuary 16, 2023
\\nIf you came here looking for the course, feel free to jump ahead to: K8s For Data Scientists.
\nKubernetes, known as K8s, is an open-source system for deploying and managing containerized applications in the cloud. An increasing number of modern web applications are deployed on K8s. If you are an ML engineer, it is increasingly likely that either the infrastructure you use to train, monitor, or orchestrate your models is deployed on K8s, or downstream applications that consume your models are running on K8s. However, K8s is a complex system that can be intimidating to learn.
\\nI agree with Chip Huyen that, in theory, Data Scientists shouldn’t need to learn K8s. However, the truth is: Even though you shouldn’t have to, it’s really beneficial if you do! I’ve found that I’m often constrained by infrastructure and that infrastructure is increasingly hosted on Kubernetes.
\\nFor example, I’m rarely given access to a cloud provider’s console, and instead, I have access to a K8s cluster with some data tools already installed. When something goes awry, it’s beneficial to know enough about K8s to debug the issue. Additionally, familiarity with basic concepts allows me to have more productive conversations with my team about infrastructure.
\\nVicki Boykis seems to agree that the investment in learning this technology is worthwhile1:
\\nBelow, I outline several reasons why learning K8s is a good idea for machine learning engineers2.
\\nLarge cloud providers offer their flavors of ML infrastructure as hosted solutions3. However, there is often a gap between these offerings and the needs of machine learning teams. For example, I’ve seen the following tools deployed alongside or in place of hosted solutions:
\\nWhen open source isn’t enough, third-party vendors are happy to install their software on your cloud. However, you often need basic infrastructure skills to enable this. These skills often intersect with Kubernetes. While you may not be responsible for deploying the infrastructure yourself, it is helpful to understand the basics of how things work so that you can do basic debugging and troubleshooting. For example, knowing where to find logs or an API/HTTPS endpoint can unblock you in many cases.
\\nA typical first experience as a machine learning professional is that you don’t have the necessary tools to get started. This is incredibly frustrating, as making progress without the proper tools can be hard. This experience usually culminates in a conversation like this:
\\nML Eng: I’m excited to join ACME company! You’ve hired me to optimize marketing spending with predictive models. The issue is that we don’t have the basic infrastructure or tools necessary for me to work efficiently.
\\nManager: I’m confused. Can’t you install the tools you need? Isn’t that what you are for? I was expecting that you would figure it out.
\\nML Eng: No, I don’t know how to set up and deploy infrastructure. We need a special infrastructure or DevOps person for that.
\\nManager: It will be hard to ask for more resources if we don’t know the expected return on investment. Can you do the ML project first, demonstrate some value, and then we can invest in infrastructure?
\\nML Eng: I need some minimum tools to experiment more quickly and develop a proof of concept. Also, I need tools that might help me collaborate better with my team…
\\nMy experience is that DevOps teams are chronically understaffed and overworked. While it usually isn’t advisable to deploy enterprise software yourself on Kubernetes for security concerns, having basic skills can lift a tremendous burden off your DevOps counterparts and make it tractable for them to help you.
\\nK8s are not a panacea for all infrastructure problems. You must operate within the constraints of your organization and existing software stack.4 However, with its growing popularity, it is increasingly likely that learning this technology will help you.
\\nOne of the best ways to set yourself apart as a data scientist is through your skills. Traditional education often emphasizes learning the latest ML techniques. However, cutting-edge ML research is very competitive. It’s also an extremely crowded space.
\\nIn my experience, the bottleneck many teams face is not a lack of knowledge of cutting-edge ML techniques but software engineering skills and partners to help operationalize models. If you take some time to learn how to stand up tools and infrastructure, you will be invaluable to your team.
\\nMore importantly, deploying and integrating models into services and applications is critical to connecting ML to business problems. Learning K8s will help you do this.
\\nJust as Python is the lingua franca of data science, K8s is becoming the lingua franca of cloud infrastructure. According to a 2021 Survey by CNCF, 96% of organizations are either using or evaluating Kubernetes. Furthermore, Stack Overflow’s 2022 Developer Survey shows that Docker and Kubernetes are the number one and two most loved and wanted tools, respectively. This is a strong indicator that K8s are here to stay.
\\nBasic proficiency with K8s will drastically increase your chances of garnering support for your desired tools in many organizations. Proficiency with K8s increases the likelihood that:
\\nThese factors make it much more likely that you will get the tools that meet you where you are as opposed to something a software engineer without any data science experience thinks is a good idea (which I’ve seen happen a lot!).
\\nFor simple apps that you want to stand up quickly or prototype, K8s is overkill. Instead, I’m advocating knowledge of K8s as useful when working within the environments found in many companies. For example, hosting your data product on a single VM is often insufficient if you want to deploy production software. Many companies even have infrastructure that may block you from doing this with paved paths that only include Kubernetes.
\\nEven if you are not deploying any production software, K8s can be invaluable in allowing you to deploy the tools you need. In many cases using K8s can make tasks easier. Enterprises have necessarily invested resources in creating guardrails to control costs and security. Those guardrails are increasingly built around K8s patterns6. Understanding these concepts can make operating within the confines of your company’s cloud stack easier.
\\nK8s are complicated, but you don’t need to become an expert to unlock great value as a Data Scientist. I’m not suggesting that data scientists become K8s administrators. K8s Administration is a very involved task and worthy of its own role. Unfortunately, nearly all educational material around K8s is focused on being an administrator, which is overkill for what most data scientists need.
\\nI haven’t yet found a good resource for people like data scientists to learn Kubernetes without wading through lots of irrelevant material geared towards administrators. So my colleagues and I are considering creating a free course with data scientists in mind. If this sounds interesting, you can sign up here.
\\n\\n\\nVicki is not someone who is impressed by flashy or new technologies and is someone who takes a pragmatic approach to get the job done. When she says you should learn K8s, you should pay attention!↩︎
Each subsection of this article has a picture that has been generated by Stable Diffusion with a prompt that is very similar to the image caption.↩︎
These systems are AWS - Sagemaker, Azure - AzureML and GCP - VertexAI.↩︎
Some organizations have built solutions that avoid K8s. For example, BigHat uses a solution based on AWS SageMaker + Lambda and other hosted solutions. So it might be a mistake to try to move over to K8s in that example – you should try to leverage your company’s existing tech stack where possible!↩︎
My friend Michał Jastrzębski, who specializes in ML infrastructure, has shared the following colorful anecdote with me: “when I hear Data Scientists shouldn’t learn K8s”, I hear “DevOps needs to learn Airflow”.↩︎
Specifically, K8s concepts that are relevant are namespaces, labels and RBAC.↩︎
An exploration of threads, processes, and coroutines in Python, with interesting examples that illuminate the differences between each.
As a data scientist who is spending more time on software engineering, I was recently forced to confront an ugly gap in my knowledge of Python: concurrency. To be honest, I never completely understood how the terms async, threads, pools and coroutines were different and how these mechanisms could work together. Every time I tried to learn about the subject, the examples were a bit too abstract for me, and I had a hard time internalizing how everything worked.
\\nThis changed when a friend of mine2 recommended a live coding talk by David Beazley, an accomplished Python educator.
\\nBecause of restrictions with this YouTube video, I couldn’t embed the video in this article, so you will have to open it in a different window.
\\nThis talk is incredibly intimidating at first. Not only is it coded live from scratch, but it also jumps immediately into socket programming, something that I had never encountered as a data scientist. However, if you go through it slowly and understand all the components (as we do in this blog post) it turns out to be the best educational material on Python concurrency I have ever encountered. This blog post documents what I learned along the way so others can benefit, too.
\\nBefore getting started, David sets up the following infrastructure that is used to demonstrate concurrency.
\\nTo demonstrate concurrency, it is useful to create a task that can saturate your CPU (such as mathematical operations) for a noticeable period of time. David uses a function that computes a Fibonacci number.
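A minimal sketch of such a function, the classic deliberately inefficient recursive version:

def fib(n):
    # Naive recursion: runtime grows very quickly with n, which makes this a
    # convenient stand-in for CPU-bound work.
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)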
This function takes much longer for large inputs versus smaller inputs3, which allows us to profile different workloads.
\nA web server is one of the best ways to illustrate different types of concurrency. However, to really demonstrate how things work, it is useful to use something that is sufficiently low level to see how all the pieces fit together. For this, David sets up a web server using socket programming. If you aren’t familiar with socket programming, I’ll explain the important bits below, but feel free to dive deeper with this tutorial later if you like.
\nTo begin with, David starts with the below code:
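The sketch below captures the structure of the server David builds: a blocking socket server that reads a number from a client and writes back the corresponding Fibonacci value. The port (25000) and the names are my assumptions:

from socket import socket, AF_INET, SOCK_STREAM, SOL_SOCKET, SO_REUSEADDR

def fib(n):
    # same naive Fibonacci as above
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

def fib_handler(client):
    # Serve one client: read a number, send back fib(number), repeat.
    # This loops for as long as the client stays connected.
    while True:
        req = client.recv(100)              # blocks until the client sends data
        if not req:
            break
        result = fib(int(req))
        client.send(str(result).encode('ascii') + b'\n')
    print('Connection closed')

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()        # blocks until a client connects
        print('Connection', addr)
        fib_handler(client)                 # does not return while the client is connected

fib_server(('', 25000))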
Here is an explanation of this code:
\nIn the above example, the server will only be able to accept a connection from a single client, because the call to fib_handler will not return while that client is connected (it runs in an infinite loop serving it). This means that sock.accept() can only be called once.
You can test this out by first running the server:
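Assuming the server sketch above is saved as server-1.py (a filename I am assuming; the post later uses server-3.py for the process-pool version), starting it looks like:

python server-1.py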
Then establish a client:
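Any plain TCP client will do; netcat pointed at the port from the sketch above is the simplest option:

nc localhost 25000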
You can type numbers in, as David does in his video, and verify that Fibonacci numbers are returned. However, if you try to connect with another client at the same time from a different terminal session (using the same client command as above), you will notice that the second client just hangs and doesn’t return anything from the server. This is because the server is only able to accept a single connection. Next, we explore how to tackle this issue.
\nWe can solve this issue with threads. You can add threads to the handler so that more connections can be accepted; the key change is to hand each accepted connection off to its own thread:
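A sketch of that change, reusing the imports, fib, and fib_handler from the previous sketch; the only difference is that each accepted connection is handed to its own thread:

from threading import Thread

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()
        print('Connection', addr)
        # Hand the client off to a thread so the loop can accept more connections
        Thread(target=fib_handler, args=(client,), daemon=True).start()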
You can verify that this works by connecting two separate clients to the server, running the client command shown earlier in two separate terminal windows.
By executing the fib_handler in a thread, the main while loop in fib_server will continue, allowing sock.accept() to receive additional clients. If you haven’t encountered threads before, this tutorial provides a good introduction to the topic.
When code stops execution and waits for an external event to occur (like a connection to be made, or data to be sent), this is often referred to as blocking.
\nOne important utility of threads is that they allow blocking tasks to release control of the CPU when the CPU is not being used. However, the Python interpreter can only run on one thread at a time due to the Global Interpreter Lock. Because Python can only run a single thread at any given time, any CPU-bound work in threads must take turns running one after the other.
\\nTherefore, you have to think carefully about what kind of tasks you execute in threads with Python. If you try to execute CPU bound tasks, these tasks will slow each other down. David demonstrates this with the below script that sends requests to our threaded server:
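The script is perf1.py in David's naming (it is referenced by that name later in the post); the idea is to repeatedly request an expensive fib(30) and print how long each request takes. A sketch:

# perf1.py (sketch): time repeated requests for an expensive fib(30)
from socket import socket, AF_INET, SOCK_STREAM
import time

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('localhost', 25000))

while True:
    start = time.time()
    sock.send(b'30')
    resp = sock.recv(100)
    print('fib(30) took', time.time() - start, 'seconds')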
If you run several instances of this script in parallel (after starting the server first), you will see the execution times for each script increase roughly linearly as you increase the number of scripts running. For this particular task, adding threads does not make anything faster. But why? This is because the Fibonacci task is CPU bound, so the threads compete with each other for resources.
\nPython threads work by interleaving the execution of different tasks on your CPU.4 Only one thread runs at a time, and the threads take turns executing in small bits until all of them are done. The details of how thread execution is interleaved are handled by the GIL and your operating system, so you need not worry about this detail (with one exception mentioned below). Interleaving a bunch of CPU-bound tasks will not speed up the total runtime of those tasks. However, if your tasks involve lots of non-CPU time, such as waiting for network connections or disk I/O, threading may result in a considerable speedup. A canonical way of simulating a non-CPU-bound task in Python is to use the built-in function time.sleep().
To check my understanding about threads and performance, I ran the below experiment5 and changed time.sleep(2)
to fib(20)
and back again:
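A sketch of the shape of that experiment (the thread count and exact structure are my choices): start several threads running the same task and time how long it takes for all of them to finish.

import threading
import time

def fib(n):
    return 1 if n <= 2 else fib(n - 1) + fib(n - 2)

def task():
    time.sleep(2)          # swap this line for fib(20) to make the task CPU bound

n_threads = 8
start = time.perf_counter()
threads = [threading.Thread(target=task) for _ in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f'{n_threads} threads finished in {time.perf_counter() - start:.2f}s')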
As expected, increasing the number of threads while running time.sleep(2) did not increase the program’s overall execution time (the program runs in roughly 2 seconds). On the other hand, replacing time.sleep(2) with fib(20) causes this program’s running time to increase as more threads are added. This is because fib(20) is a CPU-bound task, so interleaving the task doesn’t really help much. You should try running the same thing to see for yourself.
\\n\\nYou will often hear that Python is not good at parallelism and that you can only run on one CPU core at a time. This is likely referring to the aforementioned issues with threads and the GIL. Because you are limited to one thread, this means that thread-based tasks can only use one CPU core at a time (a single thread cannot run across multiple CPUs). Outside of Python, threads are a popular choice for parallelizing CPU-bound tasks because you are able to run a separate thread per CPU core simultaneously. However, with Python you must look for other ways to accomplish parallelism for cpu-bound tasks.
\\n
Another interesting but less known aspect that David discusses is the relation between the following two types of tasks:
- long-running, CPU-bound tasks, such as fib(30), demonstrated with perf1.py.
- short-running tasks, such as fib(1), demonstrated with perf2.py.
The Python GIL will prioritize the first type of task at the expense of the second if they are made to compete for resources in threads. You can optionally follow along with a demonstration of this here. This is interesting because it is the opposite of how typical operating systems prioritize threads (by favoring shorter running tasks) and is something unique to the implementation of the Python GIL. More importantly, this behavior has a very practical consequence: if you are running a web server where most tasks are fairly quick, an expensive CPU-bound task can grind everything to a halt.
\\nIt is tempting to think of Python threads as a tool to make things run faster, but that’s not the only use case. Recall that the socket server used threads to allow multiple connections at once without any speedup. David illustrates another way to use threads with his code used to measure the runtime of short-running tasks:
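This script is perf2.py in David's naming: it fires cheap fib(1) requests at the server as fast as possible, while a separate monitor thread reports the request rate once per second. A sketch (variable names are mine):

# perf2.py (sketch): measure how many cheap requests complete per second
from socket import socket, AF_INET, SOCK_STREAM
from threading import Thread
import time

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('localhost', 25000))

n = 0  # requests completed during the current second

def monitor():
    global n
    while True:
        time.sleep(1)          # blocking: this thread only wakes once per second
        print(n, 'requests/sec')
        n = 0

Thread(target=monitor, daemon=True).start()

while True:
    sock.send(b'1')
    resp = sock.recv(100)
    n += 1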
In this case, David uses a single thread with a blocking call to sleep(1) to make sure that monitor only prints once per second, while allowing the rest of the program to send requests hundreds of times per second. In other words, this is a clever use of threads and blocking that allows part of a program to run at a desired time interval while allowing the rest of the program to run as usual. 6
These different angles of looking at threads allowed me to understand threads more holistically. Threads are not only about making certain things run faster or run in parallel, but also allows you to control how your program is executed.
\\nA thread is always contained in a process, and each process contains one or more threads. Threads in the same process can share memory which means they can easily communicate and write to common data structures. Threads are useful in the following two scenarios:
\\nA process can span across multiple CPU cores, however a single thread can only utilize one CPU core.
\\nGenerally speaking, only one thread can run cpu-bound tasks on a single core at any given time. If multiple threads are sharing a CPU core, your operating system will interleave these threads. There are some exceptions to this rule. For example single CPU cores are able to run multiple threads concurrently by using things like SMT/hyper-threading or compute over data in parallel using SIMD, which is popular in scientific computing libraries.
\\nOn the other hand, processes offer isolation which is helpful when you have different users or different programs that should not be sharing information. Since we cannot run more than a single thread at a time in Python, a common workaround is to spawn several Python processes. This is discussed more below.
\\nChapter 2 of This book discusses what processes and threads are in greater detail from an operating system perspective.
\\nOne way to solve the problem with the GIL and cpu-bound tasks competing for resources is to use processes instead of threads. Processes are different from threads in the following respects:
\nDavid uses Python processes in his server example by using a process pool.7 The relevant change is sketched below:
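A sketch of what that change might look like, using concurrent.futures and reusing fib and the server from the earlier sketches (David's actual listing may differ slightly): the handler submits the expensive fib call to a pool of worker processes instead of computing it in the server process.

from concurrent.futures import ProcessPoolExecutor

pool = ProcessPoolExecutor(4)   # four worker processes

def fib_handler(client):
    while True:
        req = client.recv(100)
        if not req:
            break
        future = pool.submit(fib, int(req))   # run fib in a separate process
        result = future.result()              # block until the worker finishes
        client.send(str(result).encode('ascii') + b'\n')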
If you then start this version of the server with:
\\n\\n\\npython server-3.py
\\n
And run the profiler perf2.py, we can make the following observations:
- The expensive, CPU-bound requests no longer starve the short-running requests measured by perf2.py, as they do not compete for resources on the same CPU.
- If the work is not CPU bound, for example time.sleep(), using processes instead of threads would actually be detrimental to overall performance. A concrete example of this is provided in the section below.
This is a realistic example that allows you to gain more intuition about how threads and processes work. This tutorial contains more examples of Python processes and threads.
\\nI’ve found many data scientists (formerly including myself) blindly apply processes and completely ignore threads. I understand why - processes are a kind of least common denominator where you can achieve some kind of parallelism regardless of if your task is CPU bound or not. However, I’ve found that this approach is very suboptimal and prevents full utilization of compute sources. Some examples to clarify where threads or processes might be more appropriate:
\\nIf you are downloading lots of files from the internet, consider using threads. This is because most of your time is spent on network I/O, not on the CPU. For example, this article demonstrates a 50% speedup when using threads compared to processes for downloading files.
\\nIf you are transforming or cleaning a large dataset, this work is mostly CPU bound so using processes makes sense. The only part of this that isn’t CPU-bound is reading and writing the data to disk.
\\nIf you just want to load a bunch of files into memory or write a bunch of files to disk, without really doing any transformations, consider using threads as the work is mostly disk I/O and not CPU bound.
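To make the contrast concrete, here is a minimal sketch (the URLs and the transform are placeholders): threads for the I/O-bound downloads, processes for the CPU-bound transform.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import urllib.request

urls = ['https://example.com/a.csv', 'https://example.com/b.csv']  # placeholder URLs

def download(url):
    # I/O bound: most time is spent waiting on the network, so threads work well
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def transform(chunk):
    # CPU bound: pure computation, so processes avoid GIL contention
    return sum(b % 7 for b in chunk)

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as tp:
        raw = list(tp.map(download, urls))
    with ProcessPoolExecutor() as pp:
        results = list(pp.map(transform, raw))
    print(results)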
\\nKeep in mind that threads can be more memory efficient than processes because of differences in the way they work. So using lots of processes when you don’t need them can lead to memory bloat.
\nMost importantly, try to avoid having to think about processes and threads where you can; use scientific computing libraries like numpy and write vectorized operations wherever possible. It is always worth being aware of the concurrency tools available in the library or framework you are using (especially numerical computing and other data science libraries) and consider using them when appropriate.
\nRecall that Python can only operate on one thread at a time, and the operating system automatically decides when to interrupt each thread to allow the threads to take turns running. This is called pre-emptive multitasking, since the operating system, not you, determines when your thread makes the switch. When you don’t care about how tasks are interleaved, threads are great because you don’t have to worry about how they are scheduled.
\nHowever, there is a third type of concurrency paradigm in Python that allows you to control how this switching occurs: Asynchronous Programming. This is also called cooperative multitasking, which means each task must announce when it wants to switch. One way to achieve cooperative multitasking is to create a coroutine.
\nOne way to create coroutines in Python is by using the yield statement. David provides some intuition on how you can achieve multi-tasking with yield in the following code:
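A sketch in the spirit of what David shows: each countdown is a generator, and a tiny round-robin loop advances each “task” one step at a time (the deque here mirrors the data structure mentioned at the end of this post):

from collections import deque

def countdown(n):
    while n > 0:
        yield n               # pause here and hand control back to the loop
        n -= 1

# Three independent countdown "tasks", scheduled round-robin
tasks = deque([countdown(3), countdown(3), countdown(3)])
while tasks:
    task = tasks.popleft()
    try:
        print(next(task))     # run the task up to its next yield
        tasks.append(task)    # not finished: put it back in the queue
    except StopIteration:
        pass                  # task finished: drop it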
When you run this code, you can see from the output that the three countdown tasks are being interleaved:
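With the sketch above, each task prints one number before the next gets a turn, so the output looks roughly like:

3
3
3
2
2
2
1
1
1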
This clever use of yield allows you to pause execution of a task and move on to a different task, kind of like threading, except that you, not the operating system, are controlling how compute is interleaved. This is the key intuition for understanding the rest of the talk, which goes on to push this example further.
One of the most popular ways to accomplish async programming is by using the various utilities in the built-in asyncio module, which uses similar yield statements under the hood. I didn’t end up diving deeply into the asyncio module or this particular flavor of programming as my goal was to understand the concept so that I wouldn’t be lost when encountering this in the wild.
\\nThere is no silver bullet with regards to choosing the correct type of concurrency in Python. You have to consider how much of your task is CPU bound vs non-CPU bound (and if it is feasible to break up the task appropriately) to determine whether tweaking your code will make a material difference.
\\nMost importantly, I recommend only reaching for these tools when you need them rather than trying to prematurely optimize your code. Always start with the simplest code, without any concurrency, and build incrementally from there. If you do add concurrency, make sure you can justify it through a measurable difference in performance or functionality. I’ve sometimes found that my code was slow in places I didn’t expect and that concurrency wasn’t the tool I needed at all!
\\nProfiling your code is beyond the scope of this blog post, however I hope this post demystified the confusing jungle of terminology of python concurrency so that you can more quickly navigate these topics in the future.
\\nNot all programs that run in Python using threads are limited to a single CPU. It is possible to escape the constraints of the GIL by carefully writing C code that has a Python interface. This is what popular scientific computing libraries such as NumPy and SciPy do to achieve parallelism.
\nIn David’s code, deque from the collections module was introduced, which is a very handy data structure not only for async programming but also for threads, because it is thread-safe, which means that you don’t have to worry about race conditions. Similarly, the queue module provides other types of thread-safe queues.
Furthermore, one of my favorite python libraries, fastcore, contains a module called parallel which makes using threads and processes easy for many use cases.
\nThe following is terminology associated with Python concurrency that is often confused and that we didn’t touch on in this blog post:
\\nThanks to Jeremy Howard, Dan Becker, and Zach Mueller for reviewing this post.
\\nThat friend is Jeremy Howard. He kept recommending the talk to me anytime the topic of concurrency came up. I eventually caved and decided to really focus on this talk. ↩︎
\nPython threads are idiosyncratic because of the Global Interpreter Lock (GIL), which prevents multiple threads from executing Python code at once. It is important not to confuse the behavior of Python threads with threads generally. ↩︎
\\nCode is originally from this tutorial on threads. ↩︎
\nIf the monitor task took any meaningful CPU time then the rest of the program would not run as “usual” because it might be competing for resources. But that is not the case here. ↩︎
One of the most popular ways of using process pools is with the built-in multiprocessing module. ↩︎
\\nAlternate operating systems book recommendations are from Kenneth Chiu from this and this tweet in response to this blog post. ↩︎
\\n