Scaling human feedback with fine-tuned open-source LLMs
Discover how Llama and Mistral can improve accuracy at scale
In our previous article, we demonstrated that a fine-tuned GPT-3.5 outperformed GPT-4 at generating feedback for article summarization. However, we found that many companies need to use open-source LLMs because of data privacy, model ownership, cost, latency, or SLA concerns. An open question is whether open-source LLMs such as Llama-2 and Mistral can be used for feedback generation, and how their accuracy compares with fine-tuned GPT-3.5. Here, we test the capabilities of these fine-tuned open-source LLMs and share our findings.
Setup
We used Axolotl and Modal for our fine-tuning tasks. Axolotl is an open-source tool designed for fine-tuning LLMs; it offers significant flexibility to customize fine-tuning jobs with various models, training data formats, and techniques such as LoRA, QLoRA, and full fine-tuning. Modal simplifies multi-GPU deployment and streamlines the fine-tuning integration; they provide examples and deployment-ready code. We found that the combination of Axolotl and Modal provides flexibility, efficiency, and ease of use; definitely give axolotl + Modal a try 🚀!
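To give a sense of the shape of this setup, here is a minimal sketch (not our actual code) of wrapping an Axolotl training run in a Modal function. The app name, image contents, GPU type, and config path are illustrative placeholders, and the Axolotl config file would need to be baked into the image or mounted.

```python
# Hypothetical sketch: launching an Axolotl fine-tuning job from Modal.
# Package list, GPU choice, and config path are placeholders, not our exact setup.
import subprocess
import modal

app = modal.App("axolotl-finetune")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "accelerate", "axolotl")  # Axolotl can also be installed from its GitHub repo
)

@app.function(gpu="A100", image=image, timeout=6 * 60 * 60)
def train(config_path: str):
    # Axolotl exposes its trainer as a CLI module; accelerate handles multi-GPU launch.
    subprocess.run(
        ["accelerate", "launch", "-m", "axolotl.cli.train", config_path],
        check=True,
    )

@app.local_entrypoint()
def main():
    train.remote("config/llama2-70b-qlora.yml")  # placeholder config path
```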
We also evaluated fine-tuning endpoint providers like Together and MosaicML. While their APIs are user-friendly and make it easy to start a fine-tuning job, they did not have Llama2-70B-chat 🦙🦙 available at the time. Additionally, the inference process may not align with your production stack: for instance, when we ran our experiments, MosaicML required users to serve the fine-tuned model on the Databricks serving platform.
Results
We used the tldraxis2 dataset (5,529 entries) and prepared it in Alpaca format (e.g. {"instruction": "...", "output": "..."}) for fine-tuning. The instruction includes the system prompt and one example; the output is the corresponding feedback along with evaluation scores on four axes (accuracy, coherence, coverage, and overall), each in the range 1 to 7.
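For concreteness, here is a minimal sketch of converting one record into the Alpaca-style entry described above. The raw field names (article, summary, feedback, scores) and the system prompt text are hypothetical placeholders, not the actual tldraxis2 schema.

```python
import json

# Hypothetical raw record; field names are illustrative only.
raw = {
    "article": "…",
    "summary": "…",
    "feedback": "The summary omits the key finding …",
    "scores": {"accuracy": 5, "coherence": 6, "coverage": 4, "overall": 5},
}

SYSTEM_PROMPT = "You grade article summaries on four axes, each scored 1 to 7 …"  # placeholder

# Alpaca-style entry: instruction = system prompt + one example,
# output = feedback plus the four axis scores, serialized as JSON.
entry = {
    "instruction": f"{SYSTEM_PROMPT}\n\nArticle:\n{raw['article']}\n\nSummary:\n{raw['summary']}",
    "output": json.dumps({"feedback": raw["feedback"], **raw["scores"]}),
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")
```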
We fine-tuned two models: 1) Mistral-7B-Instruct-v0.2 with LoRA in fp16 and 2) Llama-2-70B-chat with QLoRA, loading the base model in 4-bit. You can find the configuration files here.
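The two setups differ mainly in how the base model is loaded before the adapters are attached. As a rough sketch in Hugging Face transformers/peft terms (our actual runs used Axolotl configs; the hyperparameters below are illustrative, not the values we used):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyperparameters; the real values live in the Axolotl config files.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q_proj", "v_proj"])

# 1) LoRA in fp16: load the base model in half precision, then attach trainable adapters.
mistral = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.float16
)
mistral = get_peft_model(mistral, lora_cfg)

# 2) QLoRA: load the base model quantized to 4-bit and train the same adapters on top.
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
llama = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf", quantization_config=bnb_cfg
)
llama = get_peft_model(llama, lora_cfg)
```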
Accuracy evaluation shows that both fine-tuned models perform better than GPT-4 and are comparable with fine-tuned GPT-3.5. Notably, fine-tuning significantly improved the structure of the output: a JSON-compliant format without unrelated text. We rarely saw any additional text from fine-tuned Llama-2-70B-chat, while still observing ~20% non-JSON output in our test cases using the Mistral 7B instruct model after fine-tuning¹.
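The non-JSON rate here is simply the fraction of test completions that fail to parse as JSON. A check along these lines captures the idea (a minimal sketch, not our exact evaluation harness):

```python
import json

def non_json_rate(completions: list[str]) -> float:
    """Fraction of model completions that are not valid JSON."""
    failures = 0
    for text in completions:
        try:
            json.loads(text.strip())
        except json.JSONDecodeError:
            failures += 1
    return failures / len(completions)
```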
Discussion
Overall, our experiments demonstrate that fine-tuned open-source models can serve as a viable alternative to fine-tuned GPT-3.5 and base GPT-4 for scaling human feedback. In upcoming work, we aim to compare other dimensions beyond accuracy, such as cost and the impact of model size. We see further room for improving accuracy via model-specific prompt engineering², optimized data selection, more advanced open-source models (e.g. dolphin-mixtral), and synthetic data.
If you want to start fine-tuning or have been running into issues with your fine-tuning tasks, especially for training custom evaluation models or scaling human feedback, reach out to us at ai@log10.io or sign up for Log10 here. We'd love to hear from you!
¹ We have optimized prompts for the Llama-2 model and intend to explore prompts for Mistral models, such as Dolphin-mixtral, as part of our future work.
² We view fine-tuning as a step following prompt engineering, instead of a standalone process. The prompt is included as part of the training data, affecting the quality of the output. Before diving into fine-tuning with a new model, it's a good idea to first check if the prompt works well with the model.