AutoBench

Organization Details

Organization Description

AutoBench is an organization dedicated to advancing the evaluation of Large Language Models (LLMs) through innovative, automated benchmarking systems. Our flagship project is the AutoBench 1.0 benchmark, a novel system built on a "Collective-LLM-as-a-Judge" approach, in which LLMs themselves assess the quality of both the questions and the answers generated by other LLMs. AutoBench aims to address the limitations of traditional, static benchmarks by providing an evaluation framework that is dynamic, scalable, cost-effective, and less prone to human bias.

Benchmarking System: AutoBench 1.0

Overview

AutoBench 1.0 is a fully automated, iterative benchmark system for evaluating LLMs. It dynamically generates questions, assesses their quality, and ranks LLM-generated answers using a collective of LLMs as judges.

Key Features

The system is designed to be dynamic, scalable, and cost-effective, and to be less susceptible to human bias than traditional static benchmarks.
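
For concreteness, here is a minimal Python sketch of what one iteration of such a Collective-LLM-as-a-Judge loop could look like. Everything in it is a hypothetical stand-in: the model names, the `query_model` helper, the 1-to-5 grading prompt, the quality threshold, and the plain averaging of judge scores are illustrative assumptions, not details of the actual AutoBench 1.0 implementation.

```python
import statistics

# Hypothetical model identifiers, used purely for illustration.
JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]
CONTENDER_MODELS = ["contender-model-x", "contender-model-y"]


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a call to an inference API (OpenAI, Hugging Face, etc.).
    Returns a canned reply so the sketch runs end to end."""
    return "4" if "from 1" in prompt else f"Stub answer from {model_name}."


def collective_score(judged_text: str, instruction: str) -> float:
    """Average the grades that the judge collective assigns to a piece of text."""
    prompt = (
        f"{instruction}\n\n{judged_text}\n\n"
        "Reply with a single number from 1 (poor) to 5 (excellent)."
    )
    grades = []
    for judge in JUDGE_MODELS:
        reply = query_model(judge, prompt).strip()
        try:
            grades.append(float(reply))
        except ValueError:
            continue  # ignore judges that do not return a parseable grade
    return statistics.mean(grades) if grades else float("nan")


def run_iteration(topic: str, question_quality_threshold: float = 3.5) -> list[tuple[str, float]]:
    """One benchmark iteration: generate a question, check its quality,
    collect answers from the contender models, and rank them by collective score."""
    # 1. A generator model proposes a question on the topic
    #    (here reusing the first judge purely for brevity).
    question = query_model(JUDGE_MODELS[0], f"Write one challenging question about {topic}.")

    # 2. The judge collective rates the question itself; weak questions are discarded.
    if collective_score(question, "Rate the quality of this benchmark question") < question_quality_threshold:
        return []

    # 3. Each contender answers, and the collective grades every answer.
    ranked = []
    for model in CONTENDER_MODELS:
        answer = query_model(model, question)
        ranked.append((model, collective_score(answer, f"Rate this answer to the question: {question}")))

    # 4. Rank contenders by their averaged judge score.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(run_iteration("linear algebra"))
```

In the full benchmark, iterations like this are repeated across a variety of tasks and topics, and the resulting scores are aggregated into overall model rankings.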

Intended Use

AutoBench 1.0 is intended for evaluating and comparing LLMs: it provides a standardized, automated, and cost-effective way to assess their performance across a variety of tasks and topics.

Ethical Considerations

AutoBench is committed to the responsible development and use of LLMs. We encourage users of the benchmark to consider the potential ethical implications of their work and to use the benchmark results responsibly. The limitations and biases of AutoBench 1.0 should be carefully considered when interpreting the results.

Inference Cost Support

Running a compute-intensive benchmark like AutoBench can be expensive. We welcome support from inference API providers in the form of free inference credits.

Citation

If you use AutoBench 1.0 in your research, please cite:

@misc{autobench2024,
  title        = {AutoBench 1.0: A Collective-LLM-as-a-Judge Benchmark System},
  author       = {AutoBench},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AutoBench}},
  note         = {Accessed: [Date Accessed]}
}

Learn More and Contribute