About
Petals brings BitTorrent-inspired distributed computing to large language model inference. Instead of requiring a single machine with hundreds of gigabytes of VRAM, Petals splits a model across a global network of volunteer GPU contributors. Each participant loads a shard of the model, and requests are routed through the network to complete a full forward pass, enabling single-batch inference at up to 6 tokens/sec for Llama 2 (70B) and up to 4 tokens/sec for Falcon (180B), fast enough for chatbots and interactive applications. Supported models include Llama 3.1 (up to 405B parameters), Mixtral (8x22B), Falcon (40B and 180B), and BLOOM (176B).

Unlike standard LLM APIs, Petals exposes the full flexibility of PyTorch and Hugging Face Transformers: you can apply custom fine-tuning methods, define custom sampling strategies, implement non-standard execution paths through the model, and inspect hidden states at any layer. Petals is ideal for researchers, ML engineers, and developers who want to experiment with frontier open-source models without access to a high-end GPU cluster. It is part of the BigScience research workshop and is actively developed on GitHub with a community Discord. Anyone with a spare GPU can also contribute compute to the network, helping sustain the shared infrastructure.
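As a concrete picture of the workflow, the sketch below follows the pattern shown in the Petals README: swap the standard Transformers model class for Petals' distributed equivalent and generate as usual. It requires `pip install petals` and an active public swarm serving the chosen model, so treat it as illustrative rather than guaranteed to run as-is.

```python
# Hedged sketch following the Petals README pattern; needs an active
# swarm and network access, so it is illustrative, not self-contained.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # any model the swarm serves
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Embeddings and the LM head run locally; the transformer blocks are
# executed remotely by volunteer peers, each holding a shard.
inputs = tokenizer("A quick test:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```

Because the model object is a regular PyTorch module, the usual Transformers generation parameters and custom sampling loops apply unchanged.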
Key Features
- BitTorrent-Style Distributed Inference: Each participant loads a shard of a large model; requests are routed through the network so the full model is served collectively without any single machine needing to hold it all.
- Massive Model Support: Supports frontier open-source models including Llama 3.1 (up to 405B), Mixtral (8x22B), Falcon (40B and 180B), and BLOOM (176B).
- Fine-Tuning & Custom Execution: Beyond simple generation, Petals supports fine-tuning, custom sampling methods, non-standard forward pass paths, and access to intermediate hidden states.
- PyTorch & Hugging Face Integration: Fully compatible with PyTorch and 🤗 Transformers, giving developers the flexibility of native ML frameworks alongside the convenience of a hosted API.
- Consumer Hardware & Colab Compatible: Runs on standard consumer-grade GPUs or free Google Colab instances, democratizing access to models that would otherwise require expensive server clusters.
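To make the BitTorrent-style routing above concrete, here is a toy, pure-Python simulation. It is not the real Petals protocol (which handles peer discovery, failures, and real transformer blocks over the network); it only illustrates the core idea that each peer serves a contiguous slice of layers and a client chains peers until the whole model has run.

```python
# Toy illustration (not the real Petals protocol): peers each hold a
# contiguous slice of a model's layers; a client routes activations
# through a chain of peers covering all layers in order.

def make_layer(weight):
    # Stand-in for a transformer block: a simple affine map on a scalar.
    return lambda x: x * weight + 1

class Peer:
    def __init__(self, start, end, layers):
        self.start, self.end = start, end     # layer range this peer serves
        self.layers = layers[start:end]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

def route(peers, num_layers, x):
    """Pick peers until every layer index is covered, in order."""
    pos = 0
    while pos < num_layers:
        peer = next(p for p in peers if p.start == pos)
        x = peer.forward(x)
        pos = peer.end
    return x

layers = [make_layer(w) for w in (2, 3, 1, 2)]    # a 4-layer "model"
peers = [Peer(0, 2, layers), Peer(2, 4, layers)]  # two volunteer peers
full = 1.0
for layer in layers:                              # local reference run
    full = layer(full)
print(route(peers, 4, 1.0) == full)  # True: distributed == local
```

No single peer holds all four layers, yet the routed result matches the local forward pass, which is the property the real network provides at transformer scale.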
Use Cases
- Researchers experimenting with frontier open-source LLMs like Llama 3.1 (405B) or BLOOM without access to a high-end GPU cluster
- ML engineers fine-tuning large language models using consumer hardware by leveraging distributed peer-to-peer compute
- Developers building chatbots or interactive AI applications powered by open-source models at a fraction of the usual infrastructure cost
- Academic teams running custom inference experiments that require access to hidden states or non-standard execution paths through large models
- GPU owners contributing spare compute to the Petals network to support open-source AI research globally
Pros
- Access to Massive Models Without Expensive Hardware: Enables running 100B+ parameter models using a single consumer GPU by distributing the workload across the network.
- Full Research Flexibility: Unlike black-box APIs, Petals exposes hidden states, custom paths, and fine-tuning hooks, making it ideal for ML research.
- Open Source & Community-Driven: Fully open source, part of the BigScience workshop, and sustained by a global community of GPU contributors.
Cons
- Network-Dependent Availability: Inference speed and reliability depend on the number and quality of active GPU contributors in the network at any given time.
- Slower Than Dedicated Clusters: At up to 6 tokens/sec for Llama 2 70B, throughput is suitable for interactive use but not for high-volume production workloads.
- Privacy Considerations: Prompts and activations pass through third-party volunteer nodes, which may be a concern for sensitive or proprietary data.
Frequently Asked Questions
What is Petals?
Petals is an open-source system that lets users run large language models collaboratively over a peer-to-peer network, similar to BitTorrent. Each participant serves a portion of the model, enabling collective inference without any single machine needing the full model in memory.
Which models does Petals support?
Petals currently supports Llama 3.1 (up to 405B parameters), Mixtral (8x22B), Falcon (40B and 180B), and BLOOM (176B), with community-maintained support for additional models.
What hardware do I need to participate?
You can participate with a consumer-grade GPU (e.g., an RTX 3080/4090) or even a free Google Colab instance. The more VRAM you have, the larger the model shard you can host.
Can I fine-tune models through Petals?
Yes. Petals supports fine-tuning by allowing gradient flow through the distributed model layers. You can attach trainable adapter layers (e.g., LoRA) on the client side while the backbone remains distributed across the network.
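The adapter idea can be sketched with a toy, pure-Python example (not Petals' actual API, and scalars instead of tensors): the "remote backbone" stays frozen while gradients update only a client-side adapter parameter.

```python
# Toy sketch (not Petals' actual API): a frozen "remote backbone" with a
# trainable client-side adapter, mimicking how gradients flow back to
# local parameters while the distributed layers stay fixed.

def backbone(x):
    # Frozen distributed layers; the client never updates these weights.
    return 3.0 * x

class Adapter:
    def __init__(self):
        self.w = 0.0             # the only trainable parameter (local)

    def __call__(self, h):
        return h + self.w * h    # residual adapter on the backbone output

adapter = Adapter()
target, x, lr = 12.0, 2.0, 0.01
for _ in range(200):
    h = backbone(x)              # forward through frozen layers
    y = adapter(h)
    # dLoss/dw for loss = (y - target)^2, with y = h + w*h
    grad = 2.0 * (y - target) * h
    adapter.w -= lr * grad       # only the adapter weight is updated
print(round(adapter(backbone(x)), 2))  # converges to the target: 12.0
```

In real Petals fine-tuning the backbone is the distributed transformer and the adapter is a trainable module such as LoRA, but the division of labor is the same: remote layers compute, local parameters learn.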
Is Petals free to use?
Yes. Petals is fully open source and free. The network is sustained by volunteer GPU contributors. You can also contribute your own GPU to help support the network for others.