PolyCoder

PolyCoder is an open-source LLM trained on source code, available in 160M, 0.4B, and 2.7B parameter sizes on the HuggingFace Hub. MIT-licensed and free for research and commercial use.

About

PolyCoder is an open-source large language model for source code, developed by researchers at Carnegie Mellon University and distributed through Vincent Hellendoorn's Code-LMs research repository. It provides pre-trained neural language models for programming across multiple languages, in three sizes (160M, 0.4B, and 2.7B parameters) to accommodate a range of hardware configurations and performance needs. Trained on a diverse corpus of source code spanning many programming languages, PolyCoder supports code completion, generation, and understanding tasks.

The models are published on the HuggingFace Hub, so developers can integrate them with just a few lines of Python using the widely adopted transformers library. As a fully open-source alternative to proprietary code intelligence models, PolyCoder suits researchers studying code LLMs, developers building self-hosted code AI tools, and organizations that require on-premise solutions without relying on third-party APIs. Its MIT license allows broad commercial and academic use.

The Code-LMs repository also includes evaluation scripts for benchmarking code generation performance and data conversion utilities, making it a complete toolkit for code language model research and experimentation. With over 1.8k GitHub stars, it is a well-recognized contribution to the open-source AI coding ecosystem.

Key Features

  • Multiple Model Sizes: Available in 160M, 0.4B, and 2.7B parameter variants to balance performance with available compute resources.
  • HuggingFace Hub Integration: Fully published on the HuggingFace Hub, loadable in seconds via AutoTokenizer and AutoModelForCausalLM from the transformers library.
  • Multi-Language Code Training: Trained on a large, diverse corpus of source code spanning numerous programming languages for broad generalization.
  • MIT Open-Source License: Freely usable for research and commercial applications with full transparency into model architecture and training approach.
  • Evaluation & Benchmarking Tools: Includes evaluation scripts and data utilities for benchmarking code generation quality against standard metrics.
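The HuggingFace loading path described above can be sketched in a few lines. This is a minimal example, not an official quickstart: it assumes the `NinedayWang/PolyCoder-160M` checkpoint id from the Hub listing (the smallest variant, so it fits on modest hardware) and uses greedy decoding for reproducibility.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Smallest variant; swap in NinedayWang/PolyCoder-0.4B or -2.7B for better quality.
model_id = "NinedayWang/PolyCoder-160M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompt the model with the start of a function and let it complete it.
prompt = "def binary_search(arr, target):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=48,
    do_sample=False,  # greedy decoding so runs are deterministic
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
```

The decoded output includes the original prompt followed by the model's continuation; raising `max_new_tokens` yields longer completions at the cost of slower inference.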

Use Cases

  • Code completion and generation for software development workflows without relying on proprietary cloud APIs
  • Academic research into code language models, training dynamics, and AI programming assistants
  • Building self-hosted code intelligence tools for privacy-sensitive or regulated environments
  • Benchmarking and evaluating code generation model performance against standard metrics
  • Fine-tuning a base code LLM on domain-specific or proprietary codebases for specialized assistance

Pros

  • Fully Open Source: MIT-licensed and self-hostable, making it perfect for privacy-sensitive or air-gapped environments without dependency on external APIs.
  • Easy HuggingFace Integration: Requires only a few lines of code to load and run inference, leveraging the familiar transformers ecosystem.
  • Flexible Sizing: Three model sizes let users run PolyCoder on consumer hardware while scaling up for higher-quality outputs when resources allow.

Cons

  • Smaller Scale Than Commercial Models: At a maximum of 2.7B parameters, PolyCoder lags behind large proprietary systems such as GPT-4 and the models behind GitHub Copilot in code generation quality.
  • Limited Ongoing Maintenance: The repository has limited recent activity, meaning it may not keep pace with newer techniques or language trends in the fast-moving code AI space.

Frequently Asked Questions

What is PolyCoder?

PolyCoder is an open-source large language model trained on source code, developed by researchers at Carnegie Mellon University and released through Vincent Hellendoorn's Code-LMs repository. It is available in 160M, 0.4B, and 2.7B parameter sizes for code generation and completion.

How do I use PolyCoder in my project?

PolyCoder is hosted on the HuggingFace Hub. You can load it using the transformers library with `AutoTokenizer.from_pretrained('NinedayWang/PolyCoder-2.7B')` and `AutoModelForCausalLM.from_pretrained(...)`, requiring transformers version 4.23.0 or later.
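For quick experiments, the same checkpoints also work with the transformers `pipeline` helper, which wraps tokenizer and model loading in one call. A sketch, using the smallest 160M variant so it runs on modest hardware (the checkpoint id is taken from the Hub listing; substitute `NinedayWang/PolyCoder-2.7B` from the answer above for higher quality):

```python
from transformers import pipeline

# One call loads both tokenizer and model for causal text generation.
generator = pipeline("text-generation", model="NinedayWang/PolyCoder-160M")

# The returned list holds one dict per generated sequence.
out = generator("def add(a, b):", max_new_tokens=16, do_sample=False)
completion = out[0]["generated_text"]
print(completion)
```

The `generated_text` field contains the prompt plus the completion, matching the behavior of calling `model.generate` and decoding manually.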

Is PolyCoder free to use?

Yes. PolyCoder is completely free and open-source under the MIT license, permitting both research and commercial use without restriction.

What programming languages does PolyCoder support?

PolyCoder was trained on a diverse multi-language corpus of source code, giving it broad support across many popular programming languages.

How does PolyCoder compare to GitHub Copilot or Codex?

PolyCoder is significantly smaller than the proprietary models powering GitHub Copilot, and it may produce lower-quality completions on complex tasks. In exchange, it offers full transparency, self-hosting capability, and zero API cost.
