Adapters#

AlphaGenome supports several transfer-learning strategies, selected through TransferConfig. See Yuan et al., 2025 for details on using these adapters with sequence-to-function models, and calico/baskerville for how the same adapters are applied to other models such as Borzoi.

Available Modes#

Mode	Trainable Params	When to Use
`linear`	Heads only	Fast baseline
`lora`	Heads + LoRA adapters	Recommended default; more expressive than the linear baseline
`locon`	Heads + Locon adapters	Alternative to LoRA, applied to conv layers
`ia3`	Heads + IA3 scaling	Minimal added parameters
`houlsby`	Heads + Houlsby bottleneck adapters	Classic bottleneck adapters with residual connection
`full`	All weights	Maximum expressiveness

Linear Probing#

The simplest approach: freeze the entire pretrained trunk and train only the newly added heads. This is the fastest mode and a strong baseline.

config = TransferConfig(
    mode='linear',                   # Freeze the trunk; train heads only
    remove_heads=['atac', 'dnase'],  # Pretrained heads to drop
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},  # New head with 4 output tracks
)
model = prepare_for_transfer(model, config)

No adapter parameters are injected — only head weights are trainable.
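A quick way to confirm that only the head weights are trainable is to count parameters by their `requires_grad` flag. The sketch below assumes the prepared model exposes `named_parameters()` like a standard PyTorch `nn.Module` (the source does not state this explicitly). `FakeParam`, `FakeModel`, and the parameter names are hypothetical stand-ins so the snippet runs without loading AlphaGenome.

```python
class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter."""

    def __init__(self, n, requires_grad):
        self._n = n
        self.requires_grad = requires_grad

    def numel(self):
        return self._n


class FakeModel:
    """Hypothetical prepared model: frozen trunk, one trainable head."""

    def __init__(self):
        self._params = {
            "trunk.conv.weight": FakeParam(1000, requires_grad=False),
            "trunk.attn.q_proj.weight": FakeParam(4096, requires_grad=False),
            "heads.my_atac.weight": FakeParam(64, requires_grad=True),
        }

    def named_parameters(self):
        return self._params.items()


def summarize_trainable(model):
    """Return (trainable_count, total_count, trainable_names)."""
    trainable, total, names = 0, 0, []
    for name, p in model.named_parameters():
        total += p.numel()
        if p.requires_grad:
            trainable += p.numel()
            names.append(name)
    return trainable, total, names


trainable, total, names = summarize_trainable(FakeModel())
print(f"trainable {trainable} / {total}: {names}")
```

With a real model, the same `summarize_trainable` function should list only head parameters after linear probing is configured.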

LoRA#

Low-Rank Adaptation adds small trainable low-rank matrices to Linear layers (typically attention projections) while keeping the trunk frozen. This is the recommended mode for most use cases.

Reference: LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)

config = TransferConfig(
    mode='lora',
    lora_rank=8,          # Rank of the low-rank matrices
    lora_alpha=16,        # Scaling numerator; effective scale is 16 / 8 = 2
    lora_targets=['q_proj', 'v_proj'],  # Substrings matched against module names
    remove_heads=['atac', 'dnase'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

Parameters:

lora_rank — rank of the decomposition (higher = more expressive, more params)
lora_alpha — scaling factor; effective scale is alpha / rank
lora_targets — list of substrings to match in module names (e.g. ['q_proj', 'v_proj'])
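To make the roles of `lora_rank` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the LoRA forward pass from Hu et al., 2021: `y = W x + (alpha / rank) * B (A x)`. The zero initialization of `B` follows the paper and is an assumption about this library, as are the helper names and the toy dimensions.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, rank, alpha):
    """y = W x + (alpha / rank) * B (A x); W is frozen, A and B are trained."""
    scale = alpha / rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


random.seed(0)
d_in, d_out, rank, alpha = 6, 4, 2, 16  # toy sizes, not AlphaGenome's
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init (assumed, as in the paper)
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the pretrained layer exactly.
y_init = lora_forward(W, A, B, x, rank, alpha)
assert y_init == matvec(W, x)

# Once B is non-zero, the output shifts by (alpha / rank) * B A x.
B_trained = [[0.1] * rank for _ in range(d_out)]
y_trained = lora_forward(W, A, B_trained, x, rank, alpha)
print("adapter parameters:", lora_param_count(d_in, d_out, rank))
print("full layer parameters:", d_in * d_out)
```

Raising `rank` grows the parameter count linearly. Raising `alpha` at a fixed rank only rescales the update.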

After training, LoRA weights can be merged into the base layers for zero-overhead inference — see Merging Adapters for Inference below.

Locon#

Locon (LoRA for convolutional layers) applies the same low-rank adaptation to Conv1D layers. It is useful for adapting the convolutional tower.

Reference: Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation (Yeh et al., 2023)

config = TransferConfig(
    mode=['lora', 'locon'],  # Locon on convolutions, combined here with LoRA on attention
    lora_targets=['q_proj', 'v_proj'],
    locon_rank=4,         # Rank for conv decomposition
    locon_alpha=1,        # Scaling factor
    locon_targets=['down_blocks.4', 'down_blocks.5'],  # 4 Locon adapters on encoder
    remove_heads=['atac'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

Parameters:

locon_rank — rank of the decomposition (default: 4)
locon_alpha — scaling factor (default: 1)
locon_targets — list of substrings to match Conv1D module names. Required when Locon is enabled.
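The parameter savings from `locon_rank` can be estimated with simple arithmetic. The sketch below assumes the decomposition used in LyCORIS-style LoCon: a rank-`r` convolution with the original kernel size, followed by a 1x1 convolution back to the output channels. Whether this library uses exactly that layout is an assumption, and the channel sizes are hypothetical.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a full Conv1D layer (bias ignored)."""
    return c_out * c_in * kernel_size


def locon_params(c_in, c_out, kernel_size, rank):
    """Weights in a low-rank pair: down conv (kernel k) then up conv (kernel 1)."""
    down = rank * c_in * kernel_size
    up = c_out * rank
    return down + up


# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, kernel_size, rank = 512, 512, 5, 4
full = conv1d_params(c_in, c_out, kernel_size)
low_rank = locon_params(c_in, c_out, kernel_size, rank)
print(f"full conv: {full}, locon adapter: {low_rank}, ratio: {low_rank / full:.4f}")
```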

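Target whole blocks rather than individual layers. The presets below are named for the number of Locon adapters they inject: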

Locon2: ['down_blocks.5'] (2 Locon adapters)
Locon4: ['down_blocks.4', 'down_blocks.5'] (4 Locon adapters)
Locon6: ['down_blocks.3', 'down_blocks.4', 'down_blocks.5'] (6 Locon adapters)
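The presets above can be reproduced with plain substring matching. In this sketch the module names are hypothetical (two Conv1D layers per down block, which is what the adapter counts imply), and `match_targets` is an illustrative helper rather than the library's own function.

```python
def match_targets(module_names, targets):
    """Return module names that contain at least one target substring."""
    return [name for name in module_names if any(t in name for t in targets)]


# Hypothetical Conv1D module names: six down blocks, two convs each.
module_names = [
    f"encoder.down_blocks.{block}.conv{conv}"
    for block in range(6)
    for conv in (1, 2)
]

presets = {
    "Locon2": ["down_blocks.5"],
    "Locon4": ["down_blocks.4", "down_blocks.5"],
    "Locon6": ["down_blocks.3", "down_blocks.4", "down_blocks.5"],
}

counts = {name: len(match_targets(module_names, t)) for name, t in presets.items()}
for name, n in counts.items():
    print(f"{name}: {n} adapters")
```

One caveat of substring matching in general: a target such as `'down_blocks.1'` would also match `'down_blocks.10'` in a model with ten or more blocks, so it is worth printing the matched names before training.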

IA3#

Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3) learns multiplicative scaling vectors for layer activations. It is extremely parameter-efficient: each adapted layer adds a single vector (output_dim parameters for output scaling).

Reference: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (Liu et al., 2022)

config = TransferConfig(
    mode='ia3',
    ia3_targets=['k_proj', 'v_proj'],  # Output-scaling targets
    ia3_ff_targets=['fc2'],         # Input-scaling targets (feed-forward)
    remove_heads=['atac'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

Parameters:

ia3_targets — modules for output scaling (IA3)
ia3_ff_targets — modules for input scaling (IA3_FF, used in feed-forward layers)
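The difference between the two target lists is where the learned vector is applied. A minimal sketch, with toy weights and hypothetical helper names: output scaling computes `l * (W x)`, while input scaling computes `W (l * x)`. Initializing the vector to ones (as in Liu et al., 2022; assumed here) leaves the pretrained layer unchanged.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def ia3_output_scaled(W, l_out, x):
    """IA3: scale each output feature, y = l_out * (W x)."""
    return [l * y for l, y in zip(l_out, matvec(W, x))]


def ia3_ff_input_scaled(W, l_in, x):
    """IA3_FF: scale each input feature, y = W (l_in * x)."""
    return matvec(W, [l * xi for l, xi in zip(l_in, x)])


W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # d_out = 2, d_in = 3
x = [1.0, 0.0, -1.0]

base = matvec(W, x)
identity_out = ia3_output_scaled(W, [1.0, 1.0], x)        # ones -> no change
identity_in = ia3_ff_input_scaled(W, [1.0, 1.0, 1.0], x)  # ones -> no change

scaled_out = ia3_output_scaled(W, [2.0, 0.5], x)          # 2 learned values
scaled_in = ia3_ff_input_scaled(W, [1.0, 1.0, 0.0], x)    # 3 learned values
print(base, scaled_out, scaled_in)
```

The output-scaling vector has one entry per output feature and the input-scaling vector one entry per input feature, which is why the added parameter count stays so small.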

Houlsby Adapters#

Classic bottleneck adapters insert a down-projection → activation → up-projection block with a residual connection. This implementation follows the Baskerville TensorFlow reference, placing adapters at transformer block boundaries.

Reference: Parameter-Efficient Transfer Learning for NLP (Houlsby et al., 2019)

Block-Level Placement

The default placement inserts adapters after each transformer sub-layer (MHA and MLP), before the residual add:

config = TransferConfig(
    mode='houlsby',
    houlsby_latent_dim=8,            # Bottleneck dimension
    houlsby_placement='block',       # Baskerville-style (default)
    houlsby_targets=['mha', 'mlp'],  # Adapt both MHA and MLP blocks
    unfreeze_norm=True,              # Unfreeze LayerNorm/RMSBatchNorm (default)
    remove_heads=['atac'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

The computation for each transformer block becomes:

x = x + adapter(mha(x))    # adapter has internal residual
  = x + mha(x) + bottleneck(mha(x))

x = x + adapter(mlp(x))
  = x + mlp(x) + bottleneck(mlp(x))
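The expansion above can be checked numerically. This sketch uses toy integer weights, omits biases, and stands in a hypothetical `sublayer` for MHA or MLP. The ReLU activation matches the one named in the merging table; the helper names are illustrative only.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def relu(v):
    return [max(0, vi) for vi in v]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def bottleneck(h, W_down, W_up):
    """Down-projection -> ReLU -> up-projection (biases omitted)."""
    return matvec(W_up, relu(matvec(W_down, h)))


def adapter(h, W_down, W_up):
    """Bottleneck with its internal residual connection."""
    return add(h, bottleneck(h, W_down, W_up))


def sublayer(x):
    """Hypothetical stand-in for mha(x) or mlp(x)."""
    return [2 * xi for xi in x]


W_down = [[1, 0, -1], [0, 1, 0]]       # latent_dim = 2, model dim = 3
W_up = [[1, 0], [0, 1], [1, 1]]
x = [1, 2, 3]

h = sublayer(x)
compact = add(x, adapter(h, W_down, W_up))           # x + adapter(f(x))
expanded = add(x, h, bottleneck(h, W_down, W_up))    # x + f(x) + bottleneck(f(x))
assert compact == expanded

# With a zero up-projection the adapter is the identity on its input.
W_up_zero = [[0, 0], [0, 0], [0, 0]]
assert adapter(h, W_down, W_up_zero) == h
print(compact)
```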

Linear-Level Placement

You can also wrap individual Linear layers (similar to LoRA targeting):

config = TransferConfig(
    mode='houlsby',
    houlsby_latent_dim=8,
    houlsby_placement='linear',
    houlsby_targets=['q_proj', 'v_proj'],  # Target specific projections
    remove_heads=['atac'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

Parameters:

houlsby_latent_dim — bottleneck dimension (default: 8)
houlsby_placement — where to insert adapters:
- 'block' (default): Baskerville-style, at transformer block boundaries
- 'linear': wrap individual Linear layers
houlsby_targets — which components to adapt:
- For 'block': ['mha', 'mlp'] (default), ['mha'], or ['mlp']
- For 'linear': module name substrings like ['q_proj', 'v_proj']
unfreeze_norm — whether to unfreeze normalization layers (default: True). This matches Baskerville’s behavior where LayerNorm parameters are trained alongside adapters.

Combining Adapter Modes#

Adapter modes (lora, locon, ia3, houlsby) can be combined by passing a list to mode. This applies each adapter type simultaneously — for example, LoRA on attention layers and Locon on convolutional layers:

config = TransferConfig(
    mode=['lora', 'locon'],
    # LoRA settings (applied to attention)
    lora_rank=8,
    lora_alpha=16,
    lora_targets=['q_proj', 'v_proj'],
    # Locon settings (applied to convolutions)
    locon_rank=4,
    locon_alpha=1,
    locon_targets=['down_blocks.5'],
    remove_heads=['atac', 'dnase'],
    new_heads={'my_atac': {'modality': 'atac', 'num_tracks': 4}},
)
model = prepare_for_transfer(model, config)

Rules:

'full' cannot be combined with other modes.
'linear' can appear alongside adapter modes — the trunk is frozen and adapter layers are injected on top.
Any subset of ['lora', 'locon', 'ia3', 'houlsby'] can be combined.
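The rules above can be expressed as a small validation function. `validate_modes` is a hypothetical helper written for illustration; the library's own checks may differ in detail and in error messages.

```python
ADAPTER_MODES = {"lora", "locon", "ia3", "houlsby"}
ALL_MODES = ADAPTER_MODES | {"linear", "full"}


def validate_modes(mode):
    """Normalize `mode` to a list and enforce the combination rules."""
    modes = [mode] if isinstance(mode, str) else list(mode)
    if not modes:
        raise ValueError("at least one mode is required")
    unknown = sorted(set(modes) - ALL_MODES)
    if unknown:
        raise ValueError(f"unknown mode(s): {unknown}")
    if "full" in modes and len(modes) > 1:
        raise ValueError("'full' cannot be combined with other modes")
    return modes


print(validate_modes("lora"))
print(validate_modes(["linear", "lora", "locon"]))

try:
    validate_modes(["full", "lora"])
    full_combo_rejected = False
except ValueError as err:
    full_combo_rejected = True
    print("rejected:", err)
```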

Merging Adapters for Inference#

Some adapters can be folded back into the base layer weights, eliminating all adapter overhead at inference time. After merging, the adapted layers become plain nn.Linear modules and the model’s state dict is compatible with vanilla AlphaGenome.

from alphagenome_pytorch.extensions.finetuning import merge_adapters
model = merge_adapters(model)

Adapter	Mergeable?	Reason
LoRA	Yes	Linear decomposition `B @ A` folds into the weight matrix.
IA3 / IA3_FF	Yes	Multiplicative scaling folds into weight rows (IA3) or columns (IA3_FF).
Locon	No	AlphaGenome’s convolutional layers use `StandardizedConv1d`, which applies weight standardization (mean subtraction + variance normalization) on every forward pass. This transformation is not invertible — the per-channel mean is destroyed — so a merged weight tensor cannot be fed back through standardization and produce the correct result. Locon adapters are left in place at inference time; the overhead is one extra small convolution per adapted layer.
Houlsby	No	The bottleneck contains a nonlinear activation (ReLU) between the down- and up-projections, so it cannot be represented as a single linear transform.
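The mergeable rows of the table can be verified with a few lines of arithmetic. This is a sketch with toy weights and hypothetical helper names, not the body of `merge_adapters`: it checks that a single merged weight matrix reproduces the adapted output for LoRA, IA3, and IA3_FF, and shows why a ReLU bottleneck cannot be folded the same way.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def merge_lora(W, A, B, rank, alpha):
    """W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]


def merge_ia3(W, l_out):
    """Output scaling folds into the rows of W."""
    return [[l * w for w in row] for l, row in zip(l_out, W)]


def merge_ia3_ff(W, l_in):
    """Input scaling folds into the columns of W."""
    return [[w * l for w, l in zip(row, l_in)] for row in W]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # d_out = 2, d_in = 3
x = [1.0, 2.0, 3.0]

# LoRA: adapted output equals the merged layer's output.
A, B, rank, alpha = [[1.0, 0.0, -1.0]], [[2.0], [0.5]], 1, 2
lora_out = [y + (alpha / rank) * u
            for y, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
assert matvec(merge_lora(W, A, B, rank, alpha), x) == lora_out

# IA3 (rows) and IA3_FF (columns).
l_out, l_in = [2.0, 0.5], [1.0, 0.0, 2.0]
ia3_out = [l * y for l, y in zip(l_out, matvec(W, x))]
ia3_ff_out = matvec(W, [l * xi for l, xi in zip(l_in, x)])
assert matvec(merge_ia3(W, l_out), x) == ia3_out
assert matvec(merge_ia3_ff(W, l_in), x) == ia3_ff_out


# Houlsby: ReLU is not additive, so no single matrix can stand in for it.
def relu(v):
    return max(0.0, v)


relu_is_additive = relu(1.0 + -2.0) == relu(1.0) + relu(-2.0)
print(lora_out, ia3_out, ia3_ff_out, relu_is_additive)
```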
