Large Language Models (LLMs) are at the heart of many modern AI applications, capable of understanding and generating human-like text. However, as these models grow in size, their deployment costs and resource requirements also skyrocket, often making them prohibitively expensive for most organizations. Imagine spending millions just to deploy a state-of-the-art AI model—this is the challenge many face today.
A groundbreaking research paper from Google DeepMind has introduced an innovative approach called Relaxed Recursive Transformers that could dramatically reduce these costs while maintaining the impressive capabilities of large language models.
Recursive Transformers leverage parameter sharing as a key strategy to reduce model size and deployment costs. Parameter sharing involves reusing the same set of weights across different layers of the model, which significantly reduces the number of parameters without compromising the model’s ability to learn and adapt. Although parameter sharing has been explored in the past, its success with modern LLMs has been limited—until now.
What’s Wrong with Traditional Models?
- High Memory Requirements: Each layer in a traditional large language model has its own set of parameters, which leads to enormous memory usage.
- Complex Scalability: Because every layer is unique, the computational power required increases dramatically as models grow.
- High Costs: The infrastructure needed to deploy and scale these models is expensive, making them inaccessible to smaller organizations.
Recursive Transformers as a Solution

Figure: A Recursive Transformer with N/K blocks of K shared layers, obtained by repeating a single block of K layers multiple times in a looped architecture. The Recursive Transformer can also be converted into a Relaxed Recursive Transformer by adding layer-specific LoRA modules, which preserves many of the advantages of weight sharing while allowing for better performance. (Image courtesy: Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA)
The DeepMind team asked: What if we could reuse the expertise embedded in a single layer for multiple rounds of analysis? This led to the development of Recursive Transformers, where the same set of parameters is used repeatedly, allowing the model to achieve greater depth of analysis with fewer parameters.
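To make the looping idea concrete, here is a minimal PyTorch-style sketch of a Recursive Transformer. It is an illustration under my own assumptions, not the paper's implementation; the class name, layer sizes, and loop counts are made up.

```python
import torch
import torch.nn as nn

class RecursiveTransformer(nn.Module):
    """Minimal sketch: one block of K layers is looped several times, so the
    model has the depth of K * n_loops layers but only K layers' worth of weights."""

    def __init__(self, d_model=512, n_heads=8, k_layers=3, n_loops=6):
        super().__init__()
        # A single block of K unique layers; its weights are reused on every loop.
        self.shared_block = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(k_layers)]
        )
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):        # repeat the same block of layers
            for layer in self.shared_block:
                x = layer(x)
        return x

model = RecursiveTransformer()
tokens = torch.randn(2, 16, 512)             # (batch, sequence, d_model)
out = model(tokens)                          # 3 unique layers applied 6 times = depth 18
```

The key point is that the effective depth (k_layers × n_loops) is decoupled from the number of unique weight matrices (k_layers only).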
Benefits of Recursive Transformers:
- Reduced Memory Footprint: Sharing parameters across repeated layers means significantly less memory is required compared to models with unique parameters for each layer (a rough parameter count follows this list).
- Increased Throughput: The repetitive structure also means that processing can be streamlined, potentially boosting throughput and making model inference more efficient.
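Continuing in the same illustrative vein, a quick back-of-the-envelope count shows how sharing shrinks the transformer-block weights. The layer sizes below are arbitrary examples, and embeddings and output heads are not shared, so real-world savings are somewhat smaller than this ratio suggests.

```python
import torch.nn as nn

def n_params(module):
    """Count trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# One standard transformer encoder layer (sizes here are arbitrary examples).
layer = nn.TransformerEncoderLayer(d_model=2048, nhead=16, batch_first=True)

full_blocks = n_params(layer) * 18       # 18 unique layers in a vanilla model
recursive_blocks = n_params(layer) * 3   # a block of 3 unique layers, looped 6 times

print(recursive_blocks / full_blocks)    # 3/18 ~ 0.17 of the block parameters
```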
Parameter Sharing and Stepwise Initialization
However, the journey wasn’t as straightforward as simply reusing parameters. The researchers discovered that how these shared parameters are initialized—or ‘taught’ their initial skills—makes all the difference between success and failure.
Think of it like training a master chef. You could start with a novice and hope they learn through repetition (random initialization), or you could carefully select and combine the skills of several experienced chefs. The researchers found that the latter approach, implemented through several distinct initialization strategies, was crucial for success.
- Random Initialization: This method starts from scratch, giving the model random values for its parameters. While simple, it often leads to slower convergence and suboptimal performance, as the model lacks any foundational knowledge.
- Layer-wise Pretraining: Here, each layer is pretrained individually, using smaller, targeted datasets or tasks. This approach allows the model to build foundational knowledge at each layer before integrating them into the full model, but it can be resource-intensive and less cohesive.
- Stepwise Initialization: The most successful approach carefully selects specific checkpoints of expertise from the original model. Instead of trying to compress all of its knowledge at once, it preserves both fundamental skills (from early layers) and sophisticated understanding (from later layers). This balanced foundation led to a remarkable 37% improvement in performance compared to random initialization.
Stepwise Initialization outperforms other methods by preserving both fundamental and advanced skills, leading to smoother training and superior performance. Unlike random initialization, which often struggles to converge, or layer-wise pretraining, which is resource-intensive, Stepwise Initialization finds a balance that ensures robust and consistent learning.
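As a rough illustration (the paper's exact layer-selection rule may differ), the sketch below picks K source layers spread evenly across a pretrained model's depth and copies their weights into the shared block; `stepwise_init` and the stride rule are hypothetical choices for this example.

```python
import copy

def stepwise_init(full_model_layers, k):
    """Pick K source layers spread evenly across the full model's depth so the
    shared block inherits both early-layer (fundamental) and late-layer
    (abstract) skills, then copy their weights into the shared block."""
    n = len(full_model_layers)
    stride = n // k
    source_ids = [i * stride for i in range(k)]   # e.g. n=18, k=6 -> 0, 3, 6, 9, 12, 15
    shared_block = [copy.deepcopy(full_model_layers[i]) for i in source_ids]
    return shared_block, source_ids

# Hypothetical usage with a pretrained model that exposes a list of layers:
# shared, ids = stepwise_init(list(pretrained.layers), k=6)
# recursive_model.shared_block = torch.nn.ModuleList(shared)
```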

Figure: Left: Methodologies for initializing looped layers in a Recursive Transformer, where each layer number indicates the source layer in the full-size model used for initialization. Right: Example of a Relaxed Recursive Transformer initialized by the SVD method, with looped layers initialized using the Average method. (Image courtesy: Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA)
Relaxed Recursive Transformers: Adding Flexibility
The concept of Recursive Transformers introduces what can be called a Basic Recursive Approach, which involves reusing the same set of parameters across different depths of the model. In traditional large language models, each layer has its own unique parameters, which provide greater expressiveness but also significantly increase memory and computational requirements. In contrast, the Basic Recursive Approach reuses a single block of weights across the model’s full depth, drastically reducing the model size and computational needs.
However, this strategy comes with limitations. By using identical parameters across all layers, the model lacks flexibility, making it less effective at adapting to diverse types of information and complex patterns. This rigidity can lead to reduced effectiveness, especially when dealing with varied nuances or when different processing is required at different stages of the model.
To overcome these limitations, the research team introduced Relaxed Recursive Transformers. These models enhance the basic recursive structure by adding targeted flexibility, achieved through Low-Rank Adaptation (LoRA), which allows for subtle modifications to the shared weights—enabling better adaptability to different types of data.
What is LoRA? LoRA is a technique designed to add flexibility to parameter sharing in models like Recursive Transformers. Instead of using fully independent parameters for each layer, LoRA allows the model to make small, targeted adjustments to the shared parameters, which helps the model maintain both efficiency and adaptability.
LoRA functions like fine-tuning the model at each step to optimize results, allowing shared weights to be adapted slightly without introducing the full complexity of separate parameters for each layer.
How Does LoRA Work? LoRA works by adding low-rank matrices that modify shared weights with minimal overhead. These modifications capture the essential differences needed at each layer iteration, allowing for fine adjustments that keep the model efficient while enhancing its flexibility. The rank of the LoRA matrices determines the extent of these adjustments—higher ranks add more adaptability, whereas lower ranks keep the model lean.
- Small Layer-Specific Modifications: LoRA introduces low-rank matrices to modify the shared weights at each iteration, allowing the model to slightly adjust how it processes data each time. This means that while the core parameters are the same, they can adapt slightly based on the specific data being processed.
- Rank of LoRA Matrices: The rank of the LoRA matrices determines how much flexibility is introduced. Higher ranks mean greater flexibility, which allows the model to better adapt to different situations, while lower ranks keep the model more compact. Striking the right balance is crucial for maintaining both efficiency and model performance.
- Initialization Using SVD: LoRA modules are initialized using truncated Singular Value Decomposition (SVD), a mathematical method that breaks a matrix down into three simpler components: U, Σ, and V. In the context of LoRA, SVD is applied to the difference between the original and shared weights. By truncating the decomposition, only the most significant components are retained, which simplifies the representation and reduces computational complexity while preserving the key characteristics of the model.
By using Singular Value Decomposition (SVD), the researchers could identify the most impactful adjustments needed for shared weights. SVD breaks the differences into simpler components, enabling efficient parameter modifications that reduce redundancy while preserving critical features of the model.
This initialization ensures that Relaxed Recursive Transformers can smoothly transition between a fully shared recursive model and a more traditional model structure. The SVD-based initialization is key because it not only reduces the parameter count but also retains crucial features that contribute to model quality, thereby balancing efficiency with high-performance gains.
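The sketch below shows one way a layer-wise LoRA delta on a shared weight could look, and how it could be initialized with a truncated SVD of the difference between the original and shared weights. It is an approximation of the idea rather than the paper's code; `LoRALinear`, `init_from_svd`, and the rank value are illustrative names and choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x @ (W_shared + B A)^T, where A and B are low-rank matrices."""

    def __init__(self, shared_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_f, in_f = shared_weight.shape
        # In a full implementation the shared weight would be a trainable
        # nn.Parameter owned by the shared block and tied across loop iterations.
        self.shared_weight = shared_weight
        self.A = nn.Parameter(torch.zeros(rank, in_f))
        self.B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x):
        w = self.shared_weight + self.B @ self.A      # small, layer-specific adjustment
        return nn.functional.linear(x, w)

    def init_from_svd(self, original_weight: torch.Tensor):
        """Truncated SVD of the residual between the original (unshared) weight
        and the shared weight; keep only the top-`rank` components."""
        delta = original_weight - self.shared_weight
        U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
        r = self.A.shape[0]
        self.B.data = U[:, :r] * S[:r]                # scale left vectors by singular values
        self.A.data = Vh[:r, :]
```

Because A and B start as the best rank-r approximation of the residual, the relaxed model begins close to the original full-size layer rather than from scratch.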

Figure: Continuous depth-wise batching with early exiting, assuming a maximum batch size of 32, three model “stages” (e.g., layer blocks), and a stream of batched inputs that arrive sequentially in time. In (a), all three model stages must complete for the first (non-maximal) batch of 16 before the second batch of 32 examples that arrives next can be started. In (b), however, half of the second batch of 32 examples can share computation with the first batch of 16 that is still finishing. Finally, (c) demonstrates a situation where some examples within each batch can early-exit after stage 2; their vacant slots in the batch are then immediately filled. (Image courtesy: Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA)
The beauty of this approach is that it maintains the core efficiency benefits of parameter sharing while adding just enough flexibility to optimize performance for different tasks. This balance is what makes Relaxed Recursive Transformers so powerful—they can adapt without the massive computational costs associated with non-recursive, fully independent layers.
Multi-LoRA Layers: Enhanced Flexibility Without the Cost
The paper also introduces Multi-LoRA Layers as a significant enhancement to the Relaxed Recursive Transformer. Unlike a simple low-rank adaptation applied uniformly, Multi-LoRA uses multiple low-rank matrices to adjust the shared layers differently at each repetition, effectively capturing a broader range of features.
By layering several low-rank matrices into each repeated iteration, the model gains a more dynamic ability to adapt to the diverse patterns present in the data, improving its performance without bloating its parameter size. This innovation allows Recursive Transformers to maintain high performance across both generic and specialized tasks.
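One way to picture Multi-LoRA is a single shared weight matrix paired with a separate low-rank (A, B) pair for every loop iteration. The sketch below is written under that assumption; `MultiLoRASharedLinear` and its arguments are made-up names, not the paper's API.

```python
import torch
import torch.nn as nn

class MultiLoRASharedLinear(nn.Module):
    """One shared weight matrix plus a distinct low-rank (A, B) pair per loop
    iteration, so each pass over the shared block can adapt differently."""

    def __init__(self, in_f, out_f, n_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(in_f, out_f, bias=False)            # tied base weights
        self.A = nn.ParameterList([nn.Parameter(torch.randn(rank, in_f) * 0.01)
                                   for _ in range(n_loops)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(out_f, rank))
                                   for _ in range(n_loops)])

    def forward(self, x, loop_idx):
        delta = self.B[loop_idx] @ self.A[loop_idx]                 # iteration-specific delta
        return self.shared(x) + nn.functional.linear(x, delta)

layer = MultiLoRASharedLinear(in_f=512, out_f=512, n_loops=3)
x = torch.randn(4, 512)
outputs = [layer(x, loop_idx=i) for i in range(3)]   # same base weights, three adaptations
```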
Continuous Depth-wise Batching
To further enhance efficiency, the researchers introduced Continuous Depth-wise Batching, a novel approach that allows different stages of processing to happen simultaneously, much like an optimized assembly line. This drastically improves throughput by leveraging the recursive structure to process data across different iterations concurrently.
Difference from Sequence-wise Batching:
- Sequence-wise Batching: Groups tokens from the same sequence together for simultaneous processing.
- Depth-wise Batching: Groups computations across different depths (loop iterations) of the model for multiple sequences, enhancing parallel processing efficiency.
This method led to speed improvements of up to 2-3 times compared to traditional methods, making it particularly effective for real-time AI applications.
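The toy scheduler below is one way to visualize the idea (it is illustrative only, not the paper's implementation): because every stage runs the same shared block, requests sitting at different depths can be packed into a single forward pass, and a slot freed by a finished or early-exited request is refilled immediately instead of waiting for the whole batch to drain.

```python
def depthwise_batching_steps(arrivals, n_stages=3, max_batch=32):
    """Count shared-block passes needed to serve all requests when slots freed
    mid-flight are refilled immediately (continuous depth-wise batching)."""
    pending = 0        # requests that have arrived but not yet started
    in_flight = []     # current stage (depth) of every running request
    steps, t = 0, 0
    while pending or in_flight or t < len(arrivals):
        if t < len(arrivals):
            pending += arrivals[t]                          # new requests arrive
        started = min(max_batch - len(in_flight), pending)
        in_flight += [0] * started                          # fill free slots at stage 0
        pending -= started
        in_flight = [s + 1 for s in in_flight]              # one pass over the shared block
        in_flight = [s for s in in_flight if s < n_stages]  # finished requests exit
        steps += 1
        t += 1
    return steps

# Scenario loosely inspired by the figure shown earlier: 16 requests arrive, then 32 more.
print(depthwise_batching_steps([16, 32]))
```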

This is not just an incremental improvement; it’s a fundamental rethinking of how language models can process information more efficiently.
Knowledge Distillation and Extended Training
Perhaps one of the most exciting breakthroughs came from combining these architectural innovations with two powerful training techniques: knowledge distillation and extended training. Knowledge distillation involves having a larger, more knowledgeable model teach a smaller one, while extended training gives the smaller model more time to learn and refine its understanding.
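For reference, a standard distillation loss looks like the sketch below: the student (here, the recursive model) is trained to match the teacher's softened output distribution. This is the generic formulation; the paper's exact objective, temperature, and weighting may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened token
    distributions, scaled by T^2 as is conventional for distillation."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Hypothetical usage: teacher = full-size model, student = recursive model.
# loss = distillation_loss(student(batch), teacher(batch).detach())
```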
In one striking example, a recursive version of the Gemma model, trained on just 60 billion tokens with knowledge distillation, performed nearly as well as the original model trained on 3 trillion tokens. This means the recursive model achieved comparable performance with only 2% of the original training data, making it significantly more efficient.
Real-World Impact and Performance Gains
- Stepwise Initialization: Using carefully curated checkpoints to initialize the model led to a 37% improvement in performance compared to random initialization, preserving both basic and advanced skills.
- Relaxed Recursive Transformers with LoRA: By adding controlled flexibility, Relaxed Recursive Transformers combine the benefits of parameter sharing with the ability to make nuanced adjustments, maintaining high performance without expanding model size unnecessarily.
- Multi-LoRA: Multi-LoRA layers further increase adaptability without large increases in parameter counts, effectively handling diverse input patterns.
- Continuous Depth-wise Batching: This technique boosts processing speed by enabling multiple stages of computation to happen simultaneously, crucial for real-time applications.
The research team validated their approach on models like Gemma, TinyLlama, and Pythia.
The recursive version of Gemma 2B achieved 98.5% of the original model’s accuracy, using only half the parameters and running more than twice as fast. This breakthrough could significantly reduce the cost and complexity of deploying advanced AI models in real-world applications.
Why This Matters
- Reduced Resource Requirements: Recursive Transformers make large models more accessible, significantly reducing the computational footprint and enabling deployment in environments with constrained resources, such as mobile devices and edge computing.
- Improved Throughput with Continuous Depth-wise Batching: Recursive Transformers can perform inferences faster by executing multiple stages concurrently. This enhancement is especially beneficial for real-time applications like customer service bots or financial systems where response time is crucial.
- Maintaining Performance with Efficiency: The use of knowledge distillation allows Recursive Transformers to match or surpass the performance of much larger models, using a fraction of the training data and computational cost.
The Road Ahead
While the results are promising, the researchers believe that we have only scratched the surface of what is possible with Recursive Transformers. They suggest exploring ways to scale this approach to much larger models (7B+ parameters) and optimize early-exit strategies to further enhance efficiency.
There are also opportunities for better balancing the degree of parameter sharing, adaptation allowed, and computational trade-offs to achieve even greater performance. This line of research could fundamentally change our approach to AI by showing that smarter architecture and training techniques can enable us to achieve more with less.
Conclusion
The development of Recursive Transformers is more than just another technical advancement: it redefines how we can build and deploy large language models efficiently. By rethinking the core structure, Recursive Transformers reuse parameters effectively while adding targeted flexibility, allowing them to adapt to complex data while significantly cutting memory and computational costs. This makes AI technology more cost-effective and accessible, enabling smaller organizations and applications to leverage advanced models that were previously out of reach due to high resource demands.
- Reduced Costs: Recursive Transformers dramatically reduce GPU and memory requirements, making AI accessible to organizations with limited resources.
- Efficiency Gains: Existing pretrained models can be converted into Recursive Transformers for greater efficiency, enabling cost savings and more sustainable practices.
- Wider Impact: By maintaining high performance while reducing resource demands, Recursive Transformers are set to make AI both powerful and accessible, breaking down barriers to entry for many.
GPU Usage Comparison: Traditional LLMs vs. Recursive Transformers
| Feature | Traditional LLMs | Recursive Transformers |
|---|---|---|
| Parameter Storage | Each layer has a unique set of parameters, leading to high GPU memory consumption. | Parameters are reused across multiple layers, significantly reducing GPU memory requirements. |
| Training and Inference | Each unique layer’s parameters must be loaded into GPU memory, increasing resource footprint. | Fewer parameters mean reduced GPU memory usage, enabling more efficient training and inference. |
| Scaling Requirements | Scaling to larger models requires multiple high-end GPUs or specialized infrastructure. | Requires fewer resources, allowing scaling with less expensive hardware. |
| Flexibility with LoRA | No parameter sharing, unique weights per layer. | LoRA introduces low-rank matrices for slight adjustments, adding flexibility with minimal extra memory cost. |
| Efficiency Gains | High memory usage due to unique parameters per layer. | Experiments show up to 50% reduction in GPU memory usage compared to traditional LLMs. |
Key Links
Research Paper: Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Authors: Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster