OpenELM: Apple’s Groundbreaking Open Language Model

TL;DR

  • Apple has released OpenELM, a family of open-source language models that outperforms comparable open models such as OLMo while using roughly half the pre-training data.
  • OpenELM is a decoder-only transformer language model pre-trained on publicly available datasets using on-the-fly tokenization.
  • The model provides the full training framework and checkpoints and includes code for inference on Apple Silicon.
  • OpenELM utilizes a layer-wise scaling strategy, Grouped Query Attention, SwiGLU feed-forward networks, and Rotary Positional Embeddings for improved accuracy and performance.
  • Apple’s open-sourcing of OpenELM demonstrates its dedication to advancing open research and fostering transparency in the AI field.

Historically, Apple has taken a more reserved approach to artificial intelligence (AI), focusing primarily on enhancing user experiences through subtle AI integrations, such as improvements to Siri. Unlike AI giants like Google and Microsoft, Apple had not prioritized releasing standalone AI products. However, recent years have witnessed a significant shift in this strategy. Apple has ramped up its commitment to AI with an annual investment of about $1 billion, aiming not just to catch up but potentially to lead in the generative AI space. This substantial investment underlines a broader strategy to enhance existing products and explore new AI-driven capabilities across their ecosystem, marking a new era in Apple’s engagement with cutting-edge AI technologies.

Apple has recently made a significant advancement in the AI field with the release of OpenELM, an open-source language model family. Despite its compact size, OpenELM outperforms comparable open models of similar scale, such as OLMo, across a range of benchmarks.

OpenELM is a decoder-only transformer language model released as open-source software. It achieves strong accuracy for its size by allocating parameters non-uniformly across layers through layer-wise scaling. The model is pre-trained efficiently on publicly available datasets using on-the-fly tokenization. Apple also provides the full training framework and checkpoints, and the model demonstrates strong performance across diverse benchmarks. Moreover, the release includes code for inference on Apple Silicon, making OpenELM a favorable foundation model for furthering open research in large language models.
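
As a quick illustration of the inference path, here is a minimal sketch of loading one of the released checkpoints with Hugging Face transformers. It assumes the apple/OpenELM-270M checkpoint ships custom modeling code (hence trust_remote_code=True) and, per the model card, pairs with the Llama 2 tokenizer, which is gated and may require access approval; these details come from the Hugging Face page rather than this article.

```python
# Minimal inference sketch (assumptions: the checkpoint uses custom modeling
# code via trust_remote_code, and the Llama 2 tokenizer per the model card,
# which may be gated).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```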

Notably, Apple is not merely releasing OpenELM as another product but is making it open-source, a significant departure from their traditionally closed approach. By open-sourcing a state-of-the-art language model like OpenELM, Apple is demonstrating its dedication to advancing open research and fostering transparency. This move reassures the research community about the model’s future development and aligns with industry trends, ensuring Apple does not lag behind in emerging technologies such as machine learning and AI.

Architecture

At the heart of OpenELM lies a decoder-only transformer-based architecture, following the design principles of state-of-the-art language models like GPT and LLaMA. However, what truly sets OpenELM apart is its layer-wise scaling strategy, which enables a more efficient allocation of parameters within each layer of the transformer model.

Unlike traditional isotropic models that have the same configuration for each layer, OpenELM adjusts the number of attention heads and the feed-forward network multiplier dynamically across different layers. This means that OpenELM does not allocate the same number of parameters for each layer. Instead, it distributes parameters non-uniformly, allowing it to better utilize the available parameter budget. This improves accuracy and performance compared to models with uniform layer configurations.
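
To make the idea concrete, here is a small sketch of how a layer-wise schedule might interpolate the attention-head count and feed-forward width across the stack. The constants and shapes are illustrative, not OpenELM's actual configuration.

```python
# Layer-wise scaling sketch: interpolate the attention-head count and FFN
# width multiplier across transformer layers instead of repeating one
# configuration (illustrative constants, not the paper's exact settings).
def layer_configs(num_layers, d_model, head_dim,
                  alpha=(0.5, 1.0),   # scales attention width per layer
                  beta=(0.5, 4.0)):   # scales FFN width per layer
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)                      # 0.0 -> 1.0 over depth
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        n_heads = max(1, round(a * d_model / head_dim))
        ffn_dim = int(b * d_model)
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

for cfg in layer_configs(num_layers=4, d_model=1280, head_dim=64):
    print(cfg)  # shallow layers get fewer heads and narrower FFNs than deep ones
```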

Another key component that contributes to OpenELM's performance is Grouped Query Attention (GQA). GQA is a variant of multi-head attention in which the query heads are divided into groups, and each group shares a single key and value head. Sitting between full multi-head attention (one key/value head per query head) and multi-query attention (one key/value head for all query heads), GQA shrinks the key/value projections and the KV cache, reducing memory traffic and computational overhead while preserving most of the accuracy of full multi-head attention. In OpenELM, GQA replaces the standard multi-head attention mechanism, improving efficiency without sacrificing model quality.
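
The following sketch shows the core mechanics of grouped queries in PyTorch: a small number of key/value heads are repeated so that each group of query heads shares them. The tensor sizes and projection shapes are hypothetical, not OpenELM's.

```python
# Grouped Query Attention sketch: many query heads share a smaller set of
# key/value heads, shrinking the KV projections and cache.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    B, T, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)
    # Repeat each KV head so every group of query heads can attend to it.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, n_q_heads * head_dim)

x = torch.randn(1, 8, 256)
wq, wk, wv = torch.randn(256, 256), torch.randn(256, 64), torch.randn(256, 64)
y = grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2)
print(y.shape)  # torch.Size([1, 8, 256])
```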

OpenELM utilizes SwiGLU (Swish-Gated Linear Unit) feed-forward networks, a GLU variant that has shown superior performance in language modeling compared to standard feed-forward networks. In a SwiGLU block, the input is projected twice: one projection passes through the Swish (SiLU) activation and acts as a gate, which is multiplied element-wise with the other projection before being projected back to the model dimension. This gating gives the network finer control over information flow than a single ReLU or GELU activation, which in practice improves learning dynamics and model quality at a comparable parameter budget.
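
A minimal SwiGLU feed-forward block, with dimensions chosen only for illustration, looks roughly like this:

```python
# SwiGLU feed-forward sketch: the hidden activation is gated by a Swish/SiLU
# branch rather than passed through a single ReLU/GELU (hypothetical sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gating branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # back to model dim

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(d_model=512, d_hidden=1376)
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```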

OpenELM utilizes Rotary Positional Embeddings (RoPE) to encode positional information effectively. RoPE encodes token positions by rotating the query and key vectors by angles that depend on each token's position in the sequence. Unlike traditional positional embeddings that add a position vector to the token embedding, RoPE's rotations preserve vector norms and make attention scores depend on the relative offset between tokens rather than their absolute positions. This helps the model's attention mechanism understand relative order, improving the handling of sequences and making RoPE a valuable enhancement for tasks that require acute awareness of structure and order within data.
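
Here is a compact sketch of the standard RoPE rotation applied to a batch of query (or key) vectors; it follows the usual interleaved formulation rather than OpenELM's exact implementation.

```python
# Rotary positional embedding sketch: each pair of channels in a query/key
# head is rotated by an angle proportional to the token position, so dot
# products depend only on relative offsets.
import torch

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq_len, head_dim); head_dim must be even
    b, h, t, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(1, 4, 10, 64)
print(apply_rope(q).shape)  # torch.Size([1, 4, 10, 64])
```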

Additionally, OpenELM leverages Flash Attention to improve the efficiency of attention computation. Flash Attention is a highly optimized, IO-aware attention algorithm introduced by Dao et al. It fuses the attention computation into GPU kernels and tiles it so that intermediate results stay in fast on-chip memory. By reorganizing the softmax and dot-product calculations in this way, Flash Attention significantly reduces reads and writes to GPU memory, allowing for much faster processing and enhanced scalability. It also enables the handling of longer sequences than naive attention implementations, making it useful for applications such as natural language processing and bioinformatics, where rapid processing of long inputs is required.
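
As a usage sketch, PyTorch's built-in scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported GPUs. The snippet below shows that general API rather than OpenELM's own training code, and it assumes a CUDA device with half-precision inputs.

```python
# Fused-attention sketch: ask PyTorch to use its FlashAttention-style kernel
# (requires a supported CUDA GPU; this is the generic PyTorch API, not
# OpenELM's code).
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```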

By combining these architectural innovations with the layer-wise scaling strategy, OpenELM achieves remarkable performance while maintaining a compact size, making it an attractive choice for a wide range of applications, including resource-constrained environments and on-device deployments.

Pre-Training Data and Process

Pre-training data quality and diversity are crucial to language models’ capabilities and performance. Apple curated a dataset of diverse domains and linguistic patterns from various public sources for OpenELM pre-training.

The dataset used for pre-training draws on several public sources:

  • RefinedWeb, a large collection of filtered and sanitized web data;
  • a subset of The Pile, which aggregates diverse text from many sources;
  • segments of the RedPajama dataset, created to reproduce the LLaMA training data; and
  • selected portions of Dolma v1.6, a corpus designed specifically for research on language model pre-training.

The total size of the pre-training dataset utilized for OpenELM is approximately 1.8 trillion tokens, providing a rich and diverse foundation for the model’s language understanding and generation capabilities.

Unlike previous approaches that relied on pre-tokenized data, OpenELM employs on-the-fly tokenization and data filtering techniques, offering greater flexibility and simplifying the prototyping and research process.
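
A rough sketch of what on-the-fly tokenization looks like with streaming data is shown below; the dataset identifier, column name, and tokenizer are stand-ins, since the actual pipeline lives in Apple's CoreNet codebase.

```python
# On-the-fly tokenization sketch: raw text is streamed and tokenized inside
# the data pipeline instead of being pre-tokenized to disk (dataset, column
# name, and tokenizer are placeholders, not OpenELM's actual setup).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

def tokenize_batch(batch):
    return tokenizer(batch["content"], truncation=True, max_length=2048)

tokenized = stream.map(tokenize_batch, batched=True, remove_columns=["content"])
for example in tokenized.take(2):
    print(len(example["input_ids"]))
```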

The training process for OpenELM followed a two-phase approach designed to impart the model with general knowledge, language understanding, and specialized skills. 

  • In the first phase, OpenELM was exposed to heavily filtered web data to build general knowledge and language comprehension.
  • In the second phase, a refined subset of web data was combined with synthetic data generated by large language models, aiming to instill logical reasoning abilities and niche skills.

Evaluation and Experimental Results

Apple used several evaluation frameworks to test OpenELM’s performance. These frameworks focused on different aspects of the model’s capabilities, such as reasoning, knowledge understanding, and detecting misinformation and bias. The tests were comprehensive and covered a wide range of tasks.

The evaluations focused on standard zero-shot tasks (e.g., ARC, BoolQ, HellaSwag, PIQA, SciQ, WinoGrande), tasks from the OpenLLM and LLM360 leaderboards (e.g., ARC Challenge, HellaSwag, MMLU, TruthfulQA, WinoGrande, CrowS-Pairs, RACE), and few-shot settings where the model was provided with a small number of examples to adapt to specific tasks.
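
For reference, evaluations like these zero-shot tasks can be reproduced with EleutherAI's lm-evaluation-harness; the sketch below assumes its v0.4-style Python API and task names, plus the tokenizer pairing from the model card, none of which are spelled out in this article.

```python
# Zero-shot evaluation sketch using EleutherAI's lm-evaluation-harness
# (assumed v0.4-style API; task names and tokenizer pairing are assumptions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=apple/OpenELM-270M,trust_remote_code=True,"
               "tokenizer=meta-llama/Llama-2-7b-hf",
    tasks=["arc_easy", "boolq", "hellaswag", "piqa", "sciq", "winogrande"],
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```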

The results of pre-training for OpenELM are highly impressive. Across various evaluation frameworks and tasks, OpenELM consistently outperformed comparable open language models pre-trained on publicly available datasets. For instance, an OpenELM variant with 1.1 billion parameters achieved remarkable accuracy improvements of 1.28% on standard zero-shot tasks, 2.36% on the OpenLLM leaderboard tasks, and 1.72% on the LLM360 tasks in comparison to OLMo (1.2 billion parameters), while using only half of the pre-training data.

Instruction Tuning

Apple utilized instruction tuning, a technique that involves refining the model on a dataset of prompts and desired outputs, to further enhance OpenELM’s capabilities. 

To carry out this process, the researchers employed the UltraFeedback dataset, which includes 60,000 prompts. The outcomes of this tuning were quite impressive, with OpenELM demonstrating a consistent improvement of 1-2% in its average accuracy across different evaluation frameworks and model sizes.
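
In spirit, instruction tuning is supervised fine-tuning on prompt/response pairs. The toy loop below illustrates that idea; the in-memory examples, hyperparameters, and tokenizer choice are placeholders, not Apple's actual recipe on UltraFeedback.

```python
# Minimal instruction-tuning sketch: fine-tune the causal LM on prompt/response
# pairs (toy examples; assumes the model's forward computes a causal LM loss
# when labels are provided, as standard Hugging Face causal LMs do).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

pairs = [
    ("Summarize: The cat sat on the mat.", "A cat rested on a mat."),
    ("Translate to French: Hello.", "Bonjour."),
]

model.train()
for prompt, response in pairs:
    text = f"{prompt}\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # Labels equal the inputs; in practice prompt tokens are usually masked out.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.3f}")
```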

Parameter Efficient Fine Tuning (PEFT)

Apple researchers have explored the use of parameter-efficient fine-tuning (PEFT) techniques such as LoRA and DoRA, in addition to instruction tuning, for OpenELM.

PEFT methods aim to fine-tune language models with fewer trainable parameters, reducing computational requirements and memory usage. The researchers conducted fine-tuning experiments on the CommonSense reasoning benchmark, which includes 170,000 training samples across eight multiple-choice datasets. 

OpenELM was integrated with LoRA and DoRA, and the results showed that PEFT methods can be successfully applied to OpenELM. Both LoRA and DoRA delivered comparable performance on average across the given CommonSense reasoning datasets. This finding demonstrates the versatility of OpenELM and its compatibility with state-of-the-art techniques for efficient and scalable fine-tuning, which can expand its potential applications in resource-constrained environments.
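
As a sketch of what this looks like with the Hugging Face peft library, the snippet below attaches LoRA adapters to the base model; switching on use_dora (available in recent peft releases) gives DoRA instead. The target module names are assumptions and should be checked against the checkpoint's actual layer names.

```python
# PEFT sketch: wrap the base model with LoRA adapters; only the adapter
# weights are trainable (target module names are assumptions).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj"],  # assumed names; inspect the model
    task_type="CAUSAL_LM",
    # use_dora=True,  # uncomment for DoRA (requires a recent peft release)
)
peft_model = get_peft_model(base, config)
peft_model.print_trainable_parameters()
```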

Benchmarking and Performance Analysis

To validate OpenELM’s performance and efficiency, Apple conducted comprehensive benchmarking and evaluation studies on modern, consumer-grade hardware, including an NVIDIA RTX 4090 GPU and Apple’s M2 Max system-on-chip.

Evaluation Methodology

The benchmarking analysis centered on two crucial aspects of token throughput: prompt processing (pre-fill) and token generation. These measurements offer valuable insight into real-world performance, where both prompt comprehension and text generation speed matter. The researchers performed multiple runs, with warm-up passes and careful cache handling, to obtain precise and consistent measurements.
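
A simplified version of such a measurement, separating pre-fill from generation and including a warm-up pass, might look like the following; the prompt, token counts, and single-run timing are illustrative and far less rigorous than the paper's harness.

```python
# Throughput-measurement sketch: time prompt processing (pre-fill) and token
# generation separately, after a warm-up pass (simplified illustration).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Explain rotary positional embeddings.", return_tensors="pt")

with torch.no_grad():
    model(**inputs)  # warm-up pass

    start = time.perf_counter()
    model(**inputs)  # pre-fill: process the whole prompt in one forward pass
    prefill_s = time.perf_counter() - start

    start = time.perf_counter()
    gen = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    decode_s = time.perf_counter() - start

new_tokens = gen.shape[1] - inputs["input_ids"].shape[1]
print(f"pre-fill:   {inputs['input_ids'].shape[1] / prefill_s:.1f} tokens/s")
print(f"generation: {new_tokens / decode_s:.1f} tokens/s")
```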

Throughput Results 

The results revealed that OpenELM demonstrated higher accuracy for a similar parameter count compared to other models. However, its inference speed was slightly slower than that of OLMo, an open-source large language model (LLM) developed by the Allen Institute for AI (AI2). The researchers also found that a significant portion of OpenELM’s processing time was due to the RMSNorm layer used for normalization, which was implemented in a naive way.

Performance Analysis and Optimization

To understand the performance bottlenecks and identify potential areas for optimization, the researchers conducted a comprehensive profiling of OpenELM. They found that by replacing the naive RMSNorm with Apex’s RMSNorm implementation, they were able to increase OpenELM’s throughput significantly. However, even with this improvement, OpenELM’s performance was not as good as models that used optimized LayerNorm implementations. The research shows that the normalization layer is a critical bottleneck that needs to be addressed to unlock OpenELM’s full performance potential. This analysis has identified significant optimization potential for future endeavors.
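
To make the bottleneck concrete, here is a naive RMSNorm of the kind being described, with the fused Apex alternative indicated in a comment; the module is a generic sketch rather than OpenELM's exact layer.

```python
# RMSNorm sketch: normalize each hidden vector by its root-mean-square.
# A fused kernel (e.g. Apex's FusedRMSNorm, commented out below) avoids the
# extra memory traffic of this naive composition of elementwise ops.
import torch
import torch.nn as nn

class NaiveRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)
print(NaiveRMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])

# Fused alternative, if NVIDIA Apex is installed:
# from apex.normalization import FusedRMSNorm
# norm = FusedRMSNorm(512)
```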

Significance and Impact

The launch of OpenELM is a significant achievement in the field of natural language processing and open research. Apple has taken a commendable step towards promoting transparency and empowering the open research community by providing the complete framework for training, fine-tuning, and evaluating the language model.

Reproducibility and Transparency

One of the key contributions of OpenELM is its emphasis on reproducibility and transparency. Apple has released the entire framework, including training logs, multiple checkpoints, pre-training configurations, and MLX inference code. This comprehensive release allows researchers, developers, and enthusiasts to replicate and validate the results, enabling a deeper understanding of the model’s behavior and performance characteristics and facilitating the identification of potential biases or limitations.

Empowering Open Research

By opening up the inner workings of OpenELM, Apple has empowered the open research community to build upon this foundation and contribute to the advancement of language models. This level of openness fosters collaboration, knowledge-sharing, and the cross-pollination of ideas, ultimately accelerating the pace of innovation in the field of NLP.

Potential Applications and Use Cases

OpenELM’s compact size and ability to run locally on smartphones and other personal devices open up a world of possibilities for seamless and responsive language-based applications. From virtual assistants and real-time translation to conversational interfaces and content generation, OpenELM can power a wide range of innovative solutions.

Privacy, Security, and Personalization

The local nature of OpenELM’s deployment presents significant advantages in terms of privacy and data security. With sensitive data remaining on the user’s device, the risk of unauthorized access, data breaches, or misuse is greatly reduced, instilling confidence in users and promoting trust in the technology. Moreover, OpenELM’s on-device deployment enables personalized experiences tailored to individual preferences, interests, and language patterns.

Accessibility and Inclusivity

By bringing advanced language AI capabilities to personal devices, OpenELM has the potential to democratize access to these technologies, making them available to a wider range of users, including those with limited internet access or computational resources. The open-source nature of OpenELM promotes inclusivity within the AI community, allowing developers and researchers from diverse backgrounds and resource levels to contribute to the field.

Conclusion

OpenELM demonstrates the power of efficient parameter allocation, careful data curation, and responsible AI principles. Its remarkable performance, combined with its compact size and on-device deployment capabilities, make it a groundbreaking achievement in the field of natural language processing.

While challenges remain, such as optimizing performance and addressing potential biases or limitations, Apple’s commitment to open research and transparency paves the way for further innovation, collaboration, and the development of trustworthy AI systems. As the need for smart language models continues to grow in various industries, OpenELM sets the stage for future advancements and showcases the remarkable progress that can be made through architectural innovations, thorough data curation, and a dedication to responsible AI principles.

Key Links

Research Paper: OpenELM: An Efficient Language Model Family with Open Training and Inference Framework

Hugging Face Page: https://huggingface.co/apple/OpenELM-270M

Apple CoreNet GitHub: https://github.com/apple/corenet

Authors: Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari

