How to integrate vLLM inference into your macOS and iOS apps

A developer's guide to vLLM on macOS and iOS

May 13, 2025
Rich Naszcyniec
Related topics:
Artificial intelligence, Programming languages & frameworks
Related products:
Red Hat AI, Red Hat Enterprise Linux AI, Red Hat OpenShift AI

    Imagine this scenario: Your company has carefully selected an open source large language model (LLM) as its foundation for generative AI, meticulously aligning and tuning it with proprietary data to meet specific business needs. As a macOS and iOS developer, you’ve just received the green light: the model is ready to integrate with applications. Your first assignment is to harness this finely tuned LLM to create an internal chatbot, bringing its advanced conversational abilities to life for users on macOS and iOS devices your company uses. 

    With the chatbot assignment in hand, you start to think about how you will build the application. You’ll use Swift as your implementation language, SwiftUI to build the user experience, and likely SwiftData for local storage of data as needed, such as conversation history.

    However, the question of how to connect the chatbot to the model requires some discovery. Fortunately, after consulting with your internal AI team, you learn that the company has deployed the model using vLLM, a high-performance serving engine designed to handle LLM interaction—online inference—efficiently.

    What is vLLM?

    vLLM, which stands for virtual large language model, is a library of open source code maintained by the vLLM community. It helps large language models (LLMs) perform calculations more efficiently and at scale.

    Specifically, vLLM is an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.

    By using vLLM for model inference, you have two robust options for interacting with your company’s model from Swift: the OpenAI-compatible Completions and Chat APIs, which provide a widely used, flexible interface for sending prompts and receiving responses, and the newly introduced Llama Stack API, which promises standardized access to many advanced generative AI features, including inference.

    Good news. You now know your connectivity options and can move on to the next question: Should you use the OpenAI-compatible API or the Llama Stack API?

    High-level information to keep in mind as you weigh your options:

    • The OpenAI-compatible API has been in use for many years and is widely adopted, so it might already be familiar to existing AI application developers. It provides a flexible interface for sending prompts and receiving responses for text, images and vision, audio and speech, and more.
    • The Llama Stack API is a relatively new project that offers the advantage of defining and standardizing the core building blocks needed to bring generative AI applications to life. Inference is one of those building blocks, but the specification also covers RAG, agents, tools, safety, evals, and telemetry.

    Maximize flexibility using low-level HTTP communications with a standard Apple developer framework

    Because both the vLLM OpenAI-compatible endpoint and the Llama Stack API communicate using REST, macOS and iOS developers can reach either one with low-level HTTP code built on Apple’s standard URLSession framework. This is a versatile, dependency-free approach that lets you create inference requests, handle responses, and manage data flow natively, with access to every low-level networking detail.
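    To make this concrete, here is a minimal, non-streaming sketch of a chat request sent to a vLLM server’s OpenAI-compatible /v1/chat/completions endpoint using URLSession. The server URL and model name are placeholders for whatever your AI team provides, and the Codable types cover only the fields a simple chatbot needs.

        import Foundation

        // Minimal Codable types for the OpenAI-compatible /v1/chat/completions endpoint.
        struct ChatMessage: Codable {
            let role: String
            let content: String
        }

        struct ChatRequest: Codable {
            let model: String
            let messages: [ChatMessage]
        }

        struct ChatResponse: Codable {
            struct Choice: Codable { let message: ChatMessage }
            let choices: [Choice]
        }

        // Send one prompt and return the model's reply.
        // Placeholder server URL and model name; substitute your deployment's values.
        func askModel(_ prompt: String) async throws -> String {
            let url = URL(string: "https://vllm.example.internal/v1/chat/completions")!
            var request = URLRequest(url: url)
            request.httpMethod = "POST"
            request.setValue("application/json", forHTTPHeaderField: "Content-Type")
            request.httpBody = try JSONEncoder().encode(
                ChatRequest(model: "my-tuned-model",
                            messages: [ChatMessage(role: "user", content: prompt)])
            )

            let (data, response) = try await URLSession.shared.data(for: request)
            guard let http = response as? HTTPURLResponse, http.statusCode == 200 else {
                throw URLError(.badServerResponse)
            }
            let decoded = try JSONDecoder().decode(ChatResponse.self, from: data)
            return decoded.choices.first?.message.content ?? ""
        }

    A production chatbot would add streaming, retries, and richer error handling, which is exactly the code this approach makes you own.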

    Advantages of this approach:

    • No external dependencies: By relying solely on URLSession, you avoid adding third-party libraries, keeping your project lightweight, and reducing potential versioning conflicts.
    • Universal compatibility: Works seamlessly with any REST endpoint. This flexibility ensures you’re not locked into one API’s ecosystem, making it easier to switch or support both in a single app.
    • Full control: You manage every aspect of the request—headers, timeouts, retries, and response parsing. This is valuable for customizing interactions with vLLM’s OpenAI-compatible API (e.g., tweaking streaming) or adapting to Llama Stack’s evolving features (e.g., RAG-specific parameters).
    • Native Swift integration: Leverages Apple’s modern concurrency (e.g., async/await in iOS 15+) and Swift 6, making it future-proof and performant.

    Disadvantages of this approach:

    • Verbosity: Crafting requests and parsing responses manually requires more code than library-based, purpose-built options.
    • Error handling: You’re responsible for handling all HTTP errors, endpoint API errors (e.g., 429 rate limits from OpenAI, 500 errors from Llama Stack), network failures, and malformed JSON.
    • Time-intensive development: Building and testing low-level HTTP calls takes longer than using a library with prebuilt methods. 
    • Maintenance burden: As APIs evolve (e.g., OpenAI adding endpoints or Llama Stack refining its spec), you must update your code manually. With OpenAI’s frequent updates and Llama Stack’s early-stage development, this could become a recurring task.

    Using low-level HTTP communications with URLSession is a viable option that shines in its independence and control, letting you tailor requests to either API’s needs without external overhead. However, it trades convenience for complexity, requiring more effort to handle edge cases and maintain over time. For a chatbot leveraging your company’s tuned LLM, this approach works well if you’re comfortable with the DIY trade-off—or if you’re laying the groundwork for a custom integration that might later evolve into a library-based solution.

    However, many developers favor speed of innovation and the ability to iterate application versions rapidly over investing time in low-level network control. With that objective in mind, let’s look at some purpose-built open source projects that can get the chatbot application started quickly.

    Using an open source project for OpenAI and Llama Stack REST endpoints

    Using an open source project that can communicate with both the OpenAI and Llama Stack APIs abstracts away much of the low-level HTTP complexity that comes with Apple’s standard URLSession framework. You still get full vLLM inference interaction with the model, and you can get started on your project faster than with URLSession.

    Alamofire

    Alamofire is one of the most popular networking libraries in the Swift ecosystem for HTTP network communications. The project has over 41,000 GitHub stars, and more than 251 project contributors. The project falls under the permissive MIT license. 

    While not specific to OpenAI or Llama Stack, it simplifies HTTP requests compared to URLSession and interacts seamlessly with REST endpoints. It’s a solid option for developers who already use Alamofire, or who need a reliable, general-purpose networking solution but don’t want to work directly with the Apple URLSession framework.
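    For comparison, here is roughly the same non-streaming chat request written with Alamofire. It reuses the ChatRequest, ChatMessage, and ChatResponse Codable types from the URLSession sketch above, and the server URL and model name remain placeholders for your deployment.

        import Alamofire

        // Reuses the ChatRequest/ChatMessage/ChatResponse Codable types defined in the
        // earlier URLSession sketch. Placeholder server URL and model name.
        func askModel(_ prompt: String) async throws -> String {
            let body = ChatRequest(model: "my-tuned-model",
                                   messages: [ChatMessage(role: "user", content: prompt)])

            let response = try await AF.request(
                "https://vllm.example.internal/v1/chat/completions",
                method: .post,
                parameters: body,
                encoder: JSONParameterEncoder.default
            )
            .validate()
            .serializingDecodable(ChatResponse.self)
            .value

            return response.choices.first?.message.content ?? ""
        }

    Encoding, validation, and decoding collapse into one fluent chain, which is the kind of boilerplate reduction Alamofire is known for.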

    Advantages of this approach:

    • Simplified networking: Abstracts away low-level HTTP details (e.g., headers, encoding), reducing boilerplate code for API requests.
    • Broad adoption: With a large and active user and contributor community, it’s well-tested and widely trusted across Swift projects.
    • Modern Swift support: Integrates with Codable, async/await, and other Swift features for clean, type-safe code.
    • Robust features: Offers built-in response validation, retry mechanisms, and JSON handling, streamlining error management and debugging.
    • Extensive documentation: Comprehensive guides and examples make it easy to implement and troubleshoot, even for complex requests.

    Disadvantages of this approach:

    • General-purpose design: Not tailored to OpenAI or Llama Stack, lacking specific abstractions for their endpoints or features (e.g., streaming).
    • Learning curve: Requires familiarity with its extensive API, which might slow onboarding of new developers.
    • Less control: Hides some low-level details (e.g., raw URLSession configuration), which can limit customization for advanced use cases.
    • Overkill for small projects: For minimal API calls, its feature set might be excessive.

    Using Alamofire will simplify your HTTP communications code compared to URLSession. However, like URLSession, it trades the convenience of fit-for-purpose APIs for access to many low-level HTTP capabilities. For applications that require multiple types of HTTP communications, this approach might work better than the fit-for-purpose API options discussed next.

    The Llama Stack open source project

    The Llama Stack project is relatively new; Meta released it in November 2024. The project currently has nearly 8,000 GitHub stars, an impressive 130+ contributors, and is growing rapidly. The project falls under the permissive MIT license.

    Llama Stack supports the inference capabilities needed for a chatbot, but it also allows for application functionality built around Retrieval-Augmented Generation (RAG), agents, tools, safety, evals, and telemetry. This broad scope of capabilities matches the project’s goal of standardizing the core building blocks that simplify AI application development.

    Llama Stack lets you swap providers for the various functional building blocks without rewriting your application code, helping you avoid vendor lock-in. vLLM is one of the provider options for the inference building block.

    Want more information on vLLM as an inference provider for Llama Stack?

    Take a look at the Introducing vLLM Inference Provider in Llama Stack article for more details.

    In our example scenario, your internal AI team has set up vLLM as a remote inference provider and is running one or more Llama Stack servers for applications to connect to. This deployment allows multiple applications to share the provider.

    As a result of this deployment pattern, even though your application gets its inference capabilities from vLLM, it will request inference services through the Llama Stack server using the Llama Stack inference API. Fortunately, the Llama Stack project provides client libraries for multiple languages, including Swift.

    Advantages of this approach:

    • Standardized API access: Provides a unified interface to Llama Stack’s inference, agent, and RAG APIs, simplifying integration with advanced AI features.
    • Native Swift integration: Built for iOS/macOS, it supports Swift’s modern concurrency (e.g., async/await), ensuring clean, performant code for Apple platforms.
    • Custom tool calling: Enables defining Swift-based tools for Llama Stack agents, enhancing chatbot interactivity with app-specific logic (e.g., calendar integration).
    • Built-in safety: Leverages Llama Stack’s safety features (e.g., Llama Guard), ensuring responsible AI output for your chatbot without extra effort.

    Disadvantages of this approach

    • Early-stage maturity: Released in November 2024, the Swift client is relatively new compared to OpenAI libraries.
    • Feature gaps: The Swift client does not yet support all Llama Stack capabilities.
    • Learning curve: Developers unfamiliar with Llama Stack’s ecosystem might face a steeper onboarding process compared to the widely known OpenAI API structure.
    • Sparse documentation: While improving, documentation is less comprehensive than OpenAI’s, potentially requiring experimentation to master advanced features like RAG.

    For a chatbot project, a developer can get started using just a handful of Llama Stack APIs. Swift developers can also create other types of AI-enabled applications using Llama Stack; for example, you could build agents in Swift that integrate with a Llama Stack server. Currently, llama-stack-client-swift supports inference, agents, and custom tool calling. As Llama Stack and the llama-stack-client-swift package mature, additional capabilities will be supported.
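    For orientation only, an inference call with llama-stack-client-swift looks roughly like the sketch below. The type and method names (RemoteInference, Components.Schemas.ChatCompletionRequest, and so on) follow the client’s generated bindings at the time of writing and may change as the project matures; the server URL and model identifier are placeholders, so check the project’s README for current usage.

        import Foundation
        import LlamaStackClient

        // Approximate sketch: generated type names can differ between client releases.
        // Placeholder Llama Stack server URL and model identifier.
        func streamChat(_ prompt: String) async throws {
            let inference = RemoteInference(
                url: URL(string: "http://llamastack.example.internal:8321")!
            )

            let request = Components.Schemas.ChatCompletionRequest(
                model_id: "my-tuned-model",
                messages: [
                    .UserMessage(Components.Schemas.UserMessage(
                        content: .case1(prompt),
                        role: .user
                    ))
                ],
                stream: true
            )

            // Responses stream back as chunks carrying incremental text deltas.
            for await chunk in try await inference.chatCompletion(request: request) {
                print(chunk) // In a real app, append the chunk's delta to the chat transcript.
            }
        }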

    Personally, I’ve switched to the Llama Stack and the llama-stack-client-swift for most of my macOS and iOS development when working on AI applications.

    Open source projects for the OpenAI-compatible API

    The OpenAI API is a more “traditional” route to take. (If there is such a thing as traditional development for AI!) If you choose the OpenAI API route and want to get started as quickly as possible while avoiding the finer details of OpenAI HTTP communications, there are multiple open source projects purpose-built for exactly that. This article focuses on two of those options; each abstracts away low-level HTTP details and provides Swift-friendly APIs that simplify and accelerate application development.

    SwiftOpenAI

    SwiftOpenAI is a dedicated OpenAI API wrapper that is growing in popularity, with nearly 500 GitHub stars and 11 active contributors. The project falls under the permissive MIT license. 

    The project supports most OpenAI endpoints, including chat completions (which a chatbot would use), vision, and assistants. Getting started with the API is fairly well documented, and a sample application is available with the project.
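    As a rough sketch only (factory and parameter names reflect the library at the time of writing and may change), pointing SwiftOpenAI at a vLLM server rather than OpenAI’s hosted service looks something like the following. The base URL and model name are placeholders, and whether an API key is required depends on how your vLLM deployment is secured.

        import SwiftOpenAI

        // Approximate sketch: factory and parameter names may differ between releases.
        // Placeholder base URL and model name; key handling depends on your deployment.
        let service = OpenAIServiceFactory.service(
            apiKey: "not-needed-for-internal-vllm",
            overrideBaseURL: "https://vllm.example.internal"
        )

        func askModel(_ prompt: String) async throws -> String {
            let parameters = ChatCompletionParameters(
                messages: [.init(role: .user, content: .text(prompt))],
                model: .custom("my-tuned-model")
            )
            let result = try await service.startChat(parameters: parameters)
            return result.choices.first?.message.content ?? ""
        }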

    Advantages of this approach

    • Ease of use: Simplifies integration with intuitive APIs that help reduce setup time for OpenAI-powered features like chatbots.
    • Modern Swift features: Fully supports async/await and structured concurrency, aligning with Swift’s evolution for clean, efficient code.
    • Streaming support: Handles advanced OpenAI features like real-time response streaming out of the box, which is a common requirement of a chatbot.
    • Active maintenance: Regularly updated (e.g., supports latest OpenAI features like Assistants API), ensuring compatibility with evolving API specs.
    • Lightweight: Focused scope keeps it leaner than broader networking libraries, minimizing overhead for OpenAI-specific projects.

    Disadvantages of this approach

    • Smaller community: With a smaller contributor community and fewer GitHub stars, it has less widespread adoption than Alamofire, potentially leading to slower issue resolution or fewer resources.
    • Documentation gaps: While improving, documentation might not be as comprehensive as what Apple and Alamofire offer.

    The full capabilities of SwiftOpenAI go far beyond what you will need in a chatbot; however, if you expect to use more of the OpenAI API over time, this might be the right choice for you.

    MacPaw OpenAI

    The MacPaw OpenAI project is another open source option; it is an OpenAI wrapper primarily maintained by MacPaw, a trusted company behind apps like CleanMyMac. The GitHub project has 2.4K stars and over 60 contributors. The project falls under the permissive MIT license.

    MacPaw OpenAI has a strong focus on ease of use and support for chat, completions, and multimodal inputs, making it a great choice for developers seeking a polished, well-supported OpenAI wrapper for a chatbot. However, it doesn’t offer a complete implementation of the OpenAI API specification, which matters if you eventually need those expanded capabilities.
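    Again as an orientation-only sketch (initializer and property names vary between releases of the package), configuring MacPaw’s OpenAI client against a vLLM host might look like the following; the host, scheme, token handling, and model name are placeholders for your deployment.

        import OpenAI

        // Approximate sketch: ChatQuery and result property names differ across package
        // versions. Placeholder host, scheme, token, and model name.
        let client = OpenAI(configuration: OpenAI.Configuration(
            token: "not-needed-for-internal-vllm",
            host: "vllm.example.internal",
            scheme: "https"
        ))

        func askModel(_ prompt: String) async throws {
            let query = ChatQuery(
                messages: [.user(.init(content: .string(prompt)))],
                model: "my-tuned-model"
            )
            let result = try await client.chats(query: query)
            // The first choice carries the assistant's reply; its exact Swift type
            // depends on the package version.
            if let reply = result.choices.first?.message.content {
                print(reply)
            }
        }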

    Advantages of this approach:

    • OpenAI-specific design: Tailored for OpenAI’s API, with structured methods for chat completions, embeddings, and image generation, which can accelerate coding.
    • Clean Swift API: Offers type-safe structs and modern Swift conventions, reducing errors and improving code readability.
    • Async/await integration: Fully supports Swift’s modern concurrency, enabling efficient, non-blocking API calls for responsive apps.
    • Streaming and multimodal: Supports streaming responses and multimodal inputs (e.g., images with vision models), enhancing chatbot interactivity.

    Disadvantages of this approach:

    • Mid-sized community: This project has less widespread adoption than Alamofire, but more than SwiftOpenAI.
    • Incomplete feature coverage: Might lag in supporting OpenAI features compared to SwiftOpenAI.
    • Documentation gaps: While clear, documentation can be less comprehensive than larger libraries, requiring more trial-and-error for complex use cases.

    Conclusion

    Integrating a finely tuned large language model served remotely by vLLM into your macOS and iOS applications opens up exciting possibilities for creating powerful, AI-driven experiences. An internal chatbot application is a fantastic way to get started on your AI application journey.

    When working with OpenAI-compatible APIs and Llama Stack APIs, you can opt for the flexibility of low-level HTTP communications with the Apple URLSession framework, or the streamlined networking of Alamofire.

    For building Llama Stack applications, you can currently use the project's standardized AI building blocks for inference, agents, and custom tool calling using the llama-stack-client-swift client. In the future, you should be able to access additional Llama Stack building blocks for safety, evals, telemetry, and more.

    Purpose-built OpenAI-compatible libraries like SwiftOpenAI or MacPaw’s OpenAI wrapper are additional options. Each offers unique strengths to suit your project’s needs. By carefully weighing factors like development speed, control, and scalability, you can choose the path that best aligns with your goals and resources.

    In future articles, I'll provide sample code showing how to use the options mentioned in this article. In the meantime, feel free to dive into the Llama Stack documentation to explore its standardized AI building blocks and Swift sample apps, or check out SwiftOpenAI’s sample app for a quick start with OpenAI’s API. If you prefer a robust networking foundation, experiment with Alamofire’s examples to simplify your REST calls, or review Apple’s extensive documentation for URLSession.

    The time to start building AI-enabled applications for macOS and iOS is here. Are you ready to get started?
