At rabbit, our goal is to create a novel consumer experience that is simple and intuitive.

We believe the best way to achieve this is a system that is cloud-driven, amenable to low-latency interactions, and compatible with the integration paths of existing consumer services. We've incorporated these beliefs into the various systems we've built, which together make up rabbit OS. rabbit OS has a high degree of control over the hardware form factors we have manufactured (and will manufacture) to interact with users. By using deep learning extensively, it can analyze and synthesize a wide range of multimodal information. rabbit OS is also the best host for the Large Action Model (LAM), which we developed to ease the integration of consumer services. Individual task handlers in our system can accept both a conventional API and an interface from LAM that interacts directly with user-facing consumer apps. This is achieved through "minion," a secure cloud-based infrastructure that enables human-AI collaboration in LAM-related use cases, from authentication to action execution.

System Overview

rabbit OS comprises three major components: (1) cloud services that process various types of user requests, (2) system software that operates rabbit devices (like r1), and (3) application software that encompasses the web presence of rabbit OS.

[Figure: rabbit OS system overview]

Our data model is heavily cloud-driven: the cloud services process the majority of user requests and decide which types of data to deliver to the user. This approach provides an ample amount of elastic compute to a user, allowing us to run various types of powerful workloads affordably (such as multimodal language models). It also enables resource sharing across clients: the web application and an r1 could refer to the same long-running task that resides in the cloud services.

The cloud services are also the source of truth for various types of user data. Different types of rabbit devices and clients can agree on a consolidated view of a user, even during periods of intermittent connection. Accumulated data over time opens opportunities for personalized experiences that are context- and memory-aware.

The on-device system software is designed around two principles: fast prototyping and seamless interaction. Our architecture is based on the Android Open Source Project (AOSP), which is commonly used in smartphones and tablets. This allows us to work easily with our manufacturing partners to orchestrate hardware resources (camera, screen, microphone) and enable various connectivity options (Bluetooth, Wi-Fi, 4G LTE) on our form factors. To eliminate non-essential interfaces, we disabled and removed the majority of the software in the original stack that makes up the traditional app-based mobile experience. In its place are software clients, deeply embedded within the on-device system software, that focus on quick invocation, flexibility of user input, and a server-driven display of information. The result is an experience that complements existing mobile usage patterns but is also standalone, requiring no tethered connections.

Finally, the web application ensures that rabbit's platforms and devices are not isolated from the rest of the user's computing environment: certain information can still be accessed, and certain actions (such as authentication) can still be performed, on existing platforms.

To anchor our exposition on rabbit OS in concrete characteristics, we've listed a few high-level observations on the problems we are trying to solve and how we are solving them:

Low-latency interaction is a reconciliation problem

Voice interactions are known to be slow in a synchronous client-server architecture. The change of modality when going from audio to text (and back) incurs both additional latency and a loss of context. For example, an ASR system in a traditional architecture needs to finish producing a transcription before any text processing system (such as an NLP algorithm) can start. The transcription itself is also limited by a lack of both temporal information (the user's past utterances) and contextual information (background noise that may indicate the user's surroundings, or the user's tone).

The nature of services and experiences rabbit OS is trying to provide implies that a good architecture needs to be highly stateful, needs to stream, and needs to pipeline:

  • Statefulness: Consumer activities, from ride booking to food delivery, move through multiple stages that a voice-based assistant needs to keep track of. Deciding which states to persist, cache, and refresh over time is a delicate choice, and it calls for a design that goes beyond a simple request-response scheme.
  • Streaming: Most components in voice interaction are streaming systems. An LLM generates results token by token, autoregressively, and speech is inherently ingested and synthesized front to back. A low-latency architecture should be as amenable to streams as possible.
  • Pipelining: Consumers perform multiple tasks over time, many with overlaps, and some that could last for hours with intermittent result updates. The assistant would need to be able to keep track of long-running tasks, process intermediate results as they come in, and reconcile potential races with ephemeral events that come and go.
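
To make the streaming point concrete, here is a minimal Python sketch; it is not rabbit's implementation. It assumes a hypothetical streaming ASR stage (`partial_transcripts`) that emits a growing transcript for each audio chunk, and a toy downstream stage (`detect_intent`) that consumes those partial transcripts as they arrive. All names and the keyword rules are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple


def partial_transcripts(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Hypothetical streaming ASR: yield a growing transcript per chunk."""
    words = []
    for chunk in audio_chunks:
        words.append(chunk)  # stand-in for decoding one audio chunk
        yield " ".join(words)


def detect_intent(transcripts: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Hypothetical downstream stage: guess an intent from each partial."""
    for text in transcripts:
        if "ride" in text:
            yield text, "book_ride"
        elif "pizza" in text:
            yield text, "order_food"
        else:
            yield text, "unknown"


def first_confident_intent(audio_chunks: Iterable[str]) -> Tuple[str, int]:
    """Return the first non-unknown intent and how many chunks it needed."""
    consumed = 0
    stream = detect_intent(partial_transcripts(audio_chunks))
    for consumed, (_text, intent) in enumerate(stream, start=1):
        if intent != "unknown":
            return intent, consumed  # stop early; later chunks are never read
    return "unknown", consumed
```

Because both stages are generators, the intent stage starts working on the first chunk instead of waiting for a finalized transcription. For the chunks `["book", "a", "ride", "to", "the", "airport"]`, the sketch settles on `book_ride` after three chunks.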

The state of various tasks we run on the cloud services is also dynamic. Tasks can finish, get started, and error out over time, all asynchronously.

Thus, at the center of our vision of low-latency interaction is a reconciliation problem: how to make the assistant aware of everything that is happening, efficiently and quickly, and still maintain a coherent interaction with the user.

This idea manifests in various parts of the cloud services. The majority of requests from a user go through a centralized reconciliation engine that can make quick decisions on dispatching, sustaining, and terminating tasks using very few computational resources. Our usage of Kubernetes to maintain virtualized environments for web and mobile apps for LAM is also abstracted into a stateful service that our individual task handlers can use. Multimedia experiences (e.g., music streaming) are also delivered over real-time communication (RTC) interfaces between the client and server.
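
One way to picture such a reconciliation engine is as a small state table that absorbs asynchronous task events and ignores stale ones. The Python sketch below is an illustration under assumptions, not a description of rabbit's engine: the `Event` shape, the per-task sequence number `seq`, and the status names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

TERMINAL = {"finished", "error"}


@dataclass(frozen=True)
class Event:
    task_id: str
    seq: int      # assumed per-task sequence number, increasing over time
    status: str   # assumed values: "started", "progress", "finished", "error"
    detail: str = ""


@dataclass
class Reconciler:
    """Keeps the latest known state per task, tolerating out-of-order events."""
    tasks: Dict[str, Event] = field(default_factory=dict)

    def apply(self, event: Event) -> bool:
        """Apply an event; return False if it is stale and was ignored."""
        current = self.tasks.get(event.task_id)
        if current is not None:
            # A finished or failed task never reopens, and old events
            # must not overwrite newer state.
            if current.status in TERMINAL or event.seq <= current.seq:
                return False
        self.tasks[event.task_id] = event
        return True

    def active(self) -> List[str]:
        """Tasks the assistant should still be tracking."""
        return sorted(
            task_id
            for task_id, event in self.tasks.items()
            if event.status not in TERMINAL
        )
```

In this sketch, a late "progress" event that arrives after a task's "finished" event is dropped, so the assistant's view of the task stays coherent regardless of arrival order.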

As systems like rabbit OS approach sub-second response latency, the “Holy Grail” of natural human-machine interaction lies in the range of 100-500 ms. End-to-end multimodal neural systems like GPT-4o (omni) are starting to test this range. As we iterate on our architecture, we constantly strive to make our interactions faster, while keeping an eye on such end-to-end systems so that we remain compatible with a potential future integration.

Neural language models are great orchestrators of user intention

rabbit OS is a hierarchical system: both heuristics and neural networks contribute to the processing of user intention and the execution of actions. As we develop the system, we constantly optimize the cost of prototyping, allowing it to evolve with the newest technical advances and constantly changing customer demands. Among neural models in particular, language models are becoming increasingly "agentic," both in planning and in recall from context (Llama 3, Claude Opus, GPT-4o). This increase in capability implies that the models can handle more complex requests on their own, reducing the effort of maintaining a complex system architecture. rabbit OS is built with this thesis in mind: neural language models are excellent orchestrators of user intention, and they will become increasingly capable. This thesis is reflected in two ways:

  • Our modular components in conversational AI and NLP tasks are implemented with quick turnaround: a certain component like intent classification can be quickly implemented with a tuned language model in hours to days (instead of weeks), then optimized over a longer period of time for cost efficiency and latency with a more specialized model.
  • We are continuously aggregating multiple components into one, executed by a single neural language model. rabbit OS is built to be flexible enough so that various modular components can be combined and abstracted away, with a single large language model running under the hood.
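
The first point can be sketched as a component with a stable interface and swappable backends. The Python below is a minimal illustration, not rabbit's code: `IntentClassifier`, `PromptedLLMClassifier`, `KeywordClassifier`, and `route` are hypothetical names, and `complete` stands in for whatever text-completion function a deployment would use.

```python
from typing import Callable, Dict, Protocol, Sequence


class IntentClassifier(Protocol):
    """The stable interface the rest of the system depends on."""
    def classify(self, utterance: str) -> str: ...


class PromptedLLMClassifier:
    """Quick first version: prompt a general language model for a label."""

    def __init__(self, complete: Callable[[str], str], labels: Sequence[str]) -> None:
        self.complete = complete  # assumed: any prompt -> text function
        self.labels = list(labels)

    def classify(self, utterance: str) -> str:
        prompt = (
            f"Classify the request into one of {self.labels}. "
            f"Reply with the label only.\nRequest: {utterance}"
        )
        answer = self.complete(prompt).strip().lower()
        return answer if answer in self.labels else "unknown"


class KeywordClassifier:
    """Stand-in for a later, cheaper specialized model."""

    def __init__(self, keywords: Dict[str, str]) -> None:
        self.keywords = keywords  # keyword -> label

    def classify(self, utterance: str) -> str:
        lowered = utterance.lower()
        for keyword, label in self.keywords.items():
            if keyword in lowered:
                return label
        return "unknown"


def route(utterance: str, classifier: IntentClassifier) -> str:
    """Callers only see the interface, so backends can be swapped freely."""
    return f"handler:{classifier.classify(utterance)}"
```

Under this design, replacing the prompted model with a specialized one is a change at the construction site only; `route` and everything downstream stay untouched.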

LAM makes consumer service integration seamless

Consumer services are primarily available as apps designed for current mobile operating systems and web platforms. For a new operating system, such as rabbit OS, to gain utility in the realm of personal computing, it must quickly gain access to these services. Existing integration paths, like application programming interfaces (APIs), offer a viable solution, with the OS dispatching tasks (some chained, others running in parallel) to call these APIs. The enhanced capability of neural language models, as discussed above, significantly aids this cause. However, many consumer services are not available as APIs. Some providers do not prioritize APIs because they do not aid their business model (e.g., ride-sharing and food delivery), while others have incentives against offering them (e.g., music streaming services).
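
The difference between chained and parallel dispatch can be sketched with Python's `asyncio`. This is an illustrative example only: `call_api` is a hypothetical stand-in for a real service call, and the service names are made up.

```python
import asyncio
from typing import Dict


async def call_api(name: str, payload: str, delay: float = 0.01) -> str:
    """Hypothetical API call; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name}({payload})"


async def plan_evening(city: str) -> Dict[str, str]:
    # Parallel: these two calls do not depend on each other,
    # so they are dispatched together and awaited as a group.
    weather, restaurants = await asyncio.gather(
        call_api("weather", city),
        call_api("restaurants", city),
    )
    # Chained: the booking needs the restaurant result, so it runs after.
    booking = await call_api("book_table", restaurants)
    return {"weather": weather, "restaurants": restaurants, "booking": booking}


if __name__ == "__main__":
    print(asyncio.run(plan_evening("LA")))
```

The parallel calls overlap their waiting time, while the chained call starts only once its input is available.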

The Large Action Model (LAM) bridges the gap caused by the lack of APIs for consumer services: it presents itself as an interface to a new OS (like rabbit OS) for various services that only have apps on existing OSes. This is accomplished by creating environments that run the apps as if a real user were accessing them through a UI, then driving them with neuro-symbolic methods. In an ideal future, the boundary between user-facing "apps" and program-facing "APIs" will disappear, with apps becoming more friendly to “agentic” algorithms and APIs increasingly invoked using natural language. We already see signs of both, with agents like LAM using accessibility labels that are stabilizing in consumer apps, and the GPTs ecosystem thriving under OpenAI's curation. We hope that LAM can be a catalyst to bring us closer to this future.

Minions offer a “sandbox” for both LAM and the user

The mental model of a minion is a remote computer that rabbit OS can dynamically spin up for tens of minutes at a time, so that users and the assistant can perform certain tasks.

The key functionalities of a minion are threefold: an isolated environment that can run an app with a user's authenticated state on existing operating systems, a series of techniques that make the environment mirror that of a mobile access point (e.g., as if a user holding an r1 were using the app), and an interface that allows LAM to drive the app. It is essentially a “sandbox.”

  • Minions are ephemeral. They are created on-demand and destroyed after usage. This ensures that any activity that takes place within the minion stays within the minion.
  • Minions are dedicated. No two users share the same minion. At a given time, rabbit OS could spin up multiple minions to serve a user.
  • Minions are also exposed to users through a remote control interface in certain scenarios. These include login, where users provide the authenticated states that are later instantiated on minions, and “teach mode”, where users can show rabbit OS how to perform certain tasks so that they can be reused in the future.

rabbit doesn't own users' credentials. During the login process, the minion looks for certain states that represent a user's authenticated session (e.g., cookies), then passes them to trusted third-party privacy vaults. This approach is similar to how online vendors store one's credit card details. The same credentials are temporarily restored to a consumer app when a minion is spun up, allowing the LAM to work with an authenticated app for the lifetime of that minion. The minion provides a safe, private environment where users and the LAM can collaborate. Users and the LAM co-own this environment.
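
The lifecycle described above (create on demand, restore the session from a vault, destroy after use) can be sketched as a Python context manager. This is a toy model under stated assumptions, not rabbit's infrastructure: `Vault` is an in-memory stand-in for a third-party privacy vault, and `Minion` and `minion_for` are hypothetical names.

```python
import uuid
from contextlib import contextmanager
from typing import Dict, Iterator, Optional, Tuple


class Vault:
    """In-memory stand-in for a third-party privacy vault (assumption)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], dict] = {}

    def put(self, user_id: str, service: str, session_state: dict) -> None:
        self._store[(user_id, service)] = dict(session_state)

    def get(self, user_id: str, service: str) -> dict:
        return dict(self._store[(user_id, service)])  # hand out a copy


class Minion:
    """Toy model of an isolated, single-user environment."""

    def __init__(self, user_id: str) -> None:
        self.minion_id = uuid.uuid4().hex
        self.user_id = user_id
        self.session: Optional[dict] = None
        self.alive = True

    def restore_session(self, state: dict) -> None:
        self.session = state

    def destroy(self) -> None:
        self.session = None  # nothing outlives the minion
        self.alive = False


@contextmanager
def minion_for(user_id: str, service: str, vault: Vault) -> Iterator[Minion]:
    """Spin up a dedicated minion, restore the session, always tear down."""
    minion = Minion(user_id)
    try:
        minion.restore_session(vault.get(user_id, service))
        yield minion
    finally:
        minion.destroy()
```

The `finally` clause models the ephemeral property: the restored session is wiped even if the task inside the `with` block fails, and each call creates a fresh minion rather than reusing one.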

Mobile hardware enables edge capabilities and location-aware experiences

We build the software system stack to ensure that we can prototype as quickly as possible with the hardware stacks we are operating on. We are embracing an open-source ecosystem (AOSP) and rolling out our own modifications to reduce platform risk with existing distribution platforms. This has two major benefits:

  • Hand-off of certain compute: The majority of requests are processed on the server, but a dedicated mobile platform allows specialized local models to handle some of the processing during periods of poor connectivity. Local processing also offers privacy benefits.
  • Location and user awareness: Mobile hardware can provide cloud services with the information needed to unlock features that must be location- and user-aware. For example, Uber and DoorDash offer experiences that are tied to the customer's physical location.
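
The compute hand-off in the first point can be sketched as a simple cloud-first, local-fallback dispatcher. The Python below is illustrative only: `handle_request`, the `cloud` and `local` callables, and the `online` flag are assumptions, not part of rabbit OS.

```python
from typing import Callable, Tuple


def handle_request(
    text: str,
    cloud: Callable[[str], str],
    local: Callable[[str], str],
    online: bool,
) -> Tuple[str, str]:
    """Prefer the cloud model; fall back to an on-device model if needed.

    Returns a (backend, result) pair so callers can see which path ran.
    """
    if online:
        try:
            return "cloud", cloud(text)
        except ConnectionError:
            pass  # connection dropped mid-request; fall through to local
    return "local", local(text)
```

A caller would pass in a cloud client and a smaller on-device model; the fallback path runs both when the device knows it is offline and when a request fails partway through.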

Future Outlook

We believe the future of mobile computing should reduce the touch time for users to connect to services. App-based operating systems have provided a working formula where service providers develop apps that immediately have distribution, and users can choose the apps they would like to interact with through the pre-installed stores on the hardware.

Meanwhile, machine intelligence is becoming radically more capable, affordable, and efficient. This opens up the possibility that hardware can remove the last mile of friction in an app-based operating system, so that intentions no longer need to be repetitively and implicitly expressed through UI interactions like cursor movements. Such a system would be able to both understand you in a fuzzy way (natural language) and act on your behalf. It is personal, as it learns by working with you over time.

rabbit OS is built to accompany the journey of scaling intelligence. As the models become better, the system will become better. And as apps become more friendly and interoperable, we will happily adopt them as integrations and skills our assistant can use.