#architecture #data

idea

There are two main ways of ingesting data: streaming and batch. The general direction for data-intensive applications is to separate read and write systems, a pattern known as CQRS[3]. With CQRS, command systems are optimized to accept high volumes of input data, while query systems are purpose-built for the use case that needs the data. Data usually gets duplicated between the systems, each storing it in a manner optimized for its purpose: ingestion systems store it as batches or logs, query systems store it in a structured or indexed form.
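A minimal sketch of that split (all names hypothetical): the write side only appends to a log, while the read side maintains a projection shaped for its query. The same data exists twice, once per purpose.

```python
from collections import defaultdict

class CommandSide:
    """Write-optimized: just append, no indexing."""
    def __init__(self):
        self.log = []  # ingestion stores events as an append-only log

    def accept(self, event):
        self.log.append(event)

class QuerySide:
    """Read-optimized: a structure purpose-built for the query pattern."""
    def __init__(self):
        self.totals = defaultdict(int)  # e.g. running total per user

    def project(self, event):
        self.totals[event["user"]] += event["amount"]

    def total_for(self, user):
        return self.totals[user]

writes, reads = CommandSide(), QuerySide()
for e in [{"user": "a", "amount": 3}, {"user": "a", "amount": 4}]:
    writes.accept(e)
    reads.project(e)  # projection kept in sync with the log
```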

Lambda architecture[1] separates ingestion into two kinds: streaming and batching. Streaming handles continuously incoming messages; it is traditionally "push" and continuous: external systems continuously push data into the ingestion system. Batch processing handles batches of messages arriving at once; it is traditionally "pull" and scheduled: the ingestion system queries external systems for data on a given schedule (although external systems might instead push batched data on a schedule).
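The push/pull distinction can be sketched as two entry points feeding the same store (hypothetical names; `fetch` stands in for a scheduled query to an external system):

```python
store = []

def on_message(msg):
    """Streaming path: external systems push events in continuously."""
    store.append(msg)

def run_batch_job(fetch):
    """Batch path: on a schedule, we pull a whole batch at once."""
    batch = fetch()
    store.extend(batch)

on_message({"id": 1})                           # continuous push
run_batch_job(lambda: [{"id": 2}, {"id": 3}])   # scheduled pull
```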

Kappa architecture[2] is an evolution of lambda architecture, built on the idea that a batch is just an out-of-order stream whose events/messages/elements all arrive at once. It relies on an immutable event log (stream) for data storage, and data is streamed through the log into auxiliary serving systems. Kappa architecture requires making stream processing idempotent and capable of handling out-of-order items, and re-transforming batches back into streams. This allows a single system to process everything, which in turn lets data processing pipelines such as complex event processing run over batches as well. It is also argued to simplify systems.
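A toy sketch of those two requirements (hypothetical names, not any real framework): deduplicating by event id makes reprocessing idempotent, and last-writer-wins by event timestamp tolerates out-of-order arrival, so a batch can simply be replayed through the same processor as live events.

```python
class Processor:
    def __init__(self):
        self.seen = set()   # processed event ids: makes replay idempotent
        self.state = {}     # serving view: latest event per key

    def process(self, event):
        if event["id"] in self.seen:
            return  # duplicate delivery is a no-op
        self.seen.add(event["id"])
        # last-writer-wins by event timestamp handles out-of-order items
        current = self.state.get(event["key"])
        if current is None or event["ts"] >= current["ts"]:
            self.state[event["key"]] = event

p = Processor()
live = [{"id": 1, "key": "x", "ts": 2, "val": "new"}]
batch = [{"id": 2, "key": "x", "ts": 1, "val": "old"},   # older, out of order
         {"id": 1, "key": "x", "ts": 2, "val": "new"}]   # duplicate of live
for e in live + batch:  # the batch is just replayed as a stream
    p.process(e)
```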

links

Local-first software uses an event log and CRDTs to handle synchronization between clients.
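As a toy illustration of why CRDTs fit here (a hand-rolled last-writer-wins register, not a real CRDT library): because merging is commutative, replicas converge to the same state regardless of the order in which they sync.

```python
def merge(a, b):
    """a, b are (timestamp, value) writes; the newer write wins."""
    return a if a[0] >= b[0] else b

client1 = (5, "hello")  # wrote "hello" at logical time 5
client2 = (7, "world")  # wrote "world" at logical time 7

# merge order does not matter: both clients converge on the same value
assert merge(client1, client2) == merge(client2, client1) == (7, "world")
```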

references

- [1] Nathan Marz, "How to beat the CAP theorem" (lambda architecture)
- [2] Jay Kreps, "Questioning the Lambda Architecture" (kappa architecture)
- [3] Martin Fowler, "CQRS"