Introduction

In my 2019 talk "Building Streaming Applications" at the Dataworks Summit I spoke about using Apache Avro for evolving the schema of the data that is transported via a stream.

Over the years several people have asked for an example of how to apply these ideas in practice.

So what is this project?

This project is intended to be an example project.

IMPORTANT:

This project is intended to show A POSSIBLE WAY of structuring a project that uses records serialized with Apache Avro in an effective way. There are many alternative ways of doing this.
I created this mostly from scratch to provide an "as clean as possible" showcase of how to make this work. This project was created specifically as this example. This code has NOT been used in production.

If you build it: it will fail

Yes, this code is deliberately left in a state where the build fails. I did this to show that adding a new field (Session) without a default value to Measurement.avdl fails the schema evolution check that runs during the build.
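
To make the rule behind this failure concrete, here is a minimal sketch in plain Java. It is NOT Avro's API and not the check this project runs. It only models the one rule from the Avro specification that matters here: if the reader's record schema has a field that the writer's schema lacks, that field needs a default value. The class name, the map-based "schema" and the timestamp field are hypothetical. Only the Session field comes from this project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of a record schema: field name -> optional default value. Not Avro's API. */
final class EvolutionCheckSketch {

    /**
     * Returns true when a reader using newSchema can read data written with oldSchema:
     * every field the reader expects must exist in the writer schema or carry a default.
     * Type promotion, aliases and nested records are ignored in this sketch.
     */
    static boolean canRead(Map<String, Optional<String>> newSchema,
                           Map<String, Optional<String>> oldSchema) {
        for (Map.Entry<String, Optional<String>> field : newSchema.entrySet()) {
            boolean presentInOld = oldSchema.containsKey(field.getKey());
            boolean hasDefault = field.getValue().isPresent();
            if (!presentInOld && !hasDefault) {
                return false; // old records cannot supply a value for this field
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Optional<String>> released = new LinkedHashMap<>();
        released.put("timestamp", Optional.empty());   // hypothetical existing field

        Map<String, Optional<String>> broken = new LinkedHashMap<>(released);
        broken.put("session", Optional.empty());       // new field, no default

        Map<String, Optional<String>> fixed = new LinkedHashMap<>(released);
        fixed.put("session", Optional.of("null"));     // new field with a default

        System.out.println(canRead(broken, released)); // prints false
        System.out.println(canRead(fixed, released));  // prints true
    }
}
```

In a real build you would let Avro itself do this comparison on the actual schemas, for example with its SchemaCompatibility class, instead of a hand-written model like this one.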

Usecases

In general, this approach helps to bring schemas to an otherwise schemaless environment.

I have seen two major use cases where this can help.

Streaming

Most streaming transport systems (like Apache Kafka) only accept a byte array as the data. By serializing the records with Avro you can have independent rolling upgrades and canary releases of the producing and consuming systems. With these ideas, such upgrades no longer incur any downtime.
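
For a consumer to pick the right writer schema, each byte array has to say which schema it was written with. This README does not specify the wire format, so the following is only a sketch under the assumption that something like the "single-object encoding" from the Avro specification is used: a two-byte marker (C3 01), the 8-byte little-endian schema fingerprint, then the Avro binary payload. The class and method names are hypothetical, and the payload is treated as opaque bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch of a single-object style header: marker + 64-bit schema fingerprint + payload. */
final class FramingSketch {
    private static final byte MARKER_0 = (byte) 0xC3;
    private static final byte MARKER_1 = (byte) 0x01;
    static final int HEADER_SIZE = 10; // 2 marker bytes + 8 fingerprint bytes

    /** Prefixes an already Avro-encoded payload with the marker and the schema fingerprint. */
    static byte[] frame(long schemaFingerprint, byte[] avroPayload) {
        return ByteBuffer.allocate(HEADER_SIZE + avroPayload.length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .put(MARKER_0).put(MARKER_1)
                .putLong(schemaFingerprint)
                .put(avroPayload)
                .array();
    }

    /** Reads the fingerprint of the writer schema back from a framed message. */
    static long readFingerprint(byte[] message) {
        if (message.length < HEADER_SIZE || message[0] != MARKER_0 || message[1] != MARKER_1) {
            throw new IllegalArgumentException("Not a framed Avro message");
        }
        return ByteBuffer.wrap(message, 2, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }
}
```

Avro's Java library ships its own message encoder and decoder classes for this format, so in practice you would probably use those rather than framing by hand.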

Databases

Systems like Apache HBase and Google BigTable are databases that do not have any kind of schema: All values are byte arrays.

Here you can use this to serialize your records and persist them in HBase. Then when a new field is needed or an old field is deprecated there is no need to update any of the old records.

Because the application carries all historical schemas, the Avro schema evolution will take care of reading the old records into the new shape.

Basic design

We have a separation between the Schema, the producers and the consumers.
- This example project is ONLY the Schema.
- This schema
  - is published to a schema database (key = 64-bit long, value = JSON string)
  - is published to an artifact repository as a jar file
    - Both the producers and consumers should have this Schema project as a dependency.
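
The 64-bit key of the schema database is a fingerprint of the schema. I am assuming here that it is the CRC-64-AVRO fingerprint defined in the Avro specification. The sketch below is a plain Java version of the algorithm from that specification. The class name is hypothetical. Note that Avro computes the fingerprint over the Parsing Canonical Form of the schema, while this sketch simply hashes whatever bytes it is given.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of the CRC-64-AVRO fingerprint algorithm as given in the Avro specification. */
final class Crc64AvroSketch {
    /** Fingerprint of the empty input, also used as the polynomial for the table. */
    private static final long EMPTY = 0xc15d213aa4d7a795L;
    private static final long[] TABLE = new long[256];

    static {
        // Precompute the effect of each possible byte value.
        for (int i = 0; i < 256; i++) {
            long fp = i;
            for (int j = 0; j < 8; j++) {
                fp = (fp >>> 1) ^ (EMPTY & -(fp & 1L));
            }
            TABLE[i] = fp;
        }
    }

    static long fingerprint64(byte[] data) {
        long fp = EMPTY;
        for (byte b : data) {
            fp = (fp >>> 8) ^ TABLE[(int) (fp ^ b) & 0xff];
        }
        return fp;
    }

    /** Convenience overload; the caller should pass the canonical form of the schema. */
    static long fingerprint64(String schemaJson) {
        return fingerprint64(schemaJson.getBytes(StandardCharsets.UTF_8));
    }
}
```

With the Avro library on the classpath, SchemaNormalization.parsingFingerprint64 does both the canonicalization and the hashing, so you would normally call that instead.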

Usage

Producers
- have the current schema from the jar file.
Consumers
- have the current schema from the jar file
- have all previous schema versions, which are also in the jar file.
- can get newer schema versions from the schema registry.
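
The consumer side of the list above can be sketched as a lookup that first checks the schemas bundled in the jar and only asks the schema registry for fingerprints it does not know. Everything named here is hypothetical: the class, and the registry modeled as a simple function from fingerprint to JSON. A real registry client will look different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.LongFunction;

/** Sketch: resolve a writer schema by fingerprint, bundled history first, registry second. */
final class SchemaLookupSketch {
    private final Map<Long, String> known = new HashMap<>();
    private final LongFunction<Optional<String>> registry; // hypothetical registry client

    SchemaLookupSketch(Map<Long, String> bundledHistory,
                       LongFunction<Optional<String>> registry) {
        this.known.putAll(bundledHistory);
        this.registry = registry;
    }

    /** Returns the schema JSON for the fingerprint, or empty when nobody knows it. */
    Optional<String> find(long fingerprint) {
        String local = known.get(fingerprint);
        if (local != null) {
            return Optional.of(local);
        }
        // Written by a newer producer: ask the registry and remember the answer.
        Optional<String> remote = registry.apply(fingerprint);
        remote.ifPresent(json -> known.put(fingerprint, json));
        return remote;
    }
}
```

Caching the registry answer means the remote call happens once per new schema version rather than once per record.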

How this works

During the build/development
1. The current schema is verified to be evolution safe against the full history of released schemas. The build fails if it is not.
2. The current JSON of the Avro schema is written to a temporary file. I abused a JUnit test for this.

Releasing the software is done with the maven-release-plugin.
1. Run mvn release:prepare: the temporary file is stored in the schema-history and committed as one of the resources in the jar. All historical schemas are combined into a single file, SchemaHistory.csv, which makes reading the resource simpler and more reliable. The format of these files is <schema fingerprint>|<json string of the schema>
2. Run mvn release:perform: the jar is built and deployed to the repository and the schema is deployed to the schema registry. NOTE: These two deploy steps have been disabled/mocked in this example repository, so nothing will really be deployed.
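
Reading such a history file back can be sketched as follows. The assumptions are mine and not confirmed by this README: one entry per line, the fingerprint written as a signed decimal long, and the schema JSON on a single line. The class name is hypothetical. The split happens at the first | only, because the JSON itself may contain that character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: parse "<schema fingerprint>|<json string of the schema>" lines into a map. */
final class SchemaHistorySketch {

    /** Takes the full file content, e.g. after loading the resource from the classpath. */
    static Map<Long, String> parse(String content) {
        Map<Long, String> history = new LinkedHashMap<>(); // keeps release order
        for (String line : content.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            int separator = line.indexOf('|');
            if (separator < 0) {
                throw new IllegalArgumentException("Missing '|' separator: " + line);
            }
            // Assumption: the fingerprint is stored as a signed decimal number.
            long fingerprint = Long.parseLong(line.substring(0, separator).trim());
            history.put(fingerprint, line.substring(separator + 1));
        }
        return history;
    }
}
```

The resulting map has the same shape as the schema database (64-bit key, JSON value), so a consumer can treat both sources the same way.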

Notable other things

The Avro schema IDLs are put through a filter during the build before being generated into Java code by the avro-maven-plugin. This way we can dynamically insert the version of the schema into the schema itself.

LICENSE

Avro schema evolution example.
Copyright (C) 2022 Niels Basjes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
