📢 Special thanks to our speaker: Karim Lamouri
1.1 Avro - What is it?
- Avro is used for serialization and deserialization of payloads
- More compact than JSON
- Is a container file: it holds both the schema and the payload
- Allows RPC calls
- Fast serialization and smaller than JSON (the JSON keys are not repeated for every record, since the schema is written once at the top of the file)
- Allows schema documentation
- Format easy to leverage in Spark or ETL pipelines

You can always deserialize an Avro file, as the schema is embedded in the file itself!
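As a quick illustration of the embedded schema, here is a minimal sketch (the file name `events.avro` is hypothetical) that reads an Avro container file with the standard Avro Java library without supplying any schema:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroFile {
    public static void main(String[] args) throws Exception {
        // No reader schema is passed in: the container file carries it in its header
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("events.avro"), new GenericDatumReader<GenericRecord>())) {
            System.out.println("Embedded schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```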
1.2 Avro - Basics
- Avro Schema
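As an illustration, a minimal Avro schema written in JSON (the record and field names are made up for the example):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.avro",
  "doc": "A user event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null, "doc": "Optional e-mail address"}
  ]
}
```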
1.3 Avro - Avro Protocol
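A minimal sketch of an Avro protocol (.avpr) with a single hypothetical RPC message, reusing the `User` record from above:

```json
{
  "protocol": "UserService",
  "namespace": "com.example.avro",
  "types": [
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }
  ],
  "messages": {
    "getUser": {
      "request": [{"name": "id", "type": "long"}],
      "response": "User"
    }
  }
}
```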
Avro Protocol allows the definition of RPC functions, but the protocol is still defined in JSON, which is not optimal for reading and documenting.
1.4 Avro - Avro IDL - Higher level representation
- The Avro IDL was introduced by the Apache Avro project
- Much shorter to write
- Similar to C structures
- Easy definition of default values
- Easy documentation
- Allows imports
- Automatic generation of Java POJOs

POJOs (or classes generated from Avro) enforce strong consistency on which fields are mandatory, optional or null. This is especially important in a microservice world where events are at the heart of the data flow.
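As a sketch, the hypothetical `UserService` protocol from above expressed in Avro IDL (.avdl), noticeably shorter than the JSON form:

```
protocol UserService {
  record User {
    long id;
    union { null, string } email = null;
  }

  User getUser(long id);
}
```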
- Example with default values and comments:
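A sketch of the same kind of record with default values and doc comments (all field names and defaults are illustrative):

```
protocol UserService {
  /** A user of the platform. */
  record User {
    /** Unique identifier of the user. */
    long id;

    /** Optional e-mail address; null when unknown. */
    union { null, string } email = null;

    /** ISO country code, with a default value. */
    string country = "FR";
  }
}
```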
- Example with imports
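A sketch of IDL imports (the file names are hypothetical); Avro IDL can import other IDL files, JSON protocols and JSON schemas:

```
protocol UserService {
  import idl "common.avdl";
  import protocol "billing.avpr";
  import schema "address.avsc";

  record User {
    long id;
    // Record assumed to be defined in one of the imported files
    Address address;
  }
}
```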
2.1 Schema Registry
- Schema Registry is an open-source serving layer for Avro schemas provided by the Confluent Platform
- Schema Registry is in charge of:
  - Registering schemas
  - Returning the ID of the schema
- A schema is a living object (it changes over time, new fields get added / removed / updated)
- In the Kafka world, Avro schemas are not sent along with the payload; a "pointer" to the schema is passed down instead
- Kafka libraries (SerDe or librdkafka) prepend 5 bytes (the "pointer") to the Avro binary to inject schema information
| Byte 0 | Bytes 1-4 | Bytes 5-… |
|---|---|---|
| Magic byte | 4-byte schema ID as returned by Schema Registry | Serialized data for the specified schema format |
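A minimal sketch of what that 5-byte header looks like from the consumer side, assuming `payload` is the raw record value produced by the Confluent Avro serializer:

```java
import java.nio.ByteBuffer;

public class WireFormat {
    // Splits a Confluent-framed record value into its header and the Avro binary
    public static void inspect(byte[] payload) {
        ByteBuffer buffer = ByteBuffer.wrap(payload);
        byte magicByte = buffer.get();   // Byte 0: magic byte (0x0)
        int schemaId = buffer.getInt();  // Bytes 1-4: schema ID returned by Schema Registry
        byte[] avroBinary = new byte[buffer.remaining()];
        buffer.get(avroBinary);          // Bytes 5-…: Avro-serialized data
        System.out.println("magic=" + magicByte + ", schemaId=" + schemaId
            + ", avroBytes=" + avroBinary.length);
    }
}
```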
2.2 Schema Evolution & Compatibility
| Compatibility Type | Changes allowed | Check against which schemas | Upgrade first |
|---|---|---|---|
| BACKWARD | Delete fields / Add optional fields | Last version | Consumers |
| BACKWARD_TRANSITIVE | Delete fields / Add optional fields | All previous versions | Consumers |
| FORWARD | Add fields / Delete optional fields | Last version | Producers |
| FORWARD_TRANSITIVE | Add fields / Delete optional fields | All previous versions | Producers |
| FULL | Add optional fields / Delete optional fields | Last version | Any order |
| FULL_TRANSITIVE | Add optional fields / Delete optional fields | All previous versions | Any order |
| NONE | All changes are accepted | Compatibility checking disabled | Depends |
- Tips for easy schema evolution in a non-TRANSITIVE world:
  - fields should be nullable
  - fields should have default values (the default value can be null as well)
  - fields that are nullable and have default values can be removed
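For instance (hypothetical field), adding a nullable field with a null default is a BACKWARD-compatible change: consumers using the new schema can still read records written with the old one, and the new field simply takes its default:

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "country", "type": ["null", "string"], "default": null}
  ]
}
```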
2.3 Schema Evolution & Kafka-connect performance
- Kafka-connect is open-source software from the Confluent Platform
- Schema compatibility can directly impact Kafka-connect performance

Story behind an outage resulting in 8 hours' worth of data lost… Back in the day, a change was made to a Kafka event whose schema compatibility was set to NONE (when it should have been set to FULL, or at least BACKWARD). As a result of this event modification, Kafka-connect performance started dropping, resulting in a lot of small files being uploaded to S3 and preventing Spark jobs from running efficiently.