📢 Special thanks to our speaker: Karim Lamouri
1.1 Avro - What is it?
- Avro is used for serialization and deserialization of payloads
- More compact than JSON
- Is a container file: it has both the schema and the payload
- Allows RPC calls
- Fast serialization and smaller output than JSON (the keys are not repeated for every record, as the schema sits at the top of the file)
- Allows schema documentation
- Format easy to leverage in Spark or ETL pipelines

You can always deserialize an Avro file, as the schema is embedded in the file itself!
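A minimal Java sketch of the container-file idea, using the Apache Avro library (the `User` schema and the `users.avro` file name are made up for illustration): writing stores the schema once in the file header, and reading needs no schema at all because it is taken from the file.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerFileDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema used only for this example
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        File file = new File("users.avro");

        // Write: the schema goes once into the file header,
        // each record then only carries values (no repeated keys as in JSON)
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("email", "someone@example.com");
            writer.append(user);
        }

        // Read: no schema is supplied, it is read back from the file itself
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```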
1.2 Avro - Basics
- Avro Schema
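An Avro schema is defined in JSON. A minimal example of what it can look like (the `User` record below is hypothetical, not taken from the talk):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A user of the platform",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```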
1.3 Avro - Avro Protocol
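A sketch of an Avro Protocol in its JSON form (the `UserService` protocol and its `getUser` message are hypothetical, for illustration only):

```json
{
  "protocol": "UserService",
  "namespace": "com.example",
  "types": [
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }
  ],
  "messages": {
    "getUser": {
      "request": [{"name": "id", "type": "long"}],
      "response": "User"
    }
  }
}
```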
Avro Protocol allows the definition of RPC functions, but the protocol is still defined in JSON, which is not ideal for reading and documenting.
1.4 Avro - Avro IDL - Higher-level representation
- The Avro IDL was introduced by the Apache Avro project
- Much shorter to write
- Similar to C structures
- Easy definition of default values
- Easy documentation
- Allows imports
- Automatic generation of Java POJOs

POJOs (or generated classes) from Avro enforce strong consistency on which fields are mandatory/optional or null. This is especially important in a microservice world where events are at the heart of the data flow.
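The same hypothetical `UserService` protocol, rewritten in Avro IDL (roughly equivalent to the JSON protocol shown above):

```avdl
@namespace("com.example")
protocol UserService {
  record User {
    long id;
    union { null, string } email = null;
  }

  User getUser(long id);
}
```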
- Example with default values and comments:
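A sketch of defaults and doc comments in IDL (the fields and their defaults are made up for the example):

```avdl
@namespace("com.example")
protocol UserService {
  /** A user of the platform. */
  record User {
    /** Unique identifier of the user. */
    long id;
    /** Optional email address, null when unknown. */
    union { null, string } email = null;
    /** Country code, defaults to "FR". */
    string country = "FR";
  }
}
```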
- Example with imports:
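A sketch of the import mechanism (the imported file names and the `Address` record are hypothetical). IDL can import other IDL files, JSON protocols, and JSON schemas:

```avdl
@namespace("com.example")
protocol UserEvents {
  import idl "common.avdl";        // another IDL file
  import protocol "billing.avpr";  // a JSON Avro protocol
  import schema "address.avsc";    // a JSON Avro schema

  record UserCreated {
    long userId;
    Address address;  // record defined in one of the imported files
  }
}
```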
2.1 Schema Registry
- Schema Registry is an open-source serving layer for Avro schemas provided by the Confluent Platform
- Schema Registry is in charge of:
  - Registering schemas
  - Returning the ID of the schema
- A schema is a living object (it changes over time: new fields get added / removed / updated)
- In a Kafka world, Avro schemas are not sent along with the payload; only a "pointer" to the schema is passed down.
- Kafka libraries (SerDe or librdkafka) prepend 5 bytes (the "pointer") to the Avro binary to inject the schema information:
| Byte 0 | Byte 1-4 | Byte 5-… |
|---|---|---|
| Magic byte (currently always 0) | 4-byte schema ID, as returned by Schema Registry | Data serialized with the referenced schema |
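A minimal Java sketch of this framing (not the Confluent serializer itself, just an illustration of the byte layout; the `schemaId` would normally come back from Schema Registry when the schema is registered or looked up):

```java
import java.nio.ByteBuffer;

// Illustrative only: build and parse the 5-byte Schema Registry framing
// around an already-serialized Avro payload.
public class WireFormat {
    private static final byte MAGIC_BYTE = 0x0;

    static byte[] frame(int schemaId, byte[] avroBinary) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + 4 + avroBinary.length);
        buffer.put(MAGIC_BYTE);   // Byte 0: magic byte
        buffer.putInt(schemaId);  // Bytes 1-4: schema ID (big-endian)
        buffer.put(avroBinary);   // Bytes 5-…: Avro binary for that schema
        return buffer.array();
    }

    static int extractSchemaId(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();   // ID to resolve against Schema Registry
    }
}
```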
2.2 Schema Evolution & Compatibility
| Compatibility Type | Changes allowed | Check against which schemas | Upgrade first |
|---|---|---|---|
| BACKWARD | Delete fields / Add optional fields | Last version | Consumers |
| BACKWARD_TRANSITIVE | Delete fields / Add optional fields | All previous versions | Consumers |
| FORWARD | Add fields / Delete optional fields | Last version | Producers |
| FORWARD_TRANSITIVE | Add fields / Delete optional fields | All previous versions | Producers |
| FULL | Add optional fields / Delete optional fields | Last version | Any order |
| FULL_TRANSITIVE | Add optional fields / Delete optional fields | All previous versions | Any order |
| NONE | All changes are accepted | Compatibility checking disabled | Depends |
- Tips for easy schema evolution in a non-TRANSITIVE world (see the sketch below):
  - make fields nullable
  - give fields default values (the default value can be null as well)
  - fields that are nullable and have default values can be removed
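For example, continuing the hypothetical `User` record used earlier: adding a nullable `country` field with a `null` default is both backward and forward compatible, so it passes a FULL check.

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "country", "type": ["null", "string"], "default": null}
  ]
}
```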
2.3 Schema Evolution & Kafka Connect performance
- Kafka Connect is open-source software from the Confluent Platform
- Schema compatibility can directly impact Kafka Connect performance

The story behind an outage resulting in 8 hours' worth of data being lost… Back in the day, a change was made to a Kafka event whose schema compatibility was set to NONE (it should have been set to FULL, or at least BACKWARD). As a result of this event modification, Kafka Connect performance started dropping, resulting in a lot of small files being uploaded to S3 and preventing Spark jobs from running efficiently.