📢 Special thanks to our speaker: Karim Lamouri
1.1 Avro - What is it?
- Avro is used for serialization and deserialization of payloads
- More compact than JSON
- Is a container file: it has both the schema and the payload
- Allows RPC calls
- Fast serialization and smaller output than JSON (the keys are not repeated for every record, as the schema sits at the top of the file)
- Allows schema documentation
- Format easy to leverage in Spark or ETL pipelines

You can always deserialize an Avro file, as the schema is embedded in the file itself!
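A minimal Java sketch of the container-file idea, using the Apache Avro library (the `User` schema and the `users.avro` file name are made up for illustration): writing stores the schema once in the file header, and reading needs no schema at all because it is taken from the file.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerFileDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema used only for this example
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        File file = new File("users.avro");

        // Write: the schema goes once into the file header,
        // each record then only carries values (no repeated keys as in JSON)
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("email", "someone@example.com");
            writer.append(user);
        }

        // Read: no schema is supplied, it is read back from the file itself
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```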
1.2 Avro - Basics
- Avro Schema
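An Avro schema is defined in JSON. A minimal example of what it can look like (the `User` record below is hypothetical, not taken from the talk):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A user of the platform",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```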
1.3 Avro - Avro Protocol
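A sketch of an Avro Protocol in its JSON form (the `UserService` protocol and its `getUser` message are hypothetical, for illustration only):

```json
{
  "protocol": "UserService",
  "namespace": "com.example",
  "types": [
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }
  ],
  "messages": {
    "getUser": {
      "request": [{"name": "id", "type": "long"}],
      "response": "User"
    }
  }
}
```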
Avro Protocol allows the definition of RPC functions, but the protocol is still defined in JSON, which is not ideal for reading and documenting.
1.4 Avro - Avro IDL - Higher-level representation
- The Avro IDL was introduced by the Apache Avro project
- Much shorter to write
- Similar to C structures
- Easy definition of default values
- Easy documentation
- Allows imports
- Automatic generation of Java POJOs

POJOs (or generated classes) from Avro enforce strong consistency on which fields are mandatory/optional or null. This is especially important in a microservice world where events are at the heart of the data flow.
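The same hypothetical `UserService` protocol, rewritten in Avro IDL (roughly equivalent to the JSON protocol shown above):

```avdl
@namespace("com.example")
protocol UserService {
  record User {
    long id;
    union { null, string } email = null;
  }

  User getUser(long id);
}
```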
- Example with default values and comments:
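A sketch of defaults and doc comments in IDL (the fields and their defaults are made up for the example):

```avdl
@namespace("com.example")
protocol UserService {
  /** A user of the platform. */
  record User {
    /** Unique identifier of the user. */
    long id;
    /** Optional email address, null when unknown. */
    union { null, string } email = null;
    /** Country code, defaults to "FR". */
    string country = "FR";
  }
}
```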
- Example with imports:
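A sketch of the import mechanism (the imported file names and the `Address` record are hypothetical). IDL can import other IDL files, JSON protocols, and JSON schemas:

```avdl
@namespace("com.example")
protocol UserEvents {
  import idl "common.avdl";        // another IDL file
  import protocol "billing.avpr";  // a JSON Avro protocol
  import schema "address.avsc";    // a JSON Avro schema

  record UserCreated {
    long userId;
    Address address;  // record defined in one of the imported files
  }
}
```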
2.1 Schema Registry
- Schema Registry is an open-source serving layer for Avro schemas provided by the Confluent Platform
- Schema Registry is in charge of:
  - Registering schemas
  - Returning the ID of the schema
- A schema is a living object (it changes over time: new fields get added / removed / updated)
- In a Kafka world, Avro schemas are not sent along with the payload; only a "pointer" to the schema is passed down.
- Kafka libraries (SerDe or librdkafka) prepend 5 bytes (the "pointer") to the Avro binary to inject the schema information:
| Byte 0 | Byte 1-4 | Byte 5-… |
|---|---|---|
| Magic byte (currently always 0) | 4-byte schema ID, as returned by Schema Registry | Data serialized with the referenced schema |
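A minimal Java sketch of this framing (not the Confluent serializer itself, just an illustration of the byte layout; the `schemaId` would normally come back from Schema Registry when the schema is registered or looked up):

```java
import java.nio.ByteBuffer;

// Illustrative only: build and parse the 5-byte Schema Registry framing
// around an already-serialized Avro payload.
public class WireFormat {
    private static final byte MAGIC_BYTE = 0x0;

    static byte[] frame(int schemaId, byte[] avroBinary) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + 4 + avroBinary.length);
        buffer.put(MAGIC_BYTE);   // Byte 0: magic byte
        buffer.putInt(schemaId);  // Bytes 1-4: schema ID (big-endian)
        buffer.put(avroBinary);   // Bytes 5-…: Avro binary for that schema
        return buffer.array();
    }

    static int extractSchemaId(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown magic byte");
        }
        return buffer.getInt();   // ID to resolve against Schema Registry
    }
}
```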
2.2 Schema Evolution & Compatibility
| Compatibility Type | Changes allowed | Check against which schemas | Upgrade first |
|---|---|---|---|
| BACKWARD | Delete fields / Add optional fields | Last version | Consumers |
| BACKWARD_TRANSITIVE | Delete fields / Add optional fields | All previous versions | Consumers |
| FORWARD | Add fields / Delete optional fields | Last version | Producers |
| FORWARD_TRANSITIVE | Add fields / Delete optional fields | All previous versions | Producers |
| FULL | Add optional fields / Delete optional fields | Last version | Any order |
| FULL_TRANSITIVE | Add optional fields / Delete optional fields | All previous versions | Any order |
| NONE | All changes are accepted | Compatibility checking disabled | Depends |
- Tips for easy schema evolution in a non-TRANSITIVE world (see the sketch below):
  - make fields nullable
  - give fields default values (the default value can be null as well)
  - fields that are nullable and have default values can be removed
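For example, continuing the hypothetical `User` record used earlier: adding a nullable `country` field with a `null` default is both backward and forward compatible, so it passes a FULL check.

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "country", "type": ["null", "string"], "default": null}
  ]
}
```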
2.3 Schema Evolution & Kafka Connect performance
- Kafka Connect is open-source software from the Confluent Platform
- Schema compatibility can directly impact Kafka Connect performance

The story behind an outage resulting in 8 hours' worth of data being lost… Back in the day, a change was made to a Kafka event whose schema compatibility was set to NONE (it should have been set to FULL, or at least BACKWARD). As a result of this event modification, Kafka Connect performance started dropping, resulting in a lot of small files being uploaded to S3 and preventing Spark jobs from running efficiently.