Ops-Talks #02 - AVRO

📢 Special thanks to our speaker: Karim Lamouri

1.1 Avro - What is it?

  • Avro is used for serialization and deserialization of payloads
  • More compact than JSON
  • Provides a container file format that stores both the schema and the payload
  • Allows RPC calls
  • Fast serialization and smaller than JSON (field names are not repeated in every record, since the schema is written once at the top of the file)
  • Allows schema documentation
  • A format that is easy to leverage in Spark or ETL pipelines

You can always deserialize an Avro container file, as the schema is embedded in the file itself!
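As a minimal sketch in Java, assuming the Callback schema from the next section is saved as callback.avsc (the file names here are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class ContainerFileDemo {
  public static void main(String[] args) throws Exception {
    // Assumed: the Callback schema from section 1.2, saved locally
    Schema schema = new Schema.Parser().parse(new File("callback.avsc"));

    GenericRecord record = new GenericData.Record(schema);
    record.put("callback_uuid", "123e4567-e89b-12d3-a456-426614174000");
    record.put("target_url", "https://example.com/callback");

    File file = new File("callbacks.avro");

    // The schema is written once in the file header; each record is
    // appended as compact binary, so field names are never repeated
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(record);
    }

    // No external schema needed: the reader takes it from the file header
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord r : reader) {
        System.out.println(r);
      }
    }
  }
}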

1.2 Avro - Basics

  • Avro Schema
{
  "namespace": "com.gumgum.avro.verity",
  "type": "record",
  "name": "Callback",
  "fields": [
    {
      "name": "callback_uuid",
      "type": "string"
    },
    {
      "name": "target_url",
      "type": "string"
    }
  ]
}

1.3 Avro - Avro Protocol

{
  "protocol": "Callback",
  "namespace": "com.gumgum.avro.verity",
  "types": [
    {
      "type": "record",
      "name": "Callback",
      "fields": [
        {
          "name": "callback_uuid",
          "type": "string"
        },
        {
          "name": "target_url",
          "type": "string"
        }
      ]
    }
  ],
  "messages": {}
}

An Avro protocol allows the definition of RPC functions inside the "messages" map (left empty above). But the protocol is still defined in JSON, which is not ideal to read and document.
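For illustration, here is what an RPC message could look like inside the "messages" map; the sendCallback name and its signature are made up for this example:

"messages": {
  "sendCallback": {
    "request": [ { "name": "callback", "type": "Callback" } ],
    "response": "string"
  }
}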

1.4 Avro - Avro IDL - Higher-level representation

  • Avro IDL was introduced by the Apache Avro project
  • Much more concise to write
  • Similar to C structures
  • Easy definition of default values
  • Easy documentation
  • Allows imports
  • Automatic generation of Java POJOs
  • POJOs and other classes generated from Avro enforce strong consistency on which fields are mandatory, optional, or nullable (see the builder sketch after the examples below). This is especially important in a microservice world where events are at the heart of the data flow.
@namespace("com.gumgum.avro.verity")

protocol Callback {
  record Callback {
    string callback_uuid;
    string target_url;
  }
}
  • Example with a default value and a documentation comment:
@namespace("com.gumgum.avro.verity")

protocol Callback {
  /** This is the documentation for the Callback record */
  record Callback {
    string callback_uuid = "uuid";
    string target_url;
  }
}
  • Example with imports:
@namespace("com.gumgum.avro.verity")

protocol Callback {
  import idl "address.avdl";

  /** This is the documentation for the Callback record */
  record Callback {
    string callback_uuid = "uuid";
    string target_url;
    Address callback_address;
  }
}
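To illustrate the consistency guarantees mentioned above: once the IDL is compiled to Java (for example with avro-tools or the avro-maven-plugin), the generated Callback class exposes a builder that applies defaults and rejects incomplete events. A minimal sketch based on the second example above (the exact generated API depends on the Avro version):

import com.gumgum.avro.verity.Callback;

public class BuilderDemo {
  public static void main(String[] args) {
    // callback_uuid is omitted on purpose: since the IDL gives it the
    // default value "uuid", the builder fills it in automatically
    Callback ok = Callback.newBuilder()
        .setTargetUrl("https://example.com/callback")
        .build();

    // target_url has no default and is not nullable, so skipping it
    // makes build() throw an AvroRuntimeException instead of emitting
    // a half-initialized event
    Callback broken = Callback.newBuilder().build();
  }
}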

2.1 Schema Registry

  • Schema Registry is an open-source serving layer for Avro schemas provided by the Confluent Platform

  • Schema registry is in charge of:

    • Registering schemas
    • Returning the ID of the schema
  • A schema is a living object: it changes over time as fields get added, removed, or updated

  • In a Kafka world, Avro schemas are not sent along with the payload; instead, a "pointer" to the schema is passed down.

  • Kafka client libraries (Confluent SerDes or librdkafka) prepend 5 bytes (the "pointer") to the Avro binary to inject the schema information:

Byte 0     | Bytes 1-4                                        | Bytes 5-…
Magic byte | 4-byte schema ID as returned by Schema Registry  | Serialized data for the specified schema
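A minimal sketch of how this header can be peeled off on the consumer side, assuming Confluent's standard wire format (the class and method names are illustrative):

import java.nio.ByteBuffer;

public class WireFormat {

  // Returns the schema ID so a consumer can fetch the schema from
  // Schema Registry and decode the remaining bytes
  public static int extractSchemaId(byte[] kafkaValue) {
    ByteBuffer buffer = ByteBuffer.wrap(kafkaValue); // big-endian by default
    byte magicByte = buffer.get();                   // byte 0: always 0
    if (magicByte != 0) {
      throw new IllegalArgumentException("Unknown magic byte: " + magicByte);
    }
    return buffer.getInt(); // bytes 1-4: the 4-byte schema ID
    // buffer position is now 5: the start of the Avro-serialized payload
  }
}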

2.2 Schema Evolution & Compatibility

Compatibility type  | Changes allowed                               | Check against which schemas    | Upgrade first
BACKWARD            | Delete fields / Add optional fields           | Last version                   | Consumers
BACKWARD_TRANSITIVE | Delete fields / Add optional fields           | All previous versions          | Consumers
FORWARD             | Add fields / Delete optional fields           | Last version                   | Producers
FORWARD_TRANSITIVE  | Add fields / Delete optional fields           | All previous versions          | Producers
FULL                | Add optional fields / Delete optional fields  | Last version                   | Any order
FULL_TRANSITIVE     | Add optional fields / Delete optional fields  | All previous versions          | Any order
NONE                | All changes are accepted                      | Compatibility checking disabled | Depends
  • Tips for easy schema evolution in a non-TRANSITIVE world (an IDL sketch follows this list), fields should:
    • be nullable
    • have default values (the default value can be null as well)
    • fields that are both nullable and have a default value can be removed safely
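In Avro IDL, a field following these tips could look like this (the correlation_id field is purely illustrative):

record Callback {
  string callback_uuid = "uuid";
  string target_url;
  // Nullable with a null default: can be added or removed without
  // breaking BACKWARD or FORWARD compatibility
  union { null, string } correlation_id = null;
}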

2.3 Schema Evolution & Kafka Connect Performance

The story behind an outage that resulted in 8 hours' worth of lost data… Back in the day, a change was made to a Kafka event whose schema compatibility was set to NONE (when it should have been FULL, or at least BACKWARD). As a result of this event modification, Kafka Connect performance started dropping, resulting in a lot of small files being uploaded to S3 and preventing Spark jobs from running efficiently.

Resources