Exploring Air Transport Data in Brazil

Thiago Santana
9 min readMay 13, 2021

One of the many industries impacted by the COVID-19 pandemic was the air transport industry. Due to the travel restrictions, air transport demand was suddenly halted and is still catching up — whereas, the air cargo market is blooming.

But how affected was aviation? The data shared by ANAC — the Brazilian Civil Aviation Authority — gives us opportunity to answer this question, for the Brazilian case. Let’s explore this data and get some insights.

The data used in this article may be found at:

Importing the libraries and data clean-up

First of all, let’s import the libraries we are going to use:

Some comments:

  • For plotting, we are using the Seaborn library. For more information, refer to the library website.
  • I am also using the unidecode library to convert the column names to a more friendly format.

Now the files are loaded and merged into a single data frame.

Let’s look at the data.

EMPRESA (SIGLA) EMPRESA (NOME) EMPRESA (NACIONALIDADE)   ANO  MÊS  \
0 AAF AIGLE AZUR ESTRANGEIRA 2019 1
1 AAF AIGLE AZUR ESTRANGEIRA 2019 1
2 AAF AIGLE AZUR ESTRANGEIRA 2019 2
3 AAF AIGLE AZUR ESTRANGEIRA 2019 2
4 AAF AIGLE AZUR ESTRANGEIRA 2019 3

AEROPORTO DE ORIGEM (SIGLA) AEROPORTO DE ORIGEM (NOME) \
0 LFPO ORLY (NEAR PARIS)
1 SBKP CAMPINAS
2 LFPO ORLY (NEAR PARIS)
3 SBKP CAMPINAS
4 LFPO ORLY (NEAR PARIS)

AEROPORTO DE ORIGEM (UF) AEROPORTO DE ORIGEM (REGIÃO) \
0 NaN NaN
1 SP SUDESTE
2 NaN NaN
3 SP SUDESTE
4 NaN NaN

AEROPORTO DE ORIGEM (PAÍS) ... COMBUSTÍVEL (LITROS) DISTÂNCIA VOADA (KM) \
0 FRANÇA ... NaN 149856.0
1 BRASIL ... NaN 149856.0
2 FRANÇA ... NaN 149856.0
3 BRASIL ... NaN 149856.0
4 FRANÇA ... NaN 159222.0

DECOLAGENS CARGA PAGA KM CARGA GRATIS KM CORREIO KM ASSENTOS PAYLOAD \
0 16.0 920725000.0 0.0 0.0 4592.0 770089.0
1 16.0 263700000.0 25232000.0 0.0 4592.0 770089.0
2 16.0 617173000.0 0.0 0.0 4592.0 770089.0
3 16.0 0.0 0.0 0.0 4592.0 770089.0
4 17.0 933032000.0 0.0 0.0 4879.0 1252270.0

HORAS VOADAS BAGAGEM (KG)
0 144,86 NaN
1 227,34 NaN
2 107,35 NaN
3 267,29 NaN
4 134,73 NaN

[5 rows x 38 columns]

Some observations about the column names:

  • They are written in Portuguese and contain accentuation;
  • They are all in upper case letters;
  • They contain spaces and parenthesis.

To facilitate readability we will modify the column names by:

  • Replacing the spaces with underlines “_”;
  • Removing the parenthesis;
  • Making all letters lowercase; and
  • Removing the accents.

We will use the unidecode library to quickly make these changes.

This convention is called snake_case and it is commonly used. For more information, refer to: https://en.wikipedia.org/wiki/Snake_case

Column names before changes:

Index(['EMPRESA (SIGLA)', 'EMPRESA (NOME)', 'EMPRESA (NACIONALIDADE)', 'ANO',
'MÊS', 'AEROPORTO DE ORIGEM (SIGLA)', 'AEROPORTO DE ORIGEM (NOME)',
'AEROPORTO DE ORIGEM (UF)', 'AEROPORTO DE ORIGEM (REGIÃO)',
'AEROPORTO DE ORIGEM (PAÍS)', 'AEROPORTO DE ORIGEM (CONTINENTE)',
'AEROPORTO DE DESTINO (SIGLA)', 'AEROPORTO DE DESTINO (NOME)',
'AEROPORTO DE DESTINO (UF)', 'AEROPORTO DE DESTINO (REGIÃO)',
'AEROPORTO DE DESTINO (PAÍS)', 'AEROPORTO DE DESTINO (CONTINENTE)',
'NATUREZA', 'GRUPO DE VOO', 'PASSAGEIROS PAGOS', 'PASSAGEIROS GRÁTIS',
'CARGA PAGA (KG)', 'CARGA GRÁTIS (KG)', 'CORREIO (KG)', 'ASK', 'RPK',
'ATK', 'RTK', 'COMBUSTÍVEL (LITROS)', 'DISTÂNCIA VOADA (KM)',
'DECOLAGENS', 'CARGA PAGA KM', 'CARGA GRATIS KM', 'CORREIO KM',
'ASSENTOS', 'PAYLOAD', 'HORAS VOADAS', 'BAGAGEM (KG)'],
dtype='object')
Column names after changes:

Index(['empresa_sigla', 'empresa_nome', 'empresa_nacionalidade', 'ano', 'mes',
'aeroporto_de_origem_sigla', 'aeroporto_de_origem_nome',
'aeroporto_de_origem_uf', 'aeroporto_de_origem_regiao',
'aeroporto_de_origem_pais', 'aeroporto_de_origem_continente',
'aeroporto_de_destino_sigla', 'aeroporto_de_destino_nome',
'aeroporto_de_destino_uf', 'aeroporto_de_destino_regiao',
'aeroporto_de_destino_pais', 'aeroporto_de_destino_continente',
'natureza', 'grupo_de_voo', 'passageiros_pagos', 'passageiros_gratis',
'carga_paga_kg', 'carga_gratis_kg', 'correio_kg', 'ask', 'rpk', 'atk',
'rtk', 'combustivel_litros', 'distancia_voada_km', 'decolagens',
'carga_paga_km', 'carga_gratis_km', 'correio_km', 'assentos', 'payload',
'horas_voadas', 'bagagem_kg'],
dtype='object')

This looks better. Let’s add some new columns to this data frame, to support our analysis:

  • Since we are looking for a chronologic observation, it is interesting to concatenate the calendar months and years into a single variable called ‘data’ (Portuguese for date. I am keeping Portuguese names for consistency).
  • Let’s also add a column named ‘quarto’ (Portuguese for quarter) to concatenate variables around the four quarters of the year.
  • We can also infer the routes from the origin and destination airport variables (respectively called aeroporto_de_origem_sigla and aeroporto_de_destino_sigla). A variable named ‘rota’ (Portuguese for route) will be created to store an ‘origin->destination’ string. Another variable with the names of the airports (instead of the codes) will be created (and called ‘rota_nome’) for readability.
  • Dividing RPK for ASK we get the load factor, which is a very important metric for airlines economics. This variable will also be created.

Airline metrics for efficiency and capacity

Since there is no data dictionary, let’s talk about some data features:

  • RPK meaning “Revenue Passenger Kilometers” is an air transport industry metric that aggregates the number of paying passengers and the quantity of kilometers traveled by them. It is calculated by multiplying the number of paying passengers by the distance traveled in kilometers.
  • ASK meaning “Available Seat Kilometers” is similar to the RPK but instead of using the paying passengers, the passenger capacity (number of seats available in the aircraft) is multiplied by the traveled distance.
  • RTK (for “Revenue Tonne Kilometers”) measures the revenue cargo load in tons multiplied by the distance flown in kilometers.
  • ATK (for “Available Tonne Kilometers”) measures the aircraft capacity of cargo load in tons multiplied by the distance flown in kilometers.

The data frame presents not only the value of these parameters but also the variables that compose their formula. Therefore, let’s make a consistency check, verifying it is possible to reproduce their values through the variables.

The formulas of the variables are:

The number of rpk values that correspond to rpk calculation is: 70.51%
The number of ask values that correspond to ask calculation is: 86.68%
The number of rtk values that correspond to rtk calculation is: 56.28%
The number of atk values that correspond to atk calculation is: 86.13%

We can see that the consistency is variable, and is specifically lower for RTK values.

One disadvantage of this calculated RTK is that the same average weight value (75 kg) was used for passengers of all airlines. This assumption implies that Brazilian and foreign companies use the same value for passenger weight for flight planning purposes.

Let’s check this assumption by observing if being either a Brazilian or foreign airline effects the relationship between reported RTK and calculated RTK:

png

We can see clearly that the line y=x has many Brazilian airlines into it. Also, there is a second line below the y=x line — made mostly of foreign airlines — , suggesting a different tendency for some of them.

By optimizing the error between RKT and calculated RTK for Brazilian airlines and foreign airlines separately, we arrive at the following values:

  • Brazilian airlines have 75kg as the best average value for passenger weight;
  • Foreign airlines have 90kg as the best average value for passenger weight.

These numbers come from an optimization exercise is found in the article below:

With this knowledge, let’s calculate again the RTK:

The number of rtk values that correspond to rtk calculation is: 58.90%

We see now that the match of RTK values passed from 56.28% to 58.93%. Let’s also reprint the previous graphic with the corrected calculated RTK.

png

The second tendency line is gone, since its behavior is now considered in the model.

Evolution of number of flights

After the consistency check, let’s take a look on how the number of flights has evolved through time.

png

We see that number of flights diminishes drastically after March 2020. Interestingly enough, this should not be credited only to the pandemic effect, since January is the month with most flights in an year.

Let’s verify what are the busiest airports (according to number of takeoffs) in Brazil, according to ANAC’s data:

decolagens
aeroporto_de_origem_nome
GUARULHOS 233420.0
RIO DE JANEIRO 139209.0
SÃO PAULO 124867.0
CAMPINAS 103436.0
BRASÍLIA 101135.0
CONFINS 80310.0
RECIFE 63420.0
PORTO ALEGRE 52173.0
SALVADOR 50196.0
SÃO JOSÉ DOS PINHAIS 45537.0
png

Finally, let’s verify which airports have been most affected by the travel restrictions.

We define “most affected” by the difference between the quantity of flights in the first quarter of 2020 (when COVID-19 flight restrictions started) and the first quarter of 2019 (one year before). With that we account for the seasonality of the air traffic. The calculation is made for each airport, in difference of flight quantities and in percentage.

Absolute numbers are important to observe since an airport that had one flight in 2019 and had zero in 2020 has lost 100% of its flights, but has lost only one flight. We’d like to avoid accounting cases like this as the most impactful, since they shouldn’t be.

quarter                   2019-Q1  2019-Q2  2019-Q3  2019-Q4  2020-Q1  \
aeroporto_de_origem_nome
GUARULHOS 35860.0 31179.0 35204.0 36139.0 33677.0
SÃO PAULO 21869.0 20541.0 21562.0 22668.0 19916.0
RIO DE JANEIRO 23031.0 19658.0 22195.0 24090.0 22097.0
BRASÍLIA 15672.0 14208.0 15204.0 15889.0 14281.0
CONFINS 12448.0 12108.0 12994.0 12646.0 11160.0
CAMPINAS 13954.0 14029.0 14262.0 13603.0 12132.0
RECIFE 9177.0 8359.0 9012.0 9040.0 8623.0
PORTO ALEGRE 8198.0 7702.0 8538.0 8756.0 7970.0
SÃO JOSÉ DOS PINHAIS 7435.0 7173.0 7757.0 7942.0 6816.0
SALVADOR 8117.0 5821.0 7298.0 7740.0 7434.0

quarter 2020-Q2 2020-Q3 2020-Q4 2021-Q1 delta2020 \
aeroporto_de_origem_nome
GUARULHOS 5437.0 13350.0 21906.0 20668.0 25742.0
SÃO PAULO 459.0 2188.0 7424.0 8240.0 20082.0
RIO DE JANEIRO 1598.0 5665.0 10914.0 9961.0 18060.0
BRASÍLIA 1256.0 5643.0 9963.0 9019.0 12952.0
CONFINS 1172.0 3206.0 7002.0 7574.0 10936.0
CAMPINAS 4170.0 7473.0 11871.0 11942.0 9859.0
RECIFE 1218.0 3291.0 6839.0 7861.0 7141.0
PORTO ALEGRE 721.0 1834.0 4217.0 4237.0 6981.0
SÃO JOSÉ DOS PINHAIS 697.0 1430.0 3289.0 2998.0 6476.0
SALVADOR 709.0 2099.0 5146.0 5832.0 5112.0

quarter delta_perc2020 delta_perc2021
aeroporto_de_origem_nome
GUARULHOS 82.561981 42.364752
SÃO PAULO 97.765445 62.321094
RIO DE JANEIRO 91.870994 56.749598
BRASÍLIA 91.159910 42.451506
CONFINS 90.320449 39.154884
CAMPINAS 70.275857 14.418805
RECIFE 85.428879 14.340198
PORTO ALEGRE 90.638795 48.316663
SÃO JOSÉ DOS PINHAIS 90.283006 59.677202
SALVADOR 87.819962 28.150795

We can see that the list of most impacted airports (refer to variable ‘delta_perc2020’) is identical to the list of most busy airports (except for the order).

We can also see, by comparing the first quarter of 2021 with the first quarter of 2019 (through the variable delta_perc2021) that the gap is closing for all airports, and airports like Campinas and Recife have currently only 14% less flights than they had two years ago — which is very good news for the aviation in Brazil.

As of now, the COVID-19 vaccination is progressing and ICAO and the airlines are make decisions focused on the safety of air travel and public health. Therefore, measures are being taken to allow these air travel numbers to grow as fast and responsibly as possible. I look forward to that!

--

--

Thiago Santana

I am a data driven Customer Service professional with an aerospace industry background. Always looking for win-win situations and effective communication.