An Optimization Problem modeling using Brazilian Flight Data

7 min readMay 13, 2021

The Brazilian flight data shared by their Civil Aviation Authority (ANAC) brings some airline marketing metrics, and also variables that enables one to recalculate these metrics. While testing the consistency of these values, I have arrived at a model optimization problem: what is the average weight for passengers that airlines use for their flight plans? Are they the same for Brazilian and foreign airlines?

We will answer these questions in this article.

The data used in this notebook may be found at:

Importing the libraries and data clean-up

NOTE: this section is exactly the same found in the EDA article below:

Exploratory Data Analysis on flight data from Brazil

santanasan.medium.com

First of all, let’s import the libraries we are going to use:

I am using the Seaborn library instead of matplotlib. I am also using the unidecode library to convert the column names to a more friendly format.

Now the files are loaded and merged into a single data frame.

Let’s look at the data.

EMPRESA (SIGLA) EMPRESA (NOME) EMPRESA (NACIONALIDADE)   ANO  MÊS  \
0             AAF     AIGLE AZUR             ESTRANGEIRA  2019    1   
1             AAF     AIGLE AZUR             ESTRANGEIRA  2019    1   
2             AAF     AIGLE AZUR             ESTRANGEIRA  2019    2   
3             AAF     AIGLE AZUR             ESTRANGEIRA  2019    2   
4             AAF     AIGLE AZUR             ESTRANGEIRA  2019    3   

  AEROPORTO DE ORIGEM (SIGLA) AEROPORTO DE ORIGEM (NOME)  \
0                        LFPO          ORLY (NEAR PARIS)   
1                        SBKP                   CAMPINAS   
2                        LFPO          ORLY (NEAR PARIS)   
3                        SBKP                   CAMPINAS   
4                        LFPO          ORLY (NEAR PARIS)   

  AEROPORTO DE ORIGEM (UF) AEROPORTO DE ORIGEM (REGIÃO)  \
0                      NaN                          NaN   
1                       SP                      SUDESTE   
2                      NaN                          NaN   
3                       SP                      SUDESTE   
4                      NaN                          NaN   

  AEROPORTO DE ORIGEM (PAÍS)  ... COMBUSTÍVEL (LITROS) DISTÂNCIA VOADA (KM)  \
0                     FRANÇA  ...                  NaN             149856.0   
1                     BRASIL  ...                  NaN             149856.0   
2                     FRANÇA  ...                  NaN             149856.0   
3                     BRASIL  ...                  NaN             149856.0   
4                     FRANÇA  ...                  NaN             159222.0   

  DECOLAGENS CARGA PAGA KM CARGA GRATIS KM CORREIO KM ASSENTOS    PAYLOAD  \
0       16.0   920725000.0             0.0        0.0   4592.0   770089.0   
1       16.0   263700000.0      25232000.0        0.0   4592.0   770089.0   
2       16.0   617173000.0             0.0        0.0   4592.0   770089.0   
3       16.0           0.0             0.0        0.0   4592.0   770089.0   
4       17.0   933032000.0             0.0        0.0   4879.0  1252270.0   

  HORAS VOADAS  BAGAGEM (KG)  
0       144,86           NaN  
1       227,34           NaN  
2       107,35           NaN  
3       267,29           NaN  
4       134,73           NaN  

[5 rows x 38 columns]

The following can be observed about the column names:

They are written in Portuguese and contain accentuation;
They are all in upper case letters;
They contain spaces and parenthesis.

To facilitate readability we will modify the column names by:

Replacing the spaces with underlines “_”;
Removing the parenthesis;
Making all letters lowercase; and
Removing the accents.

We will use the unidecode library to quickly make these changes.

This convention is called snake_case and it is commonly used. For more information, refer to: https://en.wikipedia.org/wiki/Snake_case

Column names before changes:

Index(['EMPRESA (SIGLA)', 'EMPRESA (NOME)', 'EMPRESA (NACIONALIDADE)', 'ANO',
       'MÊS', 'AEROPORTO DE ORIGEM (SIGLA)', 'AEROPORTO DE ORIGEM (NOME)',
       'AEROPORTO DE ORIGEM (UF)', 'AEROPORTO DE ORIGEM (REGIÃO)',
       'AEROPORTO DE ORIGEM (PAÍS)', 'AEROPORTO DE ORIGEM (CONTINENTE)',
       'AEROPORTO DE DESTINO (SIGLA)', 'AEROPORTO DE DESTINO (NOME)',
       'AEROPORTO DE DESTINO (UF)', 'AEROPORTO DE DESTINO (REGIÃO)',
       'AEROPORTO DE DESTINO (PAÍS)', 'AEROPORTO DE DESTINO (CONTINENTE)',
       'NATUREZA', 'GRUPO DE VOO', 'PASSAGEIROS PAGOS', 'PASSAGEIROS GRÁTIS',
       'CARGA PAGA (KG)', 'CARGA GRÁTIS (KG)', 'CORREIO (KG)', 'ASK', 'RPK',
       'ATK', 'RTK', 'COMBUSTÍVEL (LITROS)', 'DISTÂNCIA VOADA (KM)',
       'DECOLAGENS', 'CARGA PAGA KM', 'CARGA GRATIS KM', 'CORREIO KM',
       'ASSENTOS', 'PAYLOAD', 'HORAS VOADAS', 'BAGAGEM (KG)'],
      dtype='object')
Column names after changes:

Index(['empresa_sigla', 'empresa_nome', 'empresa_nacionalidade', 'ano', 'mes',
       'aeroporto_de_origem_sigla', 'aeroporto_de_origem_nome',
       'aeroporto_de_origem_uf', 'aeroporto_de_origem_regiao',
       'aeroporto_de_origem_pais', 'aeroporto_de_origem_continente',
       'aeroporto_de_destino_sigla', 'aeroporto_de_destino_nome',
       'aeroporto_de_destino_uf', 'aeroporto_de_destino_regiao',
       'aeroporto_de_destino_pais', 'aeroporto_de_destino_continente',
       'natureza', 'grupo_de_voo', 'passageiros_pagos', 'passageiros_gratis',
       'carga_paga_kg', 'carga_gratis_kg', 'correio_kg', 'ask', 'rpk', 'atk',
       'rtk', 'combustivel_litros', 'distancia_voada_km', 'decolagens',
       'carga_paga_km', 'carga_gratis_km', 'correio_km', 'assentos', 'payload',
       'horas_voadas', 'bagagem_kg'],
      dtype='object')

This looks better.

Airline metrics for efficiency and capacity

Since there is no data dictionary, let’s talk about some data features:

RPK meaning “Revenue Passenger Kilometers” is an air transport industry metric that aggregates the number of paying passengers and the quantity of kilometers traveled by them. It is calculated by multiplying the number of paying passengers by the distance traveled in kilometers.
ASK meaning “Available Seat Kilometers” is similar to the RPK but instead of using the paying passengers, the passenger capacity (number of seats available in the aircraft) is multiplied by the traveled distance.
RTK (for “Revenue Tonne Kilometers”) measures the revenue cargo load in tons multiplied by the distance flown in kilometers.
ATK (for “Available Tonne Kilometers”) measures the aircraft capacity of cargo load in tons multiplied by the distance flown in kilometers.

The dataset presents not only the value of these features but also the variables that compose their formula. Therefore, let’s make a consistency check, verifying it is possible to reproduce their values through the variables.

The formulas of the variables are:

The only unknown here is the AvgWeight variable. It proposes a challenge: what if we calculate the AvgWeight that gives the minimum error between the given RTK and the calculated RTK?

This is an optimization problem that we will define below:

Let’s define the optimization function (with some margin of error) and use the library Scipy to optimize this problem.

fun: 2.477700693756194e-05
 message: 'Solution found.'
    nfev: 25
  status: 0
 success: True
       x: 75.0006857462938

Great, the result is 75. Let’s apply it and calculate the consistency of the calculated RTK with the RTK value provided by the data set.

The number of rtk values that correspond to rtk calculation is: 56.28%

We can see that the consistency is a little over 50%.

One disadvantage of this calculated RTK is that the same average weight value (75 kg) was used for passengers of all airlines. This assumption implies that Brazilian and foreign companies use the same value for passenger weight for flight planning purposes.

Let’s check this assumption by observing if being either a Brazilian or foreign airline effects the relationship between reported RTK and calculated RTK:

We can see clearly that the line y=x has many Brazilian airlines into it. Also, there is a second line below the y=x line — made mostly of foreign airlines — , suggesting a different tendency for some of them.

Let’s improve the optimization problem by considering this fact. The optimization function defined above will be split in two: one to optimize the weight for Brazilian airlines and the other one for foreign airlines.

fun: 2.5802456393848696e-05
 message: 'Solution found.'
    nfev: 27
  status: 0
 success: True
       x: 75.00044845613596     fun: 0.00028669724770642203
 message: 'Solution found.'
    nfev: 22
  status: 0
 success: True
       x: 90.0005090318264

By optimizing the error between RKT and calculated RTK for Brazilian airlines and foreign airlines separately, we arrive at the following values:

Brazilian airlines have 75kg as the best average value for passenger weight;
Foreign airlines have 90kg as the best average value for passenger weight.

With this knowledge, let’s calculate again the RTK:

The number of rtk values that correspond to rtk calculation is: 58.90%

We see now that the match of RTK values passed from 56.28% to 58.90%. Let’s also reprint the previous graphic with the corrected calculated RTK.

We can see that the second tendency line is gone, since we have took into consideration its behavior in our model.

It would be very interesting to find other behaviors to use in this optimization problem. Other variables, however, are not clearly related to clusters in the model to account for their use.

Out of curiosity, let’s check a few examples.

As Data Scientists, we should verify we have questioned all our assumptions and made all the confirmations before accepting a model as the best outcome possible.

What can be done to pursue the other 40%+ consistency:

The percentage of RTK parameters equal to NaN is 13.31%.
The percentage of RTK parameters equal to zero is 14.64%. It means that some non-zero variable was provided as zero. We can see these values in the y=0 line of the graphics.

So almost 28% is due to misreported / unreported data.

For the remaining around 12% of “inconsistent” data, we can verify if there are clusters of data associated with the graphic above and what association these clusters have (perhaps a different value for AverageWeight).

When problems similar to this are presented, the Scipy library is an excellent tool, being able to solve even more complex problems such as multivariate optimizations.

Don’t forget to check the article in which the other data of this data set is explored, and we verify the airports most affected by travel restrictions in 2020. The link is:

Exploratory Data Analysis on flight data from Brazil

santanasan.medium.com

See you there!

An Optimization Problem modeling using Brazilian Flight Data

Importing the libraries and data clean-up

Exploratory Data Analysis on flight data from Brazil

Airline metrics for efficiency and capacity

Exploratory Data Analysis on flight data from Brazil

Written by Thiago Santana