

10. Jupyter and Excel

In this lecture we will talk about:

  1. the relationship between Jupyter and Microsoft Excel;
  2. loading tables from Excel files; and
  3. writing tables to Excel files.

10.1. Why Jupyter, and why Excel

Microsoft Excel is the most popular spreadsheet program in the world. It owes its popularity to the fact that the table you are working on is right there in front of you, you can see it, you can click on a cell and enter a value or a formula. It is a typical representative of the What You See Is What You Get philosophy. So, why did we decide to focus this course on Jupyter?

  • Price. Microsoft Excel is a commercial product -- it costs money. In contrast, Python, all of its accompanying libraries, and Jupyter as the interactive environment are free of charge.

  • Clearly visible procedures. Data processing in Microsoft Excel consists of entering formulas into cells. If you are working with a table with intricate relationships between cells, expressed by many complicated formulas spread out over the entire worksheet, it soon becomes almost impossible to track the flow of information and, more importantly, to understand, debug and improve the process. On the other hand, if the processing is expressed in a programming language (such as Python), we do lose the What You See Is What You Get approach of Excel, but we gain much more in readability of the code. A clear procedure located in one place (a Jupyter cell or a Python file) and coded in a simple and expressive programming language can easily be checked for errors, upgraded and shared.

  • Flexibility. Microsoft Excel is convenient for processing tables that are relatively small so that they can easily fit onto a few computer screens. Once you find yourself in the position where you have to process huge tables with thousands of rows and columns the advantages of scripting languages become obvious. Moreover, each Python distribution comes with a large entourage of libraries where most of the standard data processing algorithms have already been implemented.

Using clearly visible procedures that are kept separate from the data they analyze is the most efficient way to process data. This is the cornerstone of any approach to modern data processing.

10.2. Loading tables from local Excel files

An Excel document consists of one or more worksheets. Each worksheet is a table that can be accessed through its name:

Worksheets

Because Microsoft Excel is the most popular spreadsheet program in the world, the pandas library provides a way to load a worksheet of an Excel document into a DataFrame. If an Excel document consists of several worksheets, we load it as several DataFrames -- one DataFrame per worksheet.
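If we need all the worksheets at once, read_excel can fetch them in a single call. Here is a minimal sketch, assuming a hypothetical file data/SomeDocument.xlsx with more than one worksheet; passing sheet_name=None returns a dictionary that maps each worksheet name to its DataFrame:

import pandas as pd

# Hypothetical example file; sheet_name=None loads every worksheet
# and returns a dictionary {worksheet name: DataFrame}.
sheets = pd.read_excel("data/SomeDocument.xlsx", sheet_name=None)
for name, table in sheets.items():
    print(name, table.shape)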

For example, the file data/Additives.xlsx has a single worksheet "E-numbers" which we load into a DataFrame straightforwardly:

In [1]:
import pandas as pd
additives = pd.read_excel("data/Additives.xlsx", sheet_name="E-numbers")

This file contains some basic information about additives, which are substances used in the food industry to preserve food or to enhance its color and taste. Let us peek at the table:

In [2]:
additives.head(15)
Out[2]:
E-number Status Comment
0 100 harmless NaN
1 101 harmless NaN
2 102 DANGEROUS NaN
3 104 harmless NaN
4 105 harmless NaN
5 110 DANGEROUS NaN
6 111 harmless NaN
7 120 DANGEROUS NaN
8 121 harmless NaN
9 123 DANGEROUS NaN
10 124 DANGEROUS NaN
11 125 lab trials NaN
12 130 harmless NaN
13 131 HARMFUL may cause cancer
14 132 harmless NaN

The cells that were empty in the Excel table get a special NaN value, which stands for "not a number". Since in our case these cells represent comments, we want the empty cells to remain empty. So, we shall reload the table, but this time instruct the system not to convert empty cells into NaN values:

In [3]:
additives = pd.read_excel("data/Additives.xlsx", sheet_name="E-numbers", na_filter=False)
additives.head(15)
Out[3]:
E-number Status Comment
0 100 harmless
1 101 harmless
2 102 DANGEROUS
3 104 harmless
4 105 harmless
5 110 DANGEROUS
6 111 harmless
7 120 DANGEROUS
8 121 harmless
9 123 DANGEROUS
10 124 DANGEROUS
11 125 lab trials
12 130 harmless
13 131 HARMFUL may cause cancer
14 132 harmless

The option na_filter=False instructs the read_excel function to "switch off artificial intelligence" -- that is, to skip its missing-value detection -- and leave empty cells empty. Let us now make a frequency analysis based on the harmfulness of additives.

In [4]:
additives["Status"].value_counts()
Out[4]:
HARMFUL       33
harmless      29
lab trials    10
DANGEROUS      5
Name: Status, dtype: int64

Let us now filter the table to single out additives that may cause cancer:

In [5]:
additives[additives.Comment == "may cause cancer"]
Out[5]:
E-number Status Comment
13 131 HARMFUL may cause cancer
17 142 HARMFUL may cause cancer
28 210 HARMFUL may cause cancer
29 211 HARMFUL may cause cancer
30 213 HARMFUL may cause cancer
31 214 HARMFUL may cause cancer
32 215 HARMFUL may cause cancer
33 216 HARMFUL may cause cancer
34 217 HARMFUL may cause cancer
45 239 HARMFUL may cause cancer
55 330 HARMFUL may cause cancer

Finally, let us list the additives that are marked as DANGEROUS or may cause cancer. To do so we have to combine two filtering criteria:

Comment == "may cause cancer"  or  Status == "DANGEROUS" (or both)

When we have to combine two criteria so that a row is included in the filtered table if at least one of the criteria is satisfied, we use the | connector:

In [6]:
additives[(additives.Comment == "may cause cancer") | (additives.Status == "DANGEROUS")]
Out[6]:
E-number Status Comment
2 102 DANGEROUS
5 110 DANGEROUS
7 120 DANGEROUS
9 123 DANGEROUS
10 124 DANGEROUS
13 131 HARMFUL may cause cancer
17 142 HARMFUL may cause cancer
28 210 HARMFUL may cause cancer
29 211 HARMFUL may cause cancer
30 213 HARMFUL may cause cancer
31 214 HARMFUL may cause cancer
32 215 HARMFUL may cause cancer
33 216 HARMFUL may cause cancer
34 217 HARMFUL may cause cancer
45 239 HARMFUL may cause cancer
55 330 HARMFUL may cause cancer

10.3. Writing tables to Excel files

A table can be written to an Excel file just like we used to write tables to CSV files. The only difference is that instead of the to_csv function we invoke the to_excel function. For example, let us create a table containing the additives that are labelled DANGEROUS or may cause cancer:

In [7]:
bad_additives = additives[(additives.Comment == "may cause cancer") | (additives.Status == "DANGEROUS")]

and let us write the table into an Excel file:

In [8]:
bad_additives.to_excel("data/BadAdditives.xlsx")

Let's take a look at the Excel file we've just created:

Excel file

We see that the system has also written the index column of the table, which in this case is just a list of meaningless integers. To get rid of it, we'll write the table again, but this time using the option index=False:

In [9]:
bad_additives.to_excel("data/BadAdditives.xlsx", index=False)

The new file looks like this:

Excel file

That's exactly what we wanted.

10.4. Exercises

Exercise 1. The file data/CS201.xlsx contains an overview of the marks of a group of students in Computer Science 201. The data is real, so the table is anonymized.

(a) Load this table into a DataFrame and take a look at the first few rows to understand the structure of the table ("Hnn" stands for "homework nn", "Cn" stands for "colloquium n", "WE" stands for "written part of the exam" and "OE" stands for "oral part of the exam").

(b) Index the table by "StudentID".

(c) Compute the average mark on each of the colloquia (columns "C1", "C2" and "C3").

(d) Add a new column "Avg" and for each student compute the average mark and write it into the corresponding cell.

(e) Add a new column "FinalGrade" and for each student compute the final grade based on the average mark using the following function:

In [10]:
def final_grade(avg):
    if avg >= 4.50:
        return 5
    elif avg >= 3.50:
        return 4
    elif avg >= 2.50:
        return 3
    elif avg >= 1.50:
        return 2
    else:
        return 1

(f) Write the new table into the Excel file data/CS201-FinalGrades.xlsx
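One possible way to tackle parts (b)-(f) is sketched below. This is not the only solution; it assumes the column names listed above, that the final_grade function from the previous cell has been defined, and that the average in part (d) is taken over all numeric assessment columns:

import pandas as pd

marks = pd.read_excel("data/CS201.xlsx")                 # (a)
marks = marks.set_index("StudentID")                     # (b)

print(marks[["C1", "C2", "C3"]].mean())                  # (c) average mark per colloquium

# (d) assumption: the average is computed over all numeric columns
marks["Avg"] = marks.mean(axis=1, numeric_only=True)

marks["FinalGrade"] = marks["Avg"].apply(final_grade)    # (e)

marks.to_excel("data/CS201-FinalGrades.xlsx")            # (f)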

Exercise 2. Eurostat is an official European agency in charge of the statistical analyses of various data relevant to the development of the European Union. All the data Eurostat collects is publicly available at the following link: https://ec.europa.eu/eurostat/data/database

The file data/EUProjPop.xlsx contains the projected population of each of the EU countries up to the year 2080. The table has two worksheets: Baseline, containing the projected population of the EU countries, and Migration, containing the projected population of the EU countries in the case of increased migration.

(a) Load these two worksheets into two DataFrames (Baseline and Migration) and display a few rows of each table to understand the structure of the tables.

(b) Add a new row "EU" to each of the tables and compute the projected population of the entire union for each year.

(c) Add a new row to the Migration table and compute the projected migration for each of the years (subtract the row EU in the Baseline table from the row EU in the Migration table).

(d) Visualize the projected migration by a line chart.

(e) Add a new row "EU-UK" to the Baseline table and compute the projected population of EU without the UK.

(f) Write the two DataFrames to data/EU-UK.xlsx and data/EU-Migration.xlsx
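A sketch of one possible solution is given below. It assumes that each worksheet has one row per country (indexed by the country name) and one column per projected year, and that the UK row is labelled "United Kingdom"; adjust the names to the actual structure of the file:

import pandas as pd

baseline  = pd.read_excel("data/EUProjPop.xlsx", sheet_name="Baseline",  index_col=0)
migration = pd.read_excel("data/EUProjPop.xlsx", sheet_name="Migration", index_col=0)

# (b) total projected population of the union for every year
baseline.loc["EU"]  = baseline.sum()
migration.loc["EU"] = migration.sum()

# (c) projected migration for each year: Migration EU row minus Baseline EU row
migration.loc["Projected migration"] = migration.loc["EU"] - baseline.loc["EU"]

# (d) line chart of the projected migration (drawn inline in Jupyter)
migration.loc["Projected migration"].plot(kind="line")

# (e) projected population of the EU without the UK (row label is an assumption)
baseline.loc["EU-UK"] = baseline.loc["EU"] - baseline.loc["United Kingdom"]

# (f) write the two tables to new Excel files
baseline.to_excel("data/EU-UK.xlsx")
migration.to_excel("data/EU-Migration.xlsx")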

Exercise 3. The file data/Cricket.xlsx contains data about the best players in the history of cricket.

(a) Load this table into a DataFrame and index it by the column "Player".

(b) Add new column "YP" (Years Played) to the table and compute the number of years of active playing for each player (subtract the column "From" from the column "To").

(c) Add new column "ARY" (Average Runs per Year) to the table and for each player compute the average number of runs per year (ARY = Runs / YP).

(d) Sort the table by "ARY" in the descending oreder and display the first 25 rows of the table. In what century were most of the top 25 players playing actively? What do you thik why?

© 2019 Petlja.org Creative Commons License