# Data Analysis Apart from Excel – Getting Started

March 4th, 2019

tl;dr This post is about how to analyse data in a very efficient manner. Try out Python with pandas for data analysis and matplotlib for visualisation.

## Introduction – How can I get insight into my data?

Our daily life brings us tons of data. Our smartphones, a collection of ten sensors or even more, follow us everywhere we go to. Another example is our credit or debit card account. Everytime we pay, we leave an entry in the record. Depending upon the business we work in, there are much more opportunities for data aggregation. Among others, these are:

• Measurements
• Calculations
• Simulations
• Customer data (country, region, zip code, sales volume, branch, etc.)
• Weather forecast
• Utilisation of servers, vehicles, tools, etc.
• Accounting, stock market, etc.

I would like to show an easy way to understand data, to analyse the past, and to find trends for the future. However, I do not like to employ spreadsheets for this task, since data wrangling with cells and clicks is tedious and prone to errors.

## Data Analysis – How do I analyse my data efficiently?

There are several approaches for data analysis. Beside spreadsheets, there are several scripting languages which are famous for their statistics features, such as R and S. However, the language Python is more convenient as it is not restricted to statistics but an universal programming language allowing coding of almost all tasks from a simple calculator until a web server. The primary focus of their developers is easy readability, which is the reason why people sometimes say that Python is executable pseudo-code. Moreover, it is open-source and extendible with modules. Here, we would like to employ:

• pandas
• NumPy
• matplotlib

Most of the features we will use here are taken from pandas. This module has been developed for the analysis of time series for stock market and provides data frames with tabular structure allowing access to rows, columns, and cells. The installation routine of Python and the modules depend upon the operating system. There is an excellent documentation tackling this task: https://pandas.pydata.org/pandas-docs/stable/install.html. For further reading, checkout the PyData website (https://pandas.pydata.org).

Our script-based approach features some important advantages:

• Few commands only
• Less error prone than manual spreadsheet work
• Very efficient

Let’s get started. At the beginning we have to import the required modules.

In :
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Example: Black-box of a thermo-hydraulic system

### Story

We have a thermo-hydraulic system without any knowledge of its internal structure. Therefore, we plug it into a test bench, connect supply and return of water, and the electrical power cable. Moreover, we measure a couple of quantities, such as:

• Flow rate Q
• Pressure difference between inlet and outlet Δp
• Temperatures of inlet T1 and outlet T2
• Electrical power P Scheme for a black-box thermohydraulic system

In order to acquire operation data of the system, the following work flow is carried out:

• Adjusting the flow rate to the desired value
• Measuring the physical quantities
• Storing data into CSV file

### Reading data from a CSV file

The data is stored in a CSV file that actually uses spaces as column separators. Loading a file and creating a data frame df is as easy as:

In :
df = pd.read_csv('measurement_1.csv', sep=' ')


A data frame is an indexed matrix-like structure with entries that can be addressed with their row (index) and column (title).

### What is stored in the data frame?

There are several options to list the entities stored in a data frame. These are:

• df: showing all entries (for large data sets, the beginning and end only)
• df.head(): showing the first 5 rows (any other value can be provided as argument)
• df.tail(): showing the last 5 rows (any other value can be provided as argument)
In :
df.head()

Out:
Index t/s T1/C T2/C Tu/C pu/Pa fu/% P/W Q/(m^3/s) dp/Pa u/(m/s)
0 0 21.97 21.97 19.70 99979.55 60.14 999.70 0.000017 28520.99 0.844
1 1 21.40 22.43 20.08 99996.69 59.59 999.69 0.000017 29131.00 0.856
2 2 21.46 23.42 19.97 100001.17 60.54 1000.18 0.000017 29720.75 0.862
3 3 21.86 24.74 19.68 99964.14 58.92 999.44 0.000017 28927.89 0.849
4 4 22.30 25.89 19.60 99993.45 60.38 999.87 0.000017 29111.47 0.858

N.b.: The table head is determined automatically from the CSV file. It could have been also redefined in the pd.read_csv() command.

### How do I get the full statistics?

The statistics of a data frame can be easily obtained by calling the method describe(). This provides the number of valid entries, i.e., non-NaN values, the mean value, the standard deviation, and various quantiles of the sample of each column. By the way, the method describe() can also be applied to non-numeric columns, where the number of entries is returned.

In :
df.describe()

Out:
t/s T1/C T2/C Tu/C pu/Pa fu/% P/W Q/(m^3/s) dp/Pa u/(m/s)
count 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 151.000000 1.510000e+02 151.000000 151.000000
mean 75.000000 21.940662 34.973311 20.019073 99996.339007 60.044437 999.939404 1.668497e-05 28727.719603 0.849748
std 43.734045 0.209799 2.892432 0.212632 26.095390 0.636769 0.959258 1.841585e-07 605.813018 0.009424
min 0.000000 21.280000 21.970000 19.520000 99934.530000 58.510000 997.600000 1.616100e-05 27035.320000 0.823000
25% 37.500000 21.800000 35.345000 19.870000 99980.325000 59.565000 999.230000 1.656900e-05 28337.590000 0.844000
50% 75.000000 21.920000 36.060000 20.020000 99996.290000 60.050000 999.930000 1.669400e-05 28758.160000 0.850000
75% 112.500000 22.085000 36.330000 20.165000 100015.625000 60.540000 1000.585000 1.681100e-05 29152.945000 0.856000
max 150.000000 22.360000 37.100000 20.550000 100071.300000 61.880000 1002.320000 1.731500e-05 30911.540000 0.882000

### How do I plot data of a particular column?

Reading data in a tabulated manner is very often not intuitive. For this purpose, the plot() method of a data frame can be employed. Here, we would like to see the evolution of the temperature T2 over time within the limits of [22,38] and save the resulting figure to a PDF file. The figure file type can be also any other common one, such as (E)PS, PNG, JPG, etc.

In :
df['T2/C'].plot(grid=True, ylim=(22,38))
plt.savefig('t2.pdf') Evolution of the outlet temperature

### How do I find the steady-state?

#### Calculating the running mean value

The next step in our analysis is to consider the data only when the system is in a steady-state, i.e., the thermodynamic quantities do not vary more than the noise of the signal. For this purpose, we evaluate the mean value of a particular time interval just before the time under consideration. This feature is implemented with the method rolling() which accepts several arguments. Here we define the length to be 20 entries and the minimum length to be 20, too. This is important for the behaviour at the beginning. Just try on your own, what happens if you set min_periods to other values. Afterwards, the method mean() is applied, which calculates the mean value of all quantities in this interval. The results are stored in a new data frame dff.

In :
dff=df['T2/C'].rolling(20,min_periods=20).mean()


The following plot shows that the first 20 entries (0…19 s) are without data, which is not quite surprising as the default behaviour of the method rolling() is to store its value at the end of the interval. It can be observed that the running mean is more or less constant from times greater than 70 s in this particular example. However, we do not know this a priori.

In :
dff.plot(grid=True, ylim=(22,38)) Evolution of the running mean of the outlet temperature

#### Calculating with columns

Now, we want to use the running mean data frame dff for determining the data when the system is in steady-state. Therefore, we calculate the difference of the present temperature and its running mean value.

We see that the difference is less than 1 K for most of the values from times greater than 40 s.

In :
diff=(df['T2/C']-dff)
diff.plot(grid=True)
plt.savefig('t2_diff.pdf') Evolution of the temperature difference between the outlet temperature and the coinciding running mean value

#### Creating a filter

We can now create a filter that stores the result of an inequality as a Boolean array. Here, we check if the temperature difference is less than 2.5% of the coinciding temperature. Again, there is no output, since the result is assigned to a variable.

In :
fil = diff < 0.025*df['T2/C']


The output of the filter variable fil is the Boolean array. It can be observed that the first and the last thirty entries are shown in order to keep the list comprehensible.

In :
fil

Out:
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
[40 lines more]
141     True
142     True
143     True
144     True
145     True
146     True
147     True
148     True
149     True
150     True
Name: T2/C, Length: 151, dtype: bool

#### Using filtered values only

The filter fil can be applied to the original data frame df by employing the squared bracket. The method describe() reveals the statistics of the filtered data frame, i.e., only of those rows for which the filter is true. Compared to Out we see that roughly 40 data points are not considered. Thus, the mean value of T2/C is now slightly higher.

In :
df[fil].describe()

Out:
t/s T1/C T2/C Tu/C pu/Pa fu/% P/W Q/(m^3/s) dp/Pa u/(m/s)
count 110.000000 110.000000 110.000000 110.000000 110.000000 110.000000 110.000000 1.100000e+02 110.000000 110.000000
mean 95.445455 21.945727 36.206273 20.029636 99997.141545 60.086636 999.903545 1.668840e-05 28714.969091 0.849936
std 31.992140 0.198256 0.343078 0.200842 26.690450 0.632201 0.968197 1.875631e-07 635.548513 0.009619
min 39.000000 21.550000 35.240000 19.520000 99939.890000 58.510000 997.600000 1.626500e-05 27035.320000 0.828000
25% 68.250000 21.800000 36.000000 19.890000 99981.295000 59.572500 999.150000 1.657100e-05 28335.310000 0.844000
50% 95.500000 21.920000 36.240000 20.030000 99995.845000 60.080000 999.940000 1.671050e-05 28755.000000 0.851000
75% 122.750000 22.090000 36.430000 20.170000 100017.955000 60.540000 1000.577500 1.680975e-05 29087.127500 0.856000
max 150.000000 22.360000 37.100000 20.550000 100071.300000 61.880000 1002.230000 1.731500e-05 30911.540000 0.882000

If not the entire statistics is required, but some particular values only, they can be addressed with the methods min(), max(), mean(), std(), median() and quantile(x), where $x\in[0,1]$. Here, the mean values are calculated as follows:

In :
df[fil].mean()

Out:
t/s             95.445455
T1/C            21.945727
T2/C            36.206273
Tu/C            20.029636
pu/Pa        99997.141545
fu/%            60.086636
P/W            999.903545
Q/(m^3/s)        0.000017
dp/Pa        28714.969091
u/(m/s)          0.849936
dtype: float64

Finalising the analysis, we plot the temperature T2 for the filtered values only. We can see that temperature indeed is quite constant, beside of the noise.

In :
df.loc[fil,['T2/C']].plot(grid=True, ylim=(22,38))
plt.savefig('t2_filtered.pdf') Evolution of the outlet temperature; filtered to steady-state conditions

## Summary – What have we learned?

Summarising this post, we can make the following statements:

• Data analysis is easy and very efficient with Python and pandas.
• Matplotlib is a convenient tool to produce nice figures.
• Results can easily be exportet to HTML, PDF and other formats.

Therefore, I strongly advice to give Python + pandas a try and start today!

#Python #PyData #DataAnalysis #pandas #futureofcompute @cloudandheat https://cloudandheat.com

Dr.-Ing. Andreas Hantsch CFD & AI Engineer, Tech Enthusiast, Speaker, Author at Cloud&Heat.