Manipulating Data

Let’s Do Digital Team

Python data libraries

  • Pandas
  • NumPy
  • SciPy

Pandas

What is Pandas?

  • Pandas is a library for working with tabular data in Python.
  • Tabular data is data that is stored in rows and columns, like in a spreadsheet.
  • The 2-dimensional data is stored and manipulated in dataframes.

What is Pandas?

  • Great for working with spreadsheets or databases.
  • Widely used in data science.

Common Pandas Tasks

  • Load data from a file.
  • View data.
  • Edit data.
  • Filter data.
  • Save data back to a file.

Why Use Pandas?

  • Easy to learn and very useful.
  • Works well with large datasets, as long as they fit in memory.
  • Helps you clean and analyse data.
  • A key tool for data analysis in Python.

NumPy

What is NumPy?

  • NumPy is a Python library for working with numerical data stored in arrays of one or more dimensions.
  • Arrays are like Python lists, but faster for numerical work because every element has the same data type.
  • Great for mathematical and scientific calculations.
  • Core tool in data science and machine learning.

Key Concepts in NumPy

  • NumPy Array: A grid of values (1D, 2D, or more).
  • Efficient for storing and working with lots of data.
  • NumPy makes mathematical operations fast and easy.
  • Use NumPy for calculations across whole arrays all at once.
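
As a minimal sketch of calculating across a whole array at once, the example below converts a few weights from kilograms to pounds without writing a loop. The array name and values are made up for illustration.

```python
import numpy as np

# Hypothetical example data: weights in kilograms for four patients
weights_kg = np.array([60.0, 72.5, 81.0, 95.5])

# One expression converts every element; no loop is needed
weights_lb = weights_kg * 2.20462

# Aggregations also work on the whole array at once
mean_weight_kg = weights_kg.mean()

print(weights_lb)
print(mean_weight_kg)  # 77.25
```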

Why Use NumPy?

  • Very fast and efficient for working with numbers.
  • Easy to perform complex calculations.
  • Used in data analysis, machine learning, and more.
  • Essential for handling large datasets.

SciPy

SciPy

  • SciPy is a Python library for scientific and technical computing.
  • It builds on NumPy and provides more advanced functions.
  • It can carry out complex mathematical operations and statistics, e.g. linear algebra and p-value calculations.

Let’s Code!

Printing your variables

pandas.py
a_variable = 5

print(a_variable)

# or, if it is the last expression in a notebook cell

a_variable


Output:

    5
    5

Pandas

Load Data with Pandas

  • Import the pandas library.
  • Load data from a file into a DataFrame.
pandas.py
import pandas as pd
kidney_function_dataframe = pd.read_csv('kidney_function.csv')
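
If you do not have a CSV file to hand, the sketch below shows the same loading step using text held in memory in place of a file. The column names and values are invented for illustration and are not the contents of kidney_function.csv.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; these columns and values are made up
csv_text = """Date,eGFR,Stage
2022-10-01,58,3
2022-11-01,44,3
2022-12-01,14,4
"""

# read_csv accepts a file path or any file-like object
example_dataframe = pd.read_csv(io.StringIO(csv_text))

print(example_dataframe.head())  # the first rows of the dataframe
print(example_dataframe.shape)   # (number of rows, number of columns)
```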

Update Data with loc

  • Use the loc (location) indexer. It takes square brackets, not parentheses: dataframe_name.loc[rows, column].
  • Use dataframe_name['column name'] to select a column.
  • Search for a specific value in this column using == and the value.
  • In the second loc argument, specify the column to update.
  • Update the value as needed.
pandas_update.py
patient_data.loc[patient_data['Date'] == '2022-12-01', 'Stage'] = 5
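
The same update can be sketched step by step on a small dataframe. The data here is made up, and the name is_december is just a label chosen for this example.

```python
import pandas as pd

# Made-up data for illustration only
patient_data = pd.DataFrame({
    'Date': ['2022-10-01', '2022-11-01', '2022-12-01'],
    'Stage': [3, 3, 4],
})

# Row selector: one True/False value per row
is_december = patient_data['Date'] == '2022-12-01'

# loc[rows, column] = new value
patient_data.loc[is_december, 'Stage'] = 5

print(patient_data)
```

Only the rows where the condition is True are changed; the other rows keep their original Stage.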

iloc

  • Use iloc (integer location) to select rows and columns by position.
  • Use dataframe_name.iloc[row_number, column_number].
pandas_iloc.py
patient_data.iloc[0, 0] = 22
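
A short sketch of reading and writing by position follows. Positions start at 0, and the dataframe contents are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only
patient_data = pd.DataFrame({
    'Age': [45, 52, 70],
    'Cholesterol': [180, 205, 240],
})

first_age = patient_data.iloc[0, 0]      # row 0, column 0
last_row = patient_data.iloc[-1]         # the last row, returned as a Series
first_two_rows = patient_data.iloc[0:2]  # rows 0 and 1 (the end position is excluded)

# Assigning through iloc changes the dataframe in place
patient_data.iloc[0, 0] = 22

print(first_age)
print(patient_data)
```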

Filtering DataFrames

  • A filter is a condition to select rows from a dataframe.
  • Use the [] operator to filter rows.
  • Use a condition (e.g. greater than, >) to filter rows.
  • You can then save this as a new dataframe.
pandas_filter.py
patients_over_55 = patient_data[patient_data['Age'] > 55]
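
Filters can also be combined. The sketch below, using made-up data, joins two conditions with & (and); | (or) works the same way. Each condition needs its own parentheses.

```python
import pandas as pd

# Made-up data for illustration only
patient_data = pd.DataFrame({
    'Age': [45, 52, 58, 70],
    'Cholesterol': [180, 205, 200, 240],
})

# Single condition, as in the example above
patients_over_55 = patient_data[patient_data['Age'] > 55]

# Two conditions combined with &
over_55_high_cholesterol = patient_data[
    (patient_data['Age'] > 55) & (patient_data['Cholesterol'] > 220)
]

print(patients_over_55)
print(over_55_high_cholesterol)
```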

Get some statistics

  • You can get some basic statistics from a Pandas dataframe.
numpy_stats.py
print(patient_data.describe())


Output:

           Patient_ID        Age  Cholesterol  Glucose Level
    count    6.000000   6.000000     6.000000       6.000000
    mean     3.333333  56.666667   208.333333     107.166667
    std      1.632993   9.309493    20.412415      22.003788
    min      1.000000  45.000000   180.000000      90.000000
    25%      2.250000  51.250000   200.000000      95.750000
    50%      3.500000  55.000000   205.000000      99.000000
    75%      4.750000  62.500000   217.500000     107.500000
    max      5.000000  70.000000   240.000000     150.000000

Refining the return results

  • First, define the filtering condition.
  • Next, state which column you want to return.
pandas_split.py
hypertension = patient_data_2[
    patient_data_2['Diagnosis'] == 'Hypertension'
]['Cholesterol']


  • patient_data_2['Diagnosis'] == 'Hypertension' finds all rows where the diagnosis is hypertension.
  • ['Cholesterol'] returns only the cholesterol values for the rows that are filtered by the above query.
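
The two steps can also be written as a single loc call. The sketch below, on made-up data, suggests that both forms return the same values.

```python
import pandas as pd

# Made-up data for illustration only
patient_data_2 = pd.DataFrame({
    'Diagnosis': ['Hypertension', 'Diabetes', 'Hypertension', 'None'],
    'Cholesterol': [240, 200, 220, 180],
})

is_hypertension = patient_data_2['Diagnosis'] == 'Hypertension'

# Filter the rows, then pick the column
chained = patient_data_2[is_hypertension]['Cholesterol']

# Rows and column given together to loc
with_loc = patient_data_2.loc[is_hypertension, 'Cholesterol']

print(chained.equals(with_loc))  # True
print(with_loc.mean())           # 230.0
```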

NumPy

Creating a NumPy Array

  • Import the NumPy library.
  • Create an array from a Python list (e.g. [1, 2, 5, 6]).
numpy.py
import numpy as np

array_1x7 = np.array([1, 2, 2, 4, 1, 1, 7])

array_2x7 = np.array([[1, 2, 3, 4, 5, 6, 7],
                      [8, 9, 10, 11, 12, 13, 14]])

array_3D = np.array([[[1, 2, 3, 4, 5, 6, 7],
                      [8, 9, 10, 11, 12, 13, 14]],
                     [[15, 16, 17, 18, 19, 20, 21],
                      [22, 23, 24, 25, 26, 27, 28]]])

Shapes

  • The shape of a NumPy array tells you how many elements are in each dimension.
  • Use the shape attribute to find the shape of an array.
numpy_shapes.py
array_2x3 = np.array([[1, 2, 3], [4, 5, 6]])

print(array_2x3.shape)


Output:

    (2, 3)
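
As a further sketch, the shape of a 1D array is a one-element tuple, and a 3D array has three entries in its shape. The arrays below are small examples made up for this purpose.

```python
import numpy as np

array_1d = np.array([1, 2, 2, 4, 1, 1, 7])
array_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                     [[7, 8, 9], [10, 11, 12]]])

print(array_1d.shape)  # (7,)
print(array_3d.shape)  # (2, 2, 3)
print(array_3d.ndim)   # 3, the number of dimensions

# reshape rearranges the same values into a new shape
reshaped = np.arange(6).reshape(2, 3)
print(reshaped.shape)  # (2, 3)
```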

Update NumPy Arrays

  • Update an element by its index, much like a normal Python list.
numpy_update.py
array_1x7 = np.array([1, 2, 2, 4, 1, 1, 7])
array_1x7[0] = 10
print(array_1x7)


Output:

    [10  2  2  4  1  1  7]
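
Beyond single elements, arrays can be updated several values at a time. This is a minimal sketch using a slice and a condition.

```python
import numpy as np

array_1x7 = np.array([1, 2, 2, 4, 1, 1, 7])

# Update a slice: positions 1 and 2 (the end position is excluded)
array_1x7[1:3] = 0

# Update every element that meets a condition
array_1x7[array_1x7 > 3] = 3

print(array_1x7)  # [1 0 0 3 1 1 3]
```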

SciPy

More advanced statistics with SciPy

  • SciPy's stats module provides statistical tests.
  • For example, stats.ttest_ind compares two independent groups and returns a t-score and a p-value.
  • Let’s say we have two groups of patients: one with hypertension and one without.
scipy_stats.py
from scipy import stats

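# hypertension holds the cholesterol values selected earlier;
# non_hypertension is assumed to be selected the same way for the other patients
t_score, p_value = stats.ttest_ind(hypertension, non_hypertension)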

print(f't-score: {t_score}')
print(f'p-value: {p_value}')

Output:

    t-score: 2.5
    p-value: 0.05
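
For a version that runs on its own, the sketch below uses two short lists of made-up cholesterol values in place of the dataframe columns. Because the data is invented, the printed numbers will differ from the output shown above.

```python
from scipy import stats

# Made-up cholesterol values for two groups of five patients
hypertension = [240, 220, 235, 250, 228]
non_hypertension = [200, 205, 190, 210, 198]

# ttest_ind accepts lists, NumPy arrays or pandas Series
t_score, p_value = stats.ttest_ind(hypertension, non_hypertension)

print(f't-score: {t_score:.2f}')
print(f'p-value: {p_value:.4f}')
```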

Statistics

Note:

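  • The t-score (or t-statistic) is the difference between the two group means, measured in units of the standard error of that difference. The further it is from 0, the stronger the evidence that the groups differ.
  • By convention, a p-value of less than 0.05 is considered statistically significant.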
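
To show where the t-score comes from, the sketch below calculates it by hand with NumPy and compares it with SciPy's result. It assumes the default form of stats.ttest_ind, which treats the two groups as having equal variance, and it uses made-up values.

```python
import numpy as np
from scipy import stats

# Made-up cholesterol values for two groups
group_a = np.array([240, 220, 235, 250, 228], dtype=float)
group_b = np.array([200, 205, 190, 210, 198], dtype=float)

n_a, n_b = len(group_a), len(group_b)

# Pooled variance: a weighted average of the two sample variances
pooled_variance = (
    (n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)
) / (n_a + n_b - 2)

# Standard error of the difference between the two means
standard_error = np.sqrt(pooled_variance * (1 / n_a + 1 / n_b))

# t-score: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / standard_error

t_scipy, p_value = stats.ttest_ind(group_a, group_b)

print(np.isclose(t_manual, t_scipy))  # True
is_significant = p_value < 0.05
print(is_significant)
```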

Now try it yourself!

  • Go to the Lesson 2 folder.
  • Open lesson_2.ipynb.
  • Don’t forget to ask your tutor if you need help.
  • See you in 40 minutes.