Let’s Do Digital Team

- Pandas
- NumPy
- SciPy

`Pandas` is a library for working with `tabular` data in Python.

- Tabular data is data that is stored in rows and columns, like in a spreadsheet.
- The 2-dimensional data is stored and manipulated in `dataframes`.

- Great for working with spreadsheets or databases.
- Widely used in data science.

- Load data from a file.
- View data.
- Edit data.
- Filter data.
- Save data back to a file.

- Easy to learn and very useful.
- Works well with big datasets.
- Helps you clean and analyse data.
- A key tool for data analysis in Python.

`NumPy` is a Python library for working with datasets using `NumPy arrays` of varying dimensions.

- Arrays are like Python `lists` but faster and more powerful.
- Great for mathematical and scientific calculations.
- Core tool in data science and machine learning.

**NumPy Array**: A grid of values (1D, 2D, or more).

- Efficient for storing and working with lots of data.
- NumPy makes mathematical operations fast and easy.
- Use NumPy for calculations across whole arrays all at once.
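
To illustrate a calculation across a whole array at once, here is a minimal sketch. The variable names and values are made up for this example.

```python
import numpy as np

# Made-up height readings in centimetres (illustrative values only).
heights_cm = np.array([150.0, 160.0, 175.0])

# One expression converts every element; no explicit loop is needed.
heights_m = heights_cm / 100

print(heights_m)          # element-wise result
print(heights_cm.mean())  # a summary over the whole array
```

The same division on a plain Python list would need a loop or a list comprehension.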

- Very fast and efficient for working with numbers.
- Easy to perform complex calculations.
- Used in data analysis, machine learning, and more.
- Essential for handling large datasets.

- SciPy is a Python library for scientific and technical computing.
- It builds on NumPy and provides more advanced functions.
- It can carry out complex mathematical operations and statistics, e.g. linear algebra and p-value calculations.

Output:

```
5
5
```

- Import the pandas library.
- Load data from a file into a `DataFrame`.
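
A minimal sketch of these two steps, assuming the data is in a CSV file. The filename, columns and values are hypothetical; the example writes its own small file first so that it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny, made-up CSV so there is a file to load.
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "patient_data.csv"  # hypothetical filename
csv_path.write_text(
    "Patient_ID,Age,Diagnosis\n"
    "1,45,Hypertension\n"
    "2,62,Diabetes\n"
)

# Load the file into a DataFrame.
patient_data = pd.read_csv(csv_path)
print(patient_data)
```

In practice you would skip the file-writing part and pass the path of your own file to `pd.read_csv`.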

- Use the `loc` (location) indexer.
- Use `dataframe_name['column name']` to select a column.
- Search for a specific value in this column using `==` and the value.
- In the second `loc` argument, specify the column to update.
- Update the value as needed.
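
The steps above can be sketched as follows. The DataFrame, its column names and its values are assumptions made for this example.

```python
import pandas as pd

# Made-up data; the column names are assumptions for this example.
patient_data = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo"],
    "Age": [45, 62, 70],
})

# First loc argument: which rows (here, rows where Name equals "Ben").
# Second loc argument: which column to update.
patient_data.loc[patient_data["Name"] == "Ben", "Age"] = 63

print(patient_data)
```

Only the matching row changes; every other row keeps its original value.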

- Use `iloc` (integer location) to select rows and columns by position.
- Use `dataframe_name.iloc[row_number, column_number]`.
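
A short sketch of selecting and updating by position, using made-up data. Note that positions are counted from 0.

```python
import pandas as pd

# Made-up data for illustration.
patient_data = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo"],
    "Age": [45, 62, 70],
})

# Row 1 is the second row and column 1 is the second column ("Age").
age_of_second_patient = patient_data.iloc[1, 1]
print(age_of_second_patient)

# iloc can also be used to assign a value by position.
patient_data.iloc[0, 1] = 46
print(patient_data)
```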

- A filter is a condition to select rows from a dataframe.
- Use the `[]` operator to filter rows.
- Use a condition (e.g. more than, `>`) to filter rows.
- You can then save this as a new dataframe.
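
As a sketch of filtering with a condition, assuming a made-up DataFrame with an `Age` column:

```python
import pandas as pd

# Made-up data for illustration.
patient_data = pd.DataFrame({
    "Patient_ID": [1, 2, 3],
    "Age": [45, 62, 70],
})

# The condition gives True or False for each row;
# the [] operator keeps only the rows where it is True.
over_60 = patient_data[patient_data["Age"] > 60]

print(over_60)
```

Here `over_60` is a new dataframe, and `patient_data` itself is unchanged.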

- You can get some basic statistics from a Pandas dataframe.

Output:

```
       Patient_ID        Age  Cholesterol  Glucose Level
count    6.000000   6.000000     6.000000       6.000000
mean     3.333333  56.666667   208.333333     107.166667
std      1.632993   9.309493    20.412415      22.003788
min      1.000000  45.000000   180.000000      90.000000
25%      2.250000  51.250000   200.000000      95.750000
50%      3.500000  55.000000   205.000000      99.000000
75%      4.750000  62.500000   217.500000     107.500000
max      5.000000  70.000000   240.000000     150.000000
```
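
A summary with this layout (count, mean, std, min, quartiles, max) is what `DataFrame.describe()` returns for numeric columns. The sketch below uses made-up values, so its numbers will not match the output above.

```python
import pandas as pd

# Made-up data; not the dataset used for the output above.
patient_data = pd.DataFrame({
    "Age": [45, 55, 70],
    "Cholesterol": [180, 205, 240],
})

# describe() summarises every numeric column.
summary = patient_data.describe()
print(summary)

# Individual statistics can be read from the summary by label.
print(summary.loc["mean", "Age"])
```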

- First, define the filtering condition.
- Next, state which column you want to return.

pandas_split.py

`patient_data_2['Diagnosis'] == 'Hypertension'` finds all rows where the diagnosis is **hypertension**.

`['Cholesterol']` returns **only the cholesterol values** for the rows that are filtered by the above query.
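
Putting the two pieces together might look like the sketch below. The contents of `patient_data_2` here are made up, and the helper names `is_hypertensive` and `hypertension_cholesterol` are introduced only for this example.

```python
import pandas as pd

# Made-up stand-in for patient_data_2.
patient_data_2 = pd.DataFrame({
    "Diagnosis": ["Hypertension", "Diabetes", "Hypertension"],
    "Cholesterol": [240, 180, 205],
})

# Step 1: the filtering condition (True for each hypertension row).
is_hypertensive = patient_data_2["Diagnosis"] == "Hypertension"

# Step 2: keep the matching rows, then return only the Cholesterol column.
hypertension_cholesterol = patient_data_2[is_hypertensive]["Cholesterol"]

print(hypertension_cholesterol)
```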

- Import the NumPy library.
- Create an array from a Python list (e.g. `[1, 2, 5, 6]`).
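
A minimal sketch of these two steps. The variable name `my_array` is chosen for this example.

```python
# Import NumPy under its conventional short name.
import numpy as np

# Create an array from a Python list.
my_array = np.array([1, 2, 5, 6])

print(my_array)
print(type(my_array))
```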

- The shape of a NumPy array tells you how many elements are in each dimension.
- Use the `shape` attribute to find the shape of an array.

Output:

`(2, 3)`
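
One array that has this shape is a grid with 2 rows and 3 columns. The values below are made up for this sketch.

```python
import numpy as np

# A 2D array: 2 rows, each with 3 values.
grid = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

print(grid.shape)  # (2, 3): 2 rows, 3 columns
print(grid.ndim)   # number of dimensions
```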

- Much like normal Python lists.

Output:

`[10 2 2 4 1 1 7]`

- Here we can use the stats module.
- An example is calculating the t-score and p-value using `stats.ttest_ind`.
- Let’s say we have two groups of patients: one with hypertension and one without.

scipy_stats.py

Output:

```
t-score: 2.5
p-value: 0.05
```

**Note:**

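- The t-score (or t-statistic) measures the difference between the two group means in units of standard error. The further it is from 0, the stronger the evidence that the groups differ.
- By convention, a p-value of less than 0.05 is considered statistically significant.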
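
A sketch of how such a comparison could be run with `stats.ttest_ind`. The two lists hold made-up measurements for illustration, so the printed numbers will differ from the output shown above.

```python
from scipy import stats

# Made-up measurements for two groups of patients (illustrative only).
hypertension = [150, 160, 165, 170, 155]
no_hypertension = [120, 125, 130, 118, 122]

# Independent two-sample t-test: returns the t-score and the p-value.
t_score, p_value = stats.ttest_ind(hypertension, no_hypertension)

print(f"t-score: {t_score:.2f}")
print(f"p-value: {p_value:.4f}")
```

With groups this far apart the p-value falls well below 0.05, so the difference would be called statistically significant.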

- Go to the Lesson 2 folder.
- Open `lesson_2.ipynb`.
- Don’t forget to ask your tutor if you need help.
- See you in 40 minutes.