
[AI from Scratch] Episode 215: Introduction to Pandas — Basics of the Data Manipulation Library


Recap and Today’s Theme

Hello! In the previous episode, we explored the basics of NumPy, a library for high-speed numerical computation in Python. NumPy enables efficient operations like array manipulation, matrix calculations, and statistical computations.

Today, we will learn the basics of Pandas, an essential library for data analysis. Pandas simplifies data manipulation and processing, especially for tabular data like CSV or Excel files. Let’s explore the fundamental operations of Pandas!

What Is Pandas?

Pandas is a Python library for data analysis that makes it easy to manipulate, process, and plot data. It is built around the DataFrame, a structure for working efficiently with tabular data, which makes it an indispensable tool for data scientists and AI engineers.

Main Features of Pandas

  1. DataFrame: A data structure composed of rows and columns that allows for easy manipulation of tabular data.
  2. Easy Data Reading/Writing: Supports various data formats such as CSV, Excel, SQL, and JSON, making it easy to read and save data.
  3. Data Manipulation and Aggregation: Provides a range of operations for filtering, grouping, and aggregating data, essential for data analysis.

Installing Pandas

Pandas is included in Anaconda by default, so usually, there’s no need for a separate installation. However, if you want to install it manually, you can use the following command:

pip install pandas

Now, let’s dive into the basic operations of Pandas.

Basic Operations in Pandas

1. Importing Pandas

First, you need to import Pandas. By convention, it is imported as pd.

import pandas as pd

2. Creating a DataFrame

The core data structure in Pandas is the DataFrame, which is designed to handle tabular data with labeled rows and columns. Let’s create a simple DataFrame:

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print(df)

This code creates a DataFrame from a Python dictionary, and the output looks like this:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston
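
A dictionary of columns is only one possible input. As a minimal sketch, assuming the data arrives row by row instead, the same constructor also accepts a list of dictionaries. The variable names records and df_records are made up for this illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names.
# (records and df_records are illustrative names, not from the article.)
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
]
df_records = pd.DataFrame(records)

# Basic structural attributes of the DataFrame
print(df_records.shape)          # (number of rows, number of columns)
print(list(df_records.columns))  # column labels
print(df_records.dtypes)         # data type of each column
```

The shape, columns, and dtypes attributes are a quick way to confirm that the DataFrame was built the way you expected.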

3. Reading Data

Pandas makes it easy to read various data formats. For example, to read a CSV file:

df = pd.read_csv('data.csv')

Similarly, you can use the read_excel and read_json functions to read Excel and JSON files. Note that read_excel relies on an additional package such as openpyxl to read .xlsx files.
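
If you do not have a data.csv file at hand, the following sketch may help you try read_csv anyway. It assumes some made-up CSV text and uses io.StringIO from the standard library as a stand-in for a file on disk. The names csv_text, df_csv, and csv_out are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV content; io.StringIO stands in for a file on disk.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
"""
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# Convert back to CSV text; index=False leaves out the row labels.
# Calling to_csv without a path returns the CSV as a string.
csv_out = df_csv.to_csv(index=False)
print(csv_out)
```

With a real file, you would pass the file path to read_csv and to_csv instead of the in-memory text.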

4. Exploring Data

Pandas provides several methods for inspecting DataFrames:

  • head(): Returns the first 5 rows by default.
  • tail(): Returns the last 5 rows by default.
  • info(): Prints a summary of the DataFrame (column data types, non-null counts, memory usage).
  • describe(): Returns basic statistics (count, mean, standard deviation, min, max, etc.) for numeric columns.

print(df.head())        # Displays the first 5 rows
df.info()               # Prints DataFrame information (info() prints directly and returns None)
print(df.describe())    # Shows statistical information
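
The sketch below shows a few variations on these methods, assuming a small sample DataFrame named df_demo (a name chosen only for this example). head and tail accept the number of rows to return, and the result of describe can be indexed like any other pandas object.

```python
import pandas as pd

# Small sample data for illustration (df_demo is a made-up name).
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

print(df_demo.head(2))   # first 2 rows instead of the default 5
print(df_demo.tail(1))   # last row only
print(df_demo.shape)     # (4, 2): 4 rows, 2 columns

# describe() on a single column returns a Series of statistics
age_stats = df_demo['Age'].describe()
print(age_stats['mean'])  # 32.5
```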

5. Selecting and Filtering Data

Pandas allows you to easily select specific rows or columns and filter data based on conditions.

Selecting a Specific Column

# Selecting the 'Name' column
names = df['Name']
print(names)

Selecting Multiple Columns

# Selecting the 'Name' and 'Age' columns
subset = df[['Name', 'Age']]
print(subset)

Filtering Rows Based on Conditions

# Selecting rows where 'Age' is 30 or above
filtered_df = df[df['Age'] >= 30]
print(filtered_df)
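
Conditions can also be combined. The following is a minimal sketch, assuming the same sample data under the illustrative name df_people. Each condition needs its own parentheses, and pandas uses & for "and" and | for "or" rather than the Python keywords.

```python
import pandas as pd

# Same sample data as above, under an illustrative name.
df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# AND: wrap each condition in parentheses and join with &
in_thirties = df_people[(df_people['Age'] >= 30) & (df_people['Age'] < 40)]
print(in_thirties)

# OR: join with |
young_or_houston = df_people[(df_people['Age'] < 30) | (df_people['City'] == 'Houston')]
print(young_or_houston)

# loc selects rows by condition and a column by label in one step
names_30_plus = df_people.loc[df_people['Age'] >= 30, 'Name']
print(names_30_plus.tolist())  # ['Bob', 'Charlie', 'David']
```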

6. Modifying Data (Adding, Deleting, Updating)

You can easily add, delete, and update data within a DataFrame.

Adding a Column

# Adding a new column 'Salary'
df['Salary'] = [50000, 60000, 70000, 80000]
print(df)

Deleting a Row

# Deleting the row with index 1 (Bob's row)
df = df.drop(1)
print(df)

Updating Values

# Updating the 'Age' value for index 0 (Alice)
df.at[0, 'Age'] = 26
print(df)
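
A few related operations come up just as often: deriving a column from an existing one, updating every row that matches a condition, and deleting a column. The sketch below assumes a small DataFrame named df_mod and a hypothetical 'Bonus' column, both invented for this example.

```python
import pandas as pd

# Sample data for illustration (df_mod and 'Bonus' are made-up names).
df_mod = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Add a column computed from an existing column
df_mod['Bonus'] = df_mod['Salary'] // 10

# Update every row that matches a condition
df_mod.loc[df_mod['Age'] >= 30, 'Bonus'] = 8000
print(df_mod)

# Delete a column; drop returns a new DataFrame and leaves df_mod unchanged
df_no_age = df_mod.drop(columns=['Age'])
print(df_no_age)
```

Because drop returns a new DataFrame, you need to assign the result to a variable, as the article does with df = df.drop(1).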

7. Grouping and Aggregating Data

Pandas offers powerful tools for grouping and aggregating data. The groupby method splits the rows into groups based on the values in a specific column, and you can then apply an aggregation such as mean, sum, or count to each group.

# Calculating the average age for each city
average_age = df.groupby('City')['Age'].mean()
print(average_age)
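
In the sample data every city appears only once, so each group contains a single row. As a rough sketch of what grouping looks like with repeated values, the example below assumes made-up data named df_group and computes several aggregates at once with agg.

```python
import pandas as pd

# Made-up data in which cities repeat, so each group has several rows.
df_group = pd.DataFrame({
    'City': ['Chicago', 'Chicago', 'Houston', 'Houston', 'Houston'],
    'Age': [20, 30, 40, 50, 60],
})

# agg applies several aggregation functions to each group in one call
summary = df_group.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by city, with one column per aggregation.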

8. Handling Missing Values

Data often contains missing values, and Pandas provides methods to handle them:

  • isna(): Returns True for each missing value and False otherwise.
  • dropna(): Removes rows that contain missing values (the default behavior).
  • fillna(): Replaces missing values with a specified value.

# Removing rows with missing values
df_cleaned = df.dropna()

# Filling missing values with 0
df_filled = df.fillna(0)
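
The sample DataFrame in this article has no missing values, so the calls above leave it unchanged. The sketch below assumes a small DataFrame named df_missing with one deliberately missing age, to show what each method actually does. Filling with the column mean is just one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing age (np.nan marks the missing value).
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
})

# Count missing values per column
missing_counts = df_missing.isna().sum()
print(missing_counts)

# Drop rows that contain any missing value
df_dropped = df_missing.dropna()
print(df_dropped)

# Fill missing ages with the column mean (mean() skips NaN by default)
df_mean_filled = df_missing.fillna({'Age': df_missing['Age'].mean()})
print(df_mean_filled)
```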

Practical Example with Pandas

Pandas is often used together with visualization libraries such as Matplotlib, and for preprocessing data before machine learning. Here’s a simple visualization example:

Data Visualization

Using Pandas and Matplotlib, you can easily create visualizations like bar charts:

import matplotlib.pyplot as plt

# Displaying the average age by city as a bar chart
average_age.plot(kind='bar')
plt.title('Average Age by City')
plt.xlabel('City')
plt.ylabel('Average Age')
plt.show()

Running this code displays a bar chart showing the average age for each city, providing a visual representation of the data.

Summary

In this episode, we covered the basics of Pandas, a powerful library for data manipulation. Pandas simplifies data loading, processing, filtering, and aggregation, making it an indispensable tool for data analysis and AI development. Pandas is built on top of NumPy, so the two libraries work well together.

Next Episode Preview

Next time, we will discuss how to read and save data using Pandas, exploring methods for handling CSV, Excel, and JSON formats to gain practical data manipulation skills!


Annotations

  • DataFrame: The core data structure in Pandas, designed for efficiently handling tabular data.
  • Grouping: Splitting rows into groups that share the same value in one or more columns, using the groupby method, so that each group can be aggregated separately.

Author of this article

PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.
