MENU

[AI from Scratch] Episode 216: Reading and Saving Data — Handling CSV, Excel, JSON Files

TOC

Recap and Today’s Theme

Hello! In the previous episode, we learned the basics of the Pandas library for data manipulation. We explored how Pandas makes it easy to work with data frames for tasks like filtering, aggregation, and modification.

Today, we’ll discuss how to read and save various data formats (e.g., CSV, Excel, JSON) using Pandas. Data input and output are fundamental skills in data analysis and AI development. Let’s build efficient data handling skills!

Benefits of Using Pandas for Data Handling

Pandas offers several advantages for working with various data formats:

  1. Simple Commands: Reading and saving data can be done with a single line of code.
  2. Support for Multiple File Formats: Supports CSV, Excel, JSON, SQL, and many other file types.
  3. Integration with DataFrame Operations: The data you read can be directly used as a Pandas DataFrame, making analysis and visualization immediate.

Let’s explore the specific methods for handling different types of files.

1. Reading and Saving CSV Files

Reading a CSV File

CSV (Comma-Separated Values) files are the most common format for storing tabular data. You can read them easily with the read_csv function in Pandas.

import pandas as pd

# Reading a CSV file
df = pd.read_csv('data.csv')

# Displaying the data
print(df.head())
  • read_csv(): Reads the CSV file specified by the file path. The head() method displays the first five rows of the data.

Setting Options

The read_csv function offers various options that can be adjusted based on the data format.

  • header: Specifies the row containing column names (default is 0).
  • index_col: Specifies a column to use as the index.
  • usecols: Selects only the specified columns to read.
  • na_values: Defines values that should be considered as missing data.
# Using the 'ID' column as the index and treating 'N/A' as missing data
df = pd.read_csv('data.csv', index_col='ID', na_values='N/A')

Saving a CSV File

To save data in CSV format, use the to_csv function.

# Saving the DataFrame as a CSV file
df.to_csv('output.csv', index=False)
  • to_csv(): Saves the DataFrame to a CSV file.
  • index=False: Excludes the index when saving.

2. Reading and Saving Excel Files

Pandas can also handle Excel files easily with the read_excel function.

Reading an Excel File

# Reading an Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Displaying the data
print(df.head())
  • read_excel(): Reads an Excel file. You can specify a specific sheet using the sheet_name parameter.

Reading Multiple Sheets

If an Excel file contains multiple sheets, you can load all of them as a dictionary.

# Reading all sheets as a dictionary
sheets = pd.read_excel('data.xlsx', sheet_name=None)

# Accessing a specific sheet
sheet1_df = sheets['Sheet1']

Saving an Excel File

To save data in Excel format, use the to_excel function.

# Saving the DataFrame as an Excel file
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
  • to_excel(): Saves the DataFrame to an Excel file.
  • sheet_name: Specifies the name of the sheet to save.
  • index=False: Excludes the index when saving.

3. Reading and Saving JSON Files

JSON (JavaScript Object Notation) is a widely used format for structured data. You can read JSON files using the read_json function.

Reading a JSON File

# Reading a JSON file
df = pd.read_json('data.json')

# Displaying the data
print(df.head())
  • read_json(): Reads a JSON file into a DataFrame. JSON data is often read as a list of dictionaries.

Saving a JSON File

To save a DataFrame as a JSON file, use the to_json function.

# Saving the DataFrame as a JSON file
df.to_json('output.json', orient='records', lines=True)
  • to_json(): Saves the DataFrame as a JSON file.
  • orient=’records’: Saves each row as a dictionary.
  • lines=True: Saves the JSON file with each row on a separate line.

4. Other File Formats

Pandas supports various other file formats.

Reading and Saving SQL Databases

You can connect directly to an SQL database and read/write data using Pandas. For example, to connect to an SQLite database:

import sqlite3

# Connecting to the database
conn = sqlite3.connect('database.db')

# Reading data from an SQL table
df = pd.read_sql_query("SELECT * FROM table_name", conn)

# Saving the DataFrame to a new table
df.to_sql('new_table', conn, if_exists='replace', index=False)

# Closing the connection
conn.close()
  • read_sql_query(): Executes an SQL query and reads the result into a DataFrame.
  • to_sql(): Saves the DataFrame as a new SQL table.

Reading HTML Files

Pandas can also extract table data from HTML files.

# Reading table data from an HTML file
tables = pd.read_html('https://example.com/data.html')

# Accessing the first table as a DataFrame
df = tables[0]
print(df.head())
  • read_html(): Extracts tables from an HTML page and reads them as DataFrames.

Summary

In this episode, we explained how to read and save various types of data using Pandas. Handling data input and output is fundamental in data analysis and AI development. Pandas makes it efficient to work with CSV, Excel, JSON, and SQL data, streamlining data preparation.

Next Episode Preview

Next time, we will discuss data visualization using Matplotlib. We’ll learn how to create graphs and plots to visually represent the data loaded through Pandas!


Annotations

  • CSV: Stands for Comma-Separated Values, a format for storing tabular data separated by commas.
  • JSON: Stands for JavaScript Object Notation, a widely used format for structured data.
  • SQL: Stands for Structured Query Language, used for managing and retrieving data from databases.
Let's share this post !

Author of this article

株式会社PROMPTは生成AIに関する様々な情報を発信しています。
記事にしてほしいテーマや調べてほしいテーマがあればお問合せフォームからご連絡ください。
---
PROMPT Inc. provides a variety of information related to generative AI.
If there is a topic you would like us to write an article about or research, please contact us using the inquiry form.

Comments

To comment

TOC