Pandas is a powerful and widely used Python library for data analysis and manipulation. One of its core structures is the DataFrame, which provides an efficient way to handle tabular data, similar to spreadsheets or SQL tables. In this article, we will explore what a Pandas DataFrame is, its key features, and how to work with it.

What is a Pandas DataFrame?

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure in Pandas. It consists of labeled rows and columns, where:

  • Rows represent individual records (like database rows).
  • Columns represent different data attributes (like fields in a table).
  • Each column has its own data type, so one DataFrame can mix integers, strings, floats, and so on across columns.

Most DataFrame columns are backed by NumPy arrays, and a DataFrame can be created from various data sources, including dictionaries, lists, CSV files, SQL databases, and JSON files.

Creating a Pandas DataFrame

To work with DataFrames, you need to install Pandas if you haven’t already:

pip install pandas

Then, import Pandas:

import pandas as pd

Creating a DataFrame from a Dictionary

One of the most common ways to create a DataFrame is using a dictionary:

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)
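
As a quick check on the claim that each column carries its own data type, the sketch below rebuilds the same DataFrame and inspects its shape and per-column dtypes. The exact dtype reported for the text columns can differ between Pandas versions, so no output is shown for it.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column; 'Age' is an integer column
```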

Creating a DataFrame from a List of Lists

You can also create a DataFrame from a list of lists:

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

Creating a DataFrame from a CSV File

To read data from a CSV file (here, a file named data.csv in the current working directory):

df = pd.read_csv('data.csv')
print(df.head())  # Displays the first 5 rows
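
If you do not have a data.csv file at hand, the following sketch may help. It assumes a small, made-up CSV held in a string (the names csv_text and df_csv are illustrative only) and passes it to read_csv through io.StringIO, which read_csv accepts in place of a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts any file-like object, not only a path.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head(2))  # first 2 rows instead of the default 5
```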

Common DataFrame Operations

Accessing Columns

To access a specific column:

print(df['Name'])

Accessing Rows

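Use .loc[] to select rows by index label and .iloc[] to select them by integer position:

print(df.loc[0])  # Row whose index label is 0 (the first row with the default index)
print(df.iloc[1])  # Row at position 1 (the second row), whatever its label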
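
With the default integer index, labels and positions coincide, which can hide the difference between the two accessors. The sketch below assumes a hypothetical DataFrame (df_idx) with string labels as its index, so that .loc[] and .iloc[] have to be addressed differently to reach the same row.

```python
import pandas as pd

# Hypothetical data with a string index instead of the default 0, 1, 2.
df_idx = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['alice', 'bob', 'charlie'],
)

by_label = df_idx.loc['bob']   # select by index label
by_position = df_idx.iloc[1]   # select by integer position (second row)

print(by_label)
print(by_position)
```

Both calls return the same row here, once found by label and once by position.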

Filtering Data

You can filter rows with a boolean condition; only the rows where the condition is True are kept:

filtered_df = df[df['Age'] > 28]
print(filtered_df)
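
Conditions can also be combined. The sketch below uses the sample data from earlier in the article and joins two conditions with & (and); | (or) and ~ (not) work the same way. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Older than 28 AND not living in Chicago.
mask = (df['Age'] > 28) & (df['City'] != 'Chicago')
result = df[mask]
print(result)
```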

Adding a New Column

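To add a new column, assign a list with one value per row (this example assumes the three-row DataFrame created earlier):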

df['Salary'] = [50000, 60000, 70000]
print(df)
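
Two related cases are worth a short sketch: a column computed from an existing one, and what happens when the assigned list has the wrong length. The column name AgeInMonths is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# A column derived from an existing one is computed element-wise.
df['AgeInMonths'] = df['Age'] * 12  # hypothetical column name

# A list with the wrong number of values is rejected with a ValueError.
try:
    df['Salary'] = [50000, 60000]  # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(df)
print(length_error)
```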

Deleting a Column

To remove a column, use drop(), which returns a new DataFrame rather than modifying the original:

df = df.drop(columns=['Salary'])
print(df)
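
Because drop() returns a new object, the original DataFrame keeps the column unless the result is assigned back to it. A minimal sketch, using a made-up variable name (trimmed) for the result:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

trimmed = df.drop(columns=['Salary'])  # new DataFrame without 'Salary'

print(list(df.columns))       # the original still has 'Salary'
print(list(trimmed.columns))  # the copy does not
```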

Sorting Data

Sort data by a column (here, by Age from highest to lowest):

df = df.sort_values(by='Age', ascending=False)
print(df)
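
sort_values() also accepts a list of columns, with one sort direction per column. The sketch below assumes a slightly larger, made-up dataset so that ties on City exist, and resets the index afterwards so the row labels follow the new order.

```python
import pandas as pd

# Hypothetical data with two people per city.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Chicago', 'Chicago', 'New York']
})

# City ascending, then Age descending within each city.
sorted_df = df.sort_values(by=['City', 'Age'], ascending=[True, False])

# drop=True discards the old index instead of keeping it as a column.
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```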

Grouping Data

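You can group data using groupby() and then aggregate each group. Because the DataFrame also contains text columns, restrict the mean to numeric columns; recent Pandas versions raise an error otherwise:

grouped_df = df.groupby('City').mean(numeric_only=True)  # Average of numeric columns per city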
print(grouped_df)
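
To compute several statistics at once, one option is to select a column and pass a list of aggregation names to agg(). The sketch below assumes a made-up dataset with two people per city so that the group means are not trivial.

```python
import pandas as pd

# Hypothetical data with two people per city.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Chicago', 'Chicago', 'New York']
})

# One row per city, one column per statistic.
summary = df.groupby('City')['Age'].agg(['mean', 'count'])
print(summary)
```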

Exporting Data

Save the DataFrame as a CSV file; index=False leaves the row index out of the file:

df.to_csv('output.csv', index=False)
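
To see what is written without touching the file system, note that to_csv() returns the CSV text as a string when no path is given. The sketch below uses that to do a round trip back through read_csv; the variable names csv_text and restored are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back should reproduce the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```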