20 Pandas Tricks and Code

Pandas is an open-source library for data analysis and manipulation in Python. It provides data structures for efficiently storing large datasets and tools for working with them in a user-friendly manner. The primary data structure in Pandas is the “DataFrame,” which is a two-dimensional, labeled data structure with columns that can be of different types.
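As a quick illustration, here is a minimal sketch of a DataFrame holding columns of different types (the column names and values are made up for the example):

```python
import pandas as pd

# A small DataFrame with columns of different types (example data)
example_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],   # strings
    'age': [31, 24, 45],                 # integers
    'score': [88.5, 92.0, 79.5],         # floats
})

# Each column keeps its own dtype
print(example_df.dtypes)
```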

Pandas Data Analysis

Pandas is widely used in the data science community for tasks such as data cleaning and preparation, aggregation, and filtering. It also provides powerful visualization capabilities through its integration with the Matplotlib library.

The library was designed to be fast, flexible, and expressive, and it has become an essential tool for data scientists, analysts, and researchers who work with large datasets. With its concise and readable syntax, Pandas makes it easy to perform complex data manipulations and analysis with just a few lines of code.

20 Pandas Tricks

Here is a list of some useful Pandas tricks along with sample code:

First, here’s the code to generate sample data for “data.csv”:

import pandas as pd
import numpy as np

# Create a dataframe with sample data
df = pd.DataFrame({
    'column_1': np.random.normal(100, 10, 1000),
    'column_2': np.random.normal(50, 5, 1000),
    'column_3': np.random.normal(10, 2.5, 1000)
})

# Save the dataframe to a csv file
df.to_csv('data.csv', index=False)

This code creates a sample dataframe with 1000 rows across three columns, “column_1”, “column_2”, and “column_3”. The data for each column is generated with NumPy’s np.random.normal function, which draws random values from a normal distribution with a specified mean and standard deviation. The dataframe is then saved to a CSV file named “data.csv”.

1. Selecting specific columns:

# Select only specific columns
selected_columns = df[['column_1', 'column_3']]

# Print the selected columns
print(selected_columns)

2. Filtering rows based on condition:

# Filter rows based on a condition
filtered_df = df[df['column_3'] > 10]

# Print the filtered dataframe
print(filtered_df)

3. Grouping data by columns:

# Group data by column_1 and find the mean of column_2 and column_3
# (with continuous data like this, most groups will contain a single row)
grouped_df = df.groupby('column_1').mean()

# Print the grouped dataframe
print(grouped_df)
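Because column_1 is continuous, grouping on it directly produces mostly one-row groups. A common pattern is to bin the values first with pd.cut and group by the bins; a sketch, with bin edges chosen arbitrarily for the example:

```python
import pandas as pd
import numpy as np

# Sample data matching the shape used above
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'column_1': rng.normal(100, 10, 1000),
    'column_2': rng.normal(50, 5, 1000),
})

# Bin the continuous column_1 into categories, then group by the bins
df['bucket'] = pd.cut(df['column_1'], bins=[0, 90, 110, np.inf],
                      labels=['low', 'mid', 'high'])
grouped = df.groupby('bucket', observed=True)['column_2'].mean()
print(grouped)
```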

4. Replacing values in a column:

# Replace all values of column_1 that are less than 80 with 0
df['column_1'] = df['column_1'].where(df['column_1'] >= 80, 0)

# Print the updated dataframe
print(df)

5. Sorting a dataframe by a column:

# Sort dataframe by column_1 in ascending order
sorted_df = df.sort_values('column_1')

# Print the sorted dataframe
print(sorted_df)

6. Dropping duplicates from a dataframe:

# Drop duplicate rows
df = df.drop_duplicates()

# Print the updated dataframe
print(df)

7. Renaming columns in a dataframe:

# Rename columns
df = df.rename(columns={'column_1': 'new_column_1', 'column_2': 'new_column_2'})

# Print the updated dataframe
print(df)

8. Concatenating dataframes:

# Load sample dataframes
df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")

# Concatenate dataframes along the rows (axis=0)
concatenated_df = pd.concat([df1, df2], axis=0)

# Print the concatenated dataframe
print(concatenated_df)
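Since “data1.csv” and “data2.csv” are placeholders here, a self-contained sketch with small inline frames (assumed data) shows the effect, including ignore_index for renumbering the result:

```python
import pandas as pd

# Two small frames standing in for data1.csv / data2.csv (made-up data)
df1 = pd.DataFrame({'column_1': [1, 2]})
df2 = pd.DataFrame({'column_1': [3, 4]})

# Stack rows; ignore_index=True renumbers the index 0..n-1
concatenated_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated_df)
```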

9. Merging dataframes:

# Load sample dataframes
df1 = pd.read_csv("data1.csv")
df2 = pd.read_csv("data2.csv")

# Merge dataframes on a common column
merged_df = pd.merge(df1, df2, on='common_column')

# Print the merged dataframe
print(merged_df)
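Again with inline stand-in frames (assumed data), this sketch shows the default inner join alongside an outer join via the how parameter:

```python
import pandas as pd

# Small inline frames standing in for data1.csv / data2.csv (made-up data)
df1 = pd.DataFrame({'common_column': [1, 2, 3], 'a': ['x', 'y', 'z']})
df2 = pd.DataFrame({'common_column': [2, 3, 4], 'b': [10, 20, 30]})

# Inner join (default): keeps only keys present in both frames
inner = pd.merge(df1, df2, on='common_column')

# Outer join: keeps all keys, filling gaps with NaN
outer = pd.merge(df1, df2, on='common_column', how='outer')
print(inner)
print(outer)
```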

10. Pivot tables:

# Create a pivot table
pivot_table = df.pivot_table(index='column_1', columns='column_2', values='column_3')

# Print the pivot table
print(pivot_table)
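Pivot tables are easiest to read when the index and columns are discrete keys rather than continuous values. A small sketch with made-up sales data:

```python
import pandas as pd

# Example data with discrete keys for the pivot (made up for illustration)
sales = pd.DataFrame({
    'region': ['north', 'north', 'south', 'south'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 120, 90, 110],
})

# One row per region, one column per quarter, summed revenue in the cells
table = sales.pivot_table(index='region', columns='quarter',
                          values='revenue', aggfunc='sum')
print(table)
```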

11. Handling missing values:

# Fill missing values in column_1 with the mean of the column
# (assigning back avoids chained inplace calls, which newer pandas deprecates)
df['column_1'] = df['column_1'].fillna(df['column_1'].mean())

# Drop all rows with missing values
df.dropna(inplace=True)

# Print the updated dataframe
print(df)
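Before filling or dropping, it often helps to count the missing values per column with isna().sum(). A self-contained sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'column_1': [1.0, np.nan, 3.0, np.nan]})

# Count missing values per column before deciding how to handle them
missing_counts = df.isna().sum()
print(missing_counts)

# Fill with the column mean (assignment avoids chained inplace calls)
df['column_1'] = df['column_1'].fillna(df['column_1'].mean())
print(df)
```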

12. Applying functions to a dataframe:

# Define a custom function
def custom_function(x):
    return x + 5

# Apply the custom function to column_1
df['column_1'] = df['column_1'].apply(custom_function)

# Print the updated dataframe
print(df)

13. Selecting data using indexing:

# Select rows 5 to 10 of the dataframe
selected_df = df.iloc[4:10]

# Print the selected data
print(selected_df)
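iloc selects by integer position, while loc selects by label (and is inclusive of the end label). A small sketch with a labeled index, made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# iloc selects by position, loc by label
by_position = df.iloc[1:3]   # rows at positions 1 and 2 ('b', 'c')
by_label = df.loc['b':'c']   # same rows, selected by label (inclusive)
print(by_position)
print(by_label)
```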

14. Creating histograms:

import matplotlib.pyplot as plt

# Plot a histogram of column_1
df['column_1'].plot.hist()

# Add a title to the histogram
plt.title('Histogram of column_1')

# Display the histogram
plt.show()
[Figure: histogram of column_1]

15. Creating box plots:

import matplotlib.pyplot as plt

# Plot a box plot of column_1
df.boxplot(column=['column_1'])

# Add a title to the box plot
plt.title('Box Plot of column_1')

# Display the box plot
plt.show()
[Figure: box plot of column_1]

16. Creating scatter plots:

# Plot a scatter plot of column_1 and column_2
plt.scatter(df['column_1'], df['column_2'])

# Add a title to the scatter plot
plt.title('Scatter Plot of column_1 and column_2')

# Display the scatter plot
plt.show()
[Figure: scatter plot of column_1 vs column_2]

17. Creating line plots:

# Plot a line plot of column_1
df['column_1'].plot()

# Add a title to the line plot
plt.title('Line Plot of column_1')

# Display the line plot
plt.show()
[Figure: line plot of column_1]

18. Plotting multiple plots in a single figure:

# Create a figure and axis
fig, ax = plt.subplots(nrows=1, ncols=2)

# Plot a histogram of column_1 on the first axis
df['column_1'].plot.hist(ax=ax[0])

# Plot a scatter plot of column_1 and column_2 on the second axis
ax[1].scatter(df['column_1'], df['column_2'])

# Add a title to the figure
fig.suptitle('Multiple Plots in a Single Figure')

# Display the figure
plt.show()
[Figure: subplots with a histogram and a scatter plot]

19. Finding relationships (correlation)

# Show the pairwise correlations between the columns
print(df.corr())
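To make the output concrete, here is a self-contained sketch with made-up data where column_2 rises with column_1 and column_3 falls with it:

```python
import pandas as pd

df = pd.DataFrame({'column_1': [1, 2, 3, 4],
                   'column_2': [2, 4, 6, 8],    # perfectly positively correlated
                   'column_3': [4, 3, 2, 1]})   # perfectly negatively correlated

# Pairwise Pearson correlations between the numeric columns
corr_matrix = df.corr()
print(corr_matrix)
```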

20. Use apply to run a lambda function

# Divide the value in column_1 by the value in column_2, row by row
df['divide'] = df.apply(lambda x: x['column_1'] / x['column_2'], axis=1)

# Print the updated dataframe
print(df)
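Row-wise apply works, but for simple arithmetic a vectorized expression gives the same result and is usually much faster. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'column_1': [10.0, 20.0, 30.0],
                   'column_2': [2.0, 4.0, 5.0]})

# Same result as apply with a lambda, computed column-wise in one step
df['divide'] = df['column_1'] / df['column_2']
print(df)
```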

Further Readings

If you’re looking to further your knowledge of Pandas, here are some resources that you might find useful:

  1. The official Pandas documentation (https://pandas.pydata.org/docs/), which provides a comprehensive guide to Pandas, including tutorials and reference materials.
  2. The book “Python for Data Analysis” by Wes McKinney (https://amzn.to/2TZlXhI), which provides an in-depth guide to using Pandas for data analysis.
  3. The online tutorial “Data Wrangling with Pandas” by Kevin Markham (https://www.datacamp.com/courses/data-wrangling-with-pandas), which provides a comprehensive and hands-on introduction to using Pandas.
  4. The YouTube channel “Data School” (https://www.youtube.com/user/dataschool), which features a number of tutorials on Pandas and other data science topics.
  5. The online course “Data Science Fundamentals with Python and Pandas” (https://www.edx.org/course/data-science-fundamentals-with-python-and-pandas), which provides an introduction to Pandas and its use in data science.

These are just a few of the many resources available for learning Pandas. I hope you find them helpful in furthering your knowledge of this powerful library.
