This article is part of the Data Spell series (check out more about it here). If you're new to Python, I recommend starting with Dive into Python, and for a primer on NumPy, check out this article.
If you've ever dipped into machine learning or data analysis, chances are you've heard of the Pandas library. It's one of those tools that everyone talks about, and for good reason. In this article, we're going to take a closer look at what makes Pandas such a game-changer in the world of data manipulation and analysis.
But what is Pandas? (🐼 ...??)
What is Pandas?
Pandas is an open-source data analysis and manipulation library built on top of the Python programming language. It provides data structures and functions needed to work with structured data seamlessly.
Primary data structures in Pandas:
Series: A one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.)
DataFrame: A two-dimensional labeled data structure with rows and columns (you can think of it like an Excel sheet)
Set up
Before you start using Pandas, you need to install it. If you haven't installed Pandas yet, you can do so using pip (the leading ! runs the command from a notebook cell; drop it if you are typing in a terminal):
!pip install pandas
After a successful installation, import it like any other library:
import pandas as pd
Getting Started with Pandas
Creating a Pandas Series
A Pandas Series is similar to a list in Python, but it comes with an index that labels each element.
data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])
This is useful when you want to reference elements by labels rather than by position.
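To make the label-versus-position idea concrete, here is a minimal sketch that reuses the same series. The variable names by_label, by_position, and subset are only illustrative.

```python
import pandas as pd

data = [10, 20, 30, 40]
series = pd.Series(data, index=['a', 'b', 'c', 'd'])

by_label = series['b']        # look up a value by its label
by_position = series.iloc[1]  # look up the same value by its position
subset = series[['a', 'd']]   # pass a list of labels to get a smaller Series

print(by_label, by_position)
print(subset)
```

Both lookups return the same element here, because the label 'b' sits at position 1.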
Creating a DataFrame
A DataFrame is a table-like structure that consists of rows and columns. Each column in a DataFrame is a Series.
data = {
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
}
df = pd.DataFrame(data)
Pandas automatically assigns a default integer index (0, 1, 2, ...) to the rows.
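If you want to check these two claims yourself (the default integer index, and each column being a Series), a quick sketch along these lines should do it. It rebuilds the same DataFrame so it can run on its own.

```python
import pandas as pd

data = {
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
}
df = pd.DataFrame(data)

print(df.index)          # the default index Pandas created for the rows
print(type(df['Age']))   # a single column comes back as a Series
print(df.shape)          # (number of rows, number of columns)
```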
Essential Operations in Pandas
Pandas offers a wide array of operations that allow you to manipulate and analyze your data. Let's look at some of the most commonly used ones.
head(): Displays the first few rows of the DataFrame (5 by default; you can pass the number of rows to show).
tail(): Displays the last few rows of the DataFrame (you can also pass the number of rows).
info(): Provides a concise summary of the DataFrame.
describe(): Generates descriptive statistics of the DataFrame.
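Here is a small sketch of these four methods on the example DataFrame from earlier. The names first_two, last_one, and stats are just placeholders for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
})

first_two = df.head(2)   # first 2 rows instead of the default 5
last_one = df.tail(1)    # last row only
df.info()                # prints column names, non-null counts and dtypes
stats = df.describe()    # by default, statistics for numeric columns only

print(first_two)
print(last_one)
print(stats)
```

Note that info() prints its summary directly rather than returning a DataFrame, and describe() only covers the Age column here because it is the only numeric one.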
Selecting Data
You can select data from a DataFrame using labels (column names) or position (row/column index).
Selecting a single column
print(df['Name'])
Selecting multiple columns
print(df[['Name', 'City']])
Selecting rows by index
print(df.loc[1]) # By label-based index
print(df.iloc[1]) # By position-based index
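With the default integer index, loc[1] and iloc[1] happen to return the same row, which can hide the difference between them. The sketch below assumes we use the Name column as the index (via set_index, a choice made only for this illustration) so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
})

# Use the names as row labels instead of 0, 1, 2, 3
labeled = df.set_index('Name')

row_by_label = labeled.loc['Howard']   # label-based: the row labeled 'Howard'
row_by_position = labeled.iloc[2]      # position-based: the third row
cell = labeled.loc['Penny', 'Age']     # a single value: row label, column label

print(row_by_label)
print(cell)
```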
Filtering Data
Pandas allows you to filter your DataFrame based on certain conditions. For example, to keep only the rows where the age is greater than 25:
filtered_df = df[df['Age'] > 25]
print(filtered_df)
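Conditions can also be combined. As a rough sketch, use & for "and" and | for "or", and wrap each condition in parentheses. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
})

# A single condition, as in the example above
older_than_25 = df[df['Age'] > 25]

# Both conditions must hold (note the parentheses around each one)
both = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# At least one condition must hold
either = df[(df['Age'] < 23) | (df['City'] == 'India')]

print(both)
print(either)
```

The parentheses matter because & and | bind more tightly than comparison operators in Python.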
Adding and Dropping Columns
Add a column
df['Salary'] = [50000, 60000, 40000, 80000]
print(df)
Drop a column
df = df.drop('City', axis=1)
print(df)
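Two related patterns are worth a quick sketch: building a new column from an existing one, and dropping columns by name with the columns argument instead of axis=1. The MonthlySalary column is a made-up example for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Raj', 'Lenard', 'Howard', 'Penny'],
    'Age': [24, 27, 22, 32],
    'City': ['India', 'Los Angeles', 'Chicago', 'New York']
})

df['Salary'] = [50000, 60000, 40000, 80000]

# Derive a new column from an existing one (hypothetical column name)
df['MonthlySalary'] = df['Salary'] / 12

# drop() returns a new DataFrame; df itself is left unchanged
trimmed = df.drop(columns=['City', 'MonthlySalary'])

print(trimmed)
```

Because drop() returns a copy by default, you need to assign the result (as the article does with df = df.drop(...)) if you want the change to stick.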
Conclusion
In a nutshell, Pandas is your go-to toolkit for making data analysis simpler and more intuitive. Whether you're cleaning up messy data, exploring patterns, or preparing datasets for machine learning, Pandas streamlines the process. With just a few lines of code, you can break complex data tasks into manageable steps, which lets you focus on drawing insights and making informed decisions. So explore more and try out its different functionalities and use cases.
Happy coding!!