Some of My Favorite Pandas Snippets

I have spent the last six months migrating multiple (mostly) SQL-based data sources to a variety of AWS-based targets, ranging from conventional SQL backends to user stores like Cognito. Many (if not all) of these efforts involved joining and coalescing data from multiple sources per data set in order to migrate to a single backend system. For the majority of the work, I used Jupyter notebooks, relying heavily on Pandas and NumPy, for source data analysis, transformation, and load into target systems, alongside database-specific client libraries for MySQL, MS SQL, and Redshift/Postgres.

After working through a few data sets, some commonalities began to emerge, many of them during the early analysis and data-cleanup phases of the migrations. While most of these scenarios are covered in one form or another in other online forums (90% Stack Overflow, 10% Medium), none of the solutions I encountered online was quite what I was looking for, so I thought I'd share my scenarios and solutions here. For all of these examples, we'll keep things simple and pretend we have a Pandas DataFrame assigned to a variable called df.
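To keep the examples concrete, here's a minimal, made-up df; the column names (colA, colB) are placeholders, not anything from the actual migrations:

import pandas as pd

df = pd.DataFrame({
    'colA': [1, 2, 2, 3],            # note the duplicate value, used in Scenario #2
    'colB': ['w', 'x', 'y', 'z'],
})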

Scenario #1: Get All Unique Values from a Single Column in a DataFrame

import pandas as pd
...
# pd.unique() returns a NumPy array, not a list; wrap it in list() if you need one
list_of_unique_vals = pd.unique(df[['column_to_extract_unique_vals_from']].values.ravel())
# For a single column, df['column_to_extract_unique_vals_from'].unique() is equivalent and shorter

Scenario #2: Get All Rows from a Pandas DataFrame by Finding Duplicates Based on a Single Column in the Same DataFrame

So, this is a tricky one. Everywhere you look, you’ll see solutions that directly involve the use of the .duplicated() function, but that doesn’t tell the whole story. By default, duplicated() uses keep='first', which only flags a row once an earlier occurrence has been seen. That is, if you have two rows that are duplicated on a single column’s value, duplicated() will only flag one of them. Maybe it’s just me, but I’d like to see both of them. Here’s how to show all of the rows!

import pandas as pd
...
# Flag rows whose colA value has already been seen (keep='first' is the default)
df['is_colA_duped'] = df.duplicated(['colA'])
# Collect the colA values that were flagged as duplicated
list_of_colA_dupe_vals = list(pd.unique(df.query("is_colA_duped == True")['colA'].values.ravel()))
...
# Maybe store duplicate rows into a separate DataFrame and look at a few of them
df_colA_dupes = df[df['colA'].isin(list_of_colA_dupe_vals)].sort_values(by=['colA'])
df_colA_dupes.sample(20)   # sample() raises an error if there are fewer than 20 rows

I am aware that this can be done more succinctly (i.e. without the extra column or storing the list in a separate variable), but it is convenient to be able to look at the list of values. Optimize to your heart’s content; one more compact option is sketched below.
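For reference, here’s the compact version using duplicated()’s keep=False parameter, which flags every occurrence of a duplicated value instead of all-but-the-first:

df_colA_dupes = df[df.duplicated(['colA'], keep=False)].sort_values(by=['colA'])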

Scenario #3: Create a DataFrame Directly from a SQL Query (from a DB-API Compliant Library)

import pandas as pd
from pandas import DataFrame
import pymssql
import mysql.connector
...
# MySQL
mysql_conn = mysql.connector.connect(
                ...
             )
mysql_cursor = mysql_conn.cursor()
mysql_cursor.execute("""SELECT * FROM some_table""")
df_from_mysql = DataFrame(mysql_cursor.fetchall())
df_from_mysql.columns = [x[0] for x in mysql_cursor.description]
...
# MS SQL
mssql_conn = pymssql.connect(
                ...
             )
mssql_cursor = mssql_conn.cursor()
mssql_cursor.execute("""SELECT TOP 100 * FROM SomeTable""")   # I really like TOP; I wish I could use it in MySQL (LIMIT 100 is the equivalent there). You can also use it in Redshift.
df_from_mssql = DataFrame(mssql_cursor.fetchall())
df_from_mssql.columns = [x[0] for x in mssql_cursor.description]

As you can see, if the client library for whatever engine you’re using is DB-API compliant, the procedure is the same. It’s worth creating a function to extract the column names from the cursor description (really, combining the load of data into the DataFrame and the column assignment in a single function makes the most sense). Maybe something like:

def create_df(cursor):
    # Works with any DB-API 2.0 cursor on which a query has already been executed
    df = DataFrame(cursor.fetchall())
    df.columns = [x[0] for x in cursor.description]

    return df
...

cursor.execute("""SELECT * FROM Tbl""")
my_new_df = create_df(cursor)

In hindsight, I don’t know why I didn’t do something like this before. In any case, you benefit from my hindsight!
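Worth noting: pandas also ships a built-in pd.read_sql() that collapses the fetch-and-name-columns dance into a single call. It officially supports SQLAlchemy connectables and sqlite3 connections, so here’s a minimal sketch using sqlite3 (the file and table names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')   # hypothetical database file
df_from_sql = pd.read_sql('SELECT * FROM some_table', con=conn)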

Scenario #4: Loading a SQL Table from a DataFrame

So this one tripped me up. I expected I could use any client library, but that doesn’t work: DataFrame.to_sql() wants a SQLAlchemy connectable (or a sqlite3 connection), not a raw DB-API connection. For a MySQL-based database, it’ll look something like this:

from sqlalchemy import create_engine
...

# If your DataFrame contains bytes (b'...') values destined for binary columns, you'll need the "?charset=utf8mb4&binary_prefix=true" bit!
# Also, this is obviously based on Python 3.6+ since I'm using f-strings (and you should, too).
engine = create_engine(f'mysql+pymysql://{DBUSER}:{DBPASS}@{DBHOST}:3306/{DBSCHEMA}?charset=utf8mb4&binary_prefix=true')

# You may or may not want to set index and if_exists as I have.  Look at the Pandas docs for more detail on the params.
df.to_sql('TableToLoadInto', con=engine, index=False, if_exists='append')
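For larger DataFrames, it may help to batch the INSERTs rather than firing them one row at a time; here’s a sketch using to_sql()’s chunksize and method parameters (the table name is the same placeholder as above):

df.to_sql('TableToLoadInto', con=engine, index=False, if_exists='append',
          chunksize=1000, method='multi')   # 'multi' packs many rows into each INSERT statement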

That’s it for now! I’ll make sure to add new snippets and solutions here as I come across them.
