Find common genes from two conditions with their expression values using pandas merge method

Let’s say we have two pandas dataframes. Each of them contain two columns, the gene name and their expression value. There may be other columns with more details of each gene but for simplicity we will consider only two columns.

This may be expression data from two growth condition of culture, or two treatment groups of patients.

If our aim is to find common genes between this data and compare their expression values, following is how we do it.

First import pandas.

Python
import pandas as pd	

Then we will create an example dataframes and print them.

Python
df1 = pd.DataFrame({'gene_name': ['gene1', 'gene2', 'gene3', 'gene4', 'gene5'],
                   'exp': [3.3, 4.5, 2.1, 8.2, 3.7]})

df2 = pd.DataFrame({'gene_name': ['gene3', 'gene4', 'gene6', 'gene9', 'gene10'],
                   'exp': [4.5, 2.6, 7.9, 8.4, 1.8]})

print(f'df1:\n{df1} \n\ndf2:\n{df2}')
output
    df1:
      gene_name  exp
    0     gene1  3.3
    1     gene2  4.5
    2     gene3  2.1
    3     gene4  8.2
    4     gene5  3.7 
    
    df2:
      gene_name  exp
    0     gene3  4.5
    1     gene4  2.6
    2     gene6  7.9
    3     gene9  8.4
    4    gene10  1.8
    

We want to find common genes between these two dataframes and print their expression sisde-by-side in a new combined dataframe.

We use merge method of the dataframe for this. For our example, we need three arguments to it.

First argument is the second dataframe.

Second argument is on argument, which defines the list of columns in which we want to find common values.

Third argument is how. We will assign its value to be inner as we want to find the intersection between names of the genes in both dataframes.

If we wanted to find the union of the gene names, we will use the how=outer argument.

Python
df_merged = df1.merge(df2, on=['gene_name'], how='inner')
print(df_merged)
output
      gene_name  exp_x  exp_y
    0     gene3    2.1    4.5
    1     gene4    8.2    2.6
    

We see above that the common genes in both dataframes and their expression values are extracted.

The new dataframe has expression values in columns suffixed with _x and _y. It may be confusing as it is not clear which values are coming from df1 and which one is coming from df2.

So, we use the suffixes argument. Here, we have df1 as the first (or left) dataframe and df2 and the second (or right) dataframe. The value of the suffixes argument is a tuple equal to (left_dataframe, right_dataframe).

Following is the full statement.

Python
df_merged = df1.merge(df2, 
                      on=['gene_name'], 
                      how='inner', 
                      suffixes=('_df1', '_df2'))
print(df_merged)
output
      gene_name  exp_df1  exp_df2
    0     gene3      2.1      4.5
    1     gene4      8.2      2.6

Now, we got our result with clear indication of the source dataframes.