Let’s say we have two pandas dataframes. Each of them contain two columns, the gene name and their expression value. There may be other columns with more details of each gene but for simplicity we will consider only two columns.
This may be expression data from two growth condition of culture, or two treatment groups of patients.
If our aim is to find common genes between this data and compare their expression values, following is how we do it.
First import pandas.
import pandas as pd Then we will create an example dataframes and print them.
df1 = pd.DataFrame({'gene_name': ['gene1', 'gene2', 'gene3', 'gene4', 'gene5'],
'exp': [3.3, 4.5, 2.1, 8.2, 3.7]})
df2 = pd.DataFrame({'gene_name': ['gene3', 'gene4', 'gene6', 'gene9', 'gene10'],
'exp': [4.5, 2.6, 7.9, 8.4, 1.8]})
print(f'df1:\n{df1} \n\ndf2:\n{df2}') df1:
gene_name exp
0 gene1 3.3
1 gene2 4.5
2 gene3 2.1
3 gene4 8.2
4 gene5 3.7
df2:
gene_name exp
0 gene3 4.5
1 gene4 2.6
2 gene6 7.9
3 gene9 8.4
4 gene10 1.8
We want to find common genes between these two dataframes and print their expression sisde-by-side in a new combined dataframe.
We use merge method of the dataframe for this. For our example, we need three arguments to it.
First argument is the second dataframe.
Second argument is on argument, which defines the list of columns in which we want to find common values.
Third argument is how. We will assign its value to be inner as we want to find the intersection between names of the genes in both dataframes.
If we wanted to find the union of the gene names, we will use the how=outer argument.
df_merged = df1.merge(df2, on=['gene_name'], how='inner')
print(df_merged) gene_name exp_x exp_y
0 gene3 2.1 4.5
1 gene4 8.2 2.6
We see above that the common genes in both dataframes and their expression values are extracted.
The new dataframe has expression values in columns suffixed with _x and _y. It may be confusing as it is not clear which values are coming from df1 and which one is coming from df2.
So, we use the suffixes argument. Here, we have df1 as the first (or left) dataframe and df2 and the second (or right) dataframe. The value of the suffixes argument is a tuple equal to (left_dataframe, right_dataframe).
Following is the full statement.
df_merged = df1.merge(df2,
on=['gene_name'],
how='inner',
suffixes=('_df1', '_df2'))
print(df_merged) gene_name exp_df1 exp_df2
0 gene3 2.1 4.5
1 gene4 8.2 2.6Now, we got our result with clear indication of the source dataframes.