January 15, 2021
When slicing a pandas DataFrame we have position-based selection, data.iloc[row position, column position], and label-based selection, data.loc[row label, column label]. Tutorials abound for both; however, when I have a large dataset with a numeric or time-series index and labeled columns, more often than not I simply want to select a row by its position and a column by its label. This simple selection eluded me for much too long:
data.iloc[0].column_name
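To make that concrete, here is a minimal sketch on a made-up DataFrame. The column names price and volume and the date index are hypothetical, chosen only for illustration. It also shows a few equivalent spellings, since attribute access only works when the column name is a valid Python identifier that doesn't collide with an existing attribute.

```python
import pandas as pd

# Hypothetical data: a time-series index with labeled columns.
data = pd.DataFrame(
    {"price": [10.0, 11.5, 12.25], "volume": [100, 150, 125]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)

# Row by position, then column by label (attribute access).
first_price = data.iloc[0].price

# Equivalent forms that also work for column names that are not valid
# identifiers or that clash with an existing attribute or method.
first_price_brackets = data.iloc[0]["price"]
first_price_by_column = data["price"].iloc[0]

# A single purely positional indexer: translate the label to a position first.
first_price_single = data.iloc[0, data.columns.get_loc("price")]
```

Selecting the column first (data["price"].iloc[0]) avoids building an intermediate row Series, which may matter on wide frames with mixed dtypes.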
By default, when you create a new DataFrame from a list of arrays, pandas stacks them vertically, so each array becomes a row (axis=0). It baffles me that stacking arrays horizontally, so that each array becomes a column, isn't covered more prominently in the documentation:
import numpy as np
import pandas as pd
data_one = np.array([1, 2, 3])
data_two = np.array([4, 5, 6])
# A list of arrays: each array becomes a row
pd.DataFrame([data_one, data_two], columns=['col_1', 'col_2', 'col_3'])
#    col_1  col_2  col_3
# 0      1      2      3
# 1      4      5      6
# np.column_stack: each array becomes a column
df = pd.DataFrame(np.column_stack((data_one, data_two)), columns=['col_1', 'col_2'])
#    col_1  col_2
# 0      1      4
# 1      2      5
# 2      3      6
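As a possible alternative to np.column_stack, passing a dict of arrays to the constructor also puts each array in its own column. This is a sketch, not something the approach above requires. The names df_from_dict and df_stacked are mine, used only for the comparison.

```python
import numpy as np
import pandas as pd

data_one = np.array([1, 2, 3])
data_two = np.array([4, 5, 6])

# Each dict key becomes a column label; each array becomes that column.
df_from_dict = pd.DataFrame({"col_1": data_one, "col_2": data_two})

# The column_stack version, for comparison.
df_stacked = pd.DataFrame(
    np.column_stack((data_one, data_two)), columns=["col_1", "col_2"]
)
```

One difference worth knowing: np.column_stack builds a single 2-D array, so all columns are coerced to a common dtype, whereas the dict form keeps each column's own dtype.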
And finally, let's avoid some SettingWithCopyWarning warnings: when adding new columns to a DataFrame, use the assign method, which returns a new DataFrame instead of modifying the original in place:
df = pd.DataFrame([1, 2, 3], columns=['col_1'])
new_column = [4, 5, 6]
df = df.assign(col_2=new_column)
#    col_1  col_2
# 0      1      4
# 1      2      5
# 2      3      6
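The warning usually shows up when a column is added to a filtered selection of another DataFrame, so here is a minimal sketch of that case. The data and the names subset and subset_with_col are hypothetical, and whether the in-place version actually warns depends on the pandas version and its copy-on-write settings.

```python
import pandas as pd

df = pd.DataFrame({"col_1": [1, 2, 3, 4]})

# A filtered selection. Writing subset["col_2"] = ... at this point is the
# pattern that can trigger SettingWithCopyWarning in some pandas versions.
subset = df[df["col_1"] > 2]

# assign returns a new DataFrame and leaves both df and subset untouched.
subset_with_col = subset.assign(col_2=subset["col_1"] * 10)
```

Because nothing is modified in place, there is no ambiguity about whether the write lands on a view or a copy.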