A quick note on splitting strings in columns of pandas dataframes.
If we have a column of strings that we want to split, and from which we want to extract particular split elements, we can use the .str accessor to call the split function on each string, and then the .str accessor again to obtain a particular element from the resulting list.
import pandas as pd
df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})
df_str['path'] = df_str['col'].str.split('/').str[-1]
df_str['stub'] = df_str['path'].str.split('.').str[0]
df_str
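To see what each .str step returns, the chain can be unpacked into intermediate values. This is only a minimal sketch; the names parts and last are introduced here for illustration and are not part of the example above.

```python
import pandas as pd

df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})

# First .str: split every string in the column, giving one Python list per row.
parts = df_str['col'].str.split('/')
print(parts.iloc[0])

# Second .str: index into each row's list, here taking the last element.
last = parts.str[-1]

# Same pattern again: split on '.', then keep the first piece.
stub = last.str.split('.').str[0]
print(stub.tolist())
```

Note that splitting the URL on '/' also produces an empty string between the two slashes after 'http:', which is one reason to index from the end with -1.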
It’s apparently quicker (and cleaner) to write a custom function that does the splitting. I think this is due to having to repeatedly access .str.
Example tested on Google Colab.
import pandas as pd
df = pd.DataFrame(['hello/world.x'] * 1000000)
%timeit x = df[0].str.split('/').str[-1].str.split('.').str[0]
def my_split(string):
    return string.split('/')[-1].split('.')[0]
%timeit x = df[0].apply(my_split)
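Outside a notebook, where the %timeit magic is not available, the same comparison can be sketched with the standard library timeit module. This is a rough sketch under some assumptions: the row count is cut to 100,000 to keep it quick, and the wrapper names via_str and via_apply are made up for illustration. Timings will vary with machine and pandas version, so no numbers are quoted here.

```python
import timeit

import pandas as pd

# Smaller than the 1,000,000 rows used above, just to keep the run short.
df = pd.DataFrame(['hello/world.x'] * 100_000)


def my_split(string):
    # Plain Python string methods applied to a single value.
    return string.split('/')[-1].split('.')[0]


def via_str():
    # Chained .str accessor calls, as in the original post.
    return df[0].str.split('/').str[-1].str.split('.').str[0]


def via_apply():
    # One pass over the column with the custom function.
    return df[0].apply(my_split)


# Both approaches should produce the same values.
assert via_str().tolist() == via_apply().tolist()

# Total time in seconds for 3 calls of each approach.
t_str = timeit.timeit(via_str, number=3)
t_apply = timeit.timeit(via_apply, number=3)
print(f".str chain: {t_str:.3f}s  .apply(): {t_apply:.3f}s")
```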
@andy Yes, agreed, the .apply() approach is much better. The example arose from a teaching example around the use of the .str accessor when working with columns. The use of apply and map comes later… I’ll pinch the timing comparison, if I may, when demonstrating why .apply() may be a better approach.