Splitting Strings in pandas Dataframe Columns

A quick note on splitting strings in columns of pandas dataframes.

If a column contains strings that we want to split, and we then want to extract particular elements from the split, we can use the .str accessor to call split() on each string, and then use the .str accessor again to pick a particular element out of the resulting list.

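import pandas as pd

df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})
# Last '/'-separated element of each string, i.e. 'filename.suffix'
df_str['path'] = df_str['col'].str.split('/').str[-1]
# Part before the first '.', i.e. 'filename'
df_str['stub'] = df_str['path'].str.split('.').str[0]
df_str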
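As a possible variation on the same idea, the sketch below limits the number of splits with n=1 and uses expand=True to get the pieces back as separate columns rather than as a list. The column names filename, stub and suffix are my own choices for illustration, and this is just one way of doing it rather than the recommended one.

```python
import pandas as pd

# Same illustrative URL as in the example above
df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})

# Split once from the right on '/', then keep the last piece
df_str['filename'] = df_str['col'].str.rsplit('/', n=1).str[-1]

# expand=True returns a DataFrame with one column per split piece
parts = df_str['filename'].str.split('.', n=1, expand=True)
df_str['stub'] = parts[0]
df_str['suffix'] = parts[1]

print(df_str[['filename', 'stub', 'suffix']])
```

Limiting the split with n=1 avoids building a long list of pieces when only the first or last one is needed.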

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

2 thoughts on “Splitting Strings in pandas Dataframe Columns”

It’s apparently quicker (and cleaner) to write a custom function that does the splitting. I think this is due to having to repeatedly access .str in the chained version.

Example tested on Google Colab.

import pandas as pd

df = pd.DataFrame(['hello/world.x']*1000000)

%timeit x = df[0].str.split('/').str[-1].str.split('.').str[0]

1 loop, best of 3: 1.75 s per loop

def my_split(string):
    return string.split('/')[-1].split('.')[0]

%timeit x = df[0].apply(my_split)

1 loop, best of 3: 532 ms per loop
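For anyone wanting to rerun the comparison outside a notebook, where the %timeit magic isn't available, here is a rough sketch using the standard library timeit module. The wrapper functions via_str and via_apply are names made up for this sketch, and the row count is cut to 100,000 (an assumption, purely to keep the run short), so the timings won't match the figures quoted above and will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Fewer rows than the 1,000,000 used above, so the sketch runs quickly
df = pd.DataFrame(['hello/world.x'] * 100_000)

def my_split(string):
    return string.split('/')[-1].split('.')[0]

def via_str():
    # Chained .str accessor version
    return df[0].str.split('/').str[-1].str.split('.').str[0]

def via_apply():
    # Plain Python function applied element by element
    return df[0].apply(my_split)

# Sanity check: both routes should produce the same values
assert via_str().tolist() == via_apply().tolist()

# Report the best of three single runs for each approach
for fn in (via_str, via_apply):
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f'{fn.__name__}: {best:.3f} s')
```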

Tony Hirst says:

October 7, 2019 at 1:27 pm

@andy Yes, agreed, the .apply() approach is much better. The example arose from a teaching example around the use of .str. when working with columns. The use of the apply map comes later… I’ll pinch the timing comparison if I may when demonstrating why .apply() may be a better approach.
