Splitting Strings in pandas Dataframe Columns

A quick note on splitting strings in columns of pandas dataframes.

If we have a column of strings that we want to split, and from which we want to extract particular split elements, we can use the .str. accessor to call the split function on each string, and then the .str. accessor again to obtain a particular element from each split list.

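import pandas as pd

df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})
df_str['path'] = df_str['col'].str.split('/').str[-1]  # last path segment: 'filename.suffix'
df_str['stub'] = df_str['path'].str.split('.').str[0]  # text before the first '.': 'filename'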
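To make the two-step pattern a little more concrete, here is a minimal sketch that pulls the intermediate results apart. The names parts, last and pieces, and the 'suffix' column label, are my own and only there for illustration; the rsplit variant with expand=True is an alternative I'm adding, not something the snippet above relies on.

```python
import pandas as pd

df_str = pd.DataFrame({'col': ['http://example.com/path/filename.suffix'] * 3})

# Step 1: .str.split() returns a Series in which each value is a Python list.
parts = df_str['col'].str.split('/')

# Step 2: .str[-1] indexes into each of those lists, here taking the last item.
last = parts.str[-1]

# Alternative (illustrative): split once from the right and expand the
# pieces into separate columns rather than keeping them in a list.
pieces = last.str.rsplit('.', n=1, expand=True)
pieces.columns = ['stub', 'suffix']  # hypothetical column names

print(parts.iloc[0])
print(pieces.head())
```

The expand=True form may be handy when several pieces of the split are wanted at once, since each piece lands in its own column.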

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

2 thoughts on “Splitting Strings in pandas Dataframe Columns”

  1. It’s apparently quicker (and cleaner) to write a custom function that does the splitting – I think this is due to having to repeatedly access .str

    Example tested on Google Colab.

    import pandas as pd

    df = pd.DataFrame(['hello/world.x']*1000000)

    %timeit x = df[0].str.split('/').str[-1].str.split('.').str[0]

    1 loop, best of 3: 1.75 s per loop

    def my_split(string):
        return string.split('/')[-1].split('.')[0]

    %timeit x = df[0].apply(my_split)

    1 loop, best of 3: 532 ms per loop

    1. @andy Yes, agreed, the .apply() approach is much better. The example arose from a teaching example around the use of .str. when working with columns. The use of the apply map comes later… I’ll pinch the timing comparison if I may when demonstrating why .apply() may be a better approach.
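For anyone wanting to rerun the comparison from the comments outside a notebook, here is a rough sketch using the standard library timeit module in place of the %timeit magic. The smaller row count and number=3 are my own choices to keep it quick, and the timings will depend on the machine and the pandas version, so no figures are shown. It also checks that the two approaches give the same values, which the original timing did not do.

```python
import timeit

import pandas as pd

# Smaller than the 1,000,000 rows in the comment above, to keep the run short.
df = pd.DataFrame(['hello/world.x'] * 10000)

def my_split(string):
    # Same logic as the chained .str calls, applied to a single string.
    return string.split('/')[-1].split('.')[0]

via_str = df[0].str.split('/').str[-1].str.split('.').str[0]
via_apply = df[0].apply(my_split)

# Both routes should produce the same values before we compare speed.
assert via_str.tolist() == via_apply.tolist()

t_str = timeit.timeit(
    lambda: df[0].str.split('/').str[-1].str.split('.').str[0], number=3)
t_apply = timeit.timeit(lambda: df[0].apply(my_split), number=3)

print(f".str chain: {t_str:.3f}s  .apply: {t_apply:.3f}s")
```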

