Splitting names

Today on LinkedIn somebody posed the following question (for anonymity I have changed any names given):

Sorting by Last Name

I have a list of 2,000 names by First, Middle, Last, Suffix (Jr or I,II, or III) I need to be able to Sort by last name. I know how to use “Text to Column” but unfortunately there is not a consistancy in the names, I have some that have just First and Last name, I have some that may have a space in their last name such as “De Souza” or “John Dave Jones Jr II” any suggestions.

The issue wasn’t really about sorting, which would be easy with autofilters (CTRL+Shift+L to apply autofilters). The problem was separating out the parts of the names, and lining them up in the right columns. Usually text to columns would fix this, but doing it with the different formats mentioned in the post would have left last names in various different columns.

I find this is a common predicament in HR, where lists of names pulled from various sources are full of inconsistencies, and therefore hard to manipulate or ‘plug in’ to other systems or spreadsheets.

My solution was not particularly elegant. You are rarely afforded the time to create the perfect single formula answer that covers all angles. But it does show you some text-manipulation functions at work.

I e-mailed the file to the poster. They mentioned 5 exceptions that it had not worked for, and they all included initials (I hadn’t thought of this possibility). This leads to my next point: testing. If at all possible you should use your spreadsheets from beginning to end with real data to check if they’re working before they go to your customer. Even better, get someone else to do this for you, as its easy to miss something when you’re buried in details.

Here you can access the spreadsheet that I created. See if you can understand how it works. How would you do it differently? Can you adjust it to work for with initials? Feel free to send your spreadsheets in to admin@xlcalibre.com.

If you’d be interested in similar discussions, come join the brand new XLCalibre Group on LinkedIn.

4 thoughts on “Splitting names”

  1. It turned out there were 108 problems out of 2,775 names (so nearly a 4% error margin, due to the initials). If this were going to be used again and again it would be worth improving, but at 108 records in error it would likely be quicker / more accurate to manually correct them. I think that sometimes with Excel you need to remember that the quickest route for one-off tasks can involve some manual work.

  2. I’m glad that your follow-up mentioned the error rate. I like that. It can help get a sense of whether something is worth tweaking or leaving alone. 4% is good.

    I tried several names because I work with a lot of similar data.

    Anna Marie Willis-Kay

    Dr Byron G Graham IV
    John O’Connell Sr
    Tina Ste Marie
    Lester Walters, DVM
    Amos T. Jones

    Your check only gave me a FALSE on Graham.

    With things like this, I might do a straight Text-To-Columns (TtC) to handle the low-hanging fruit. Move the easy ones below the entire data set or Cut/Paste them onto another sheet. Maybe 50% are handled immediately.

    Then I’d start moving in big chunks. Maybe take the TtC entries that expanded out to 5 columns, investigate what happened there? Is there a way to concatenate 2 fields back together? Move them to the set that’s done.

    Then take the ones that TtC’d out to 4 columns …

    I don’t know what your dataset is like and whether there’s opportunity for people to enter inconsistencies. But I typically have data that had been manually entered with very few restrictions thus, I’m working with not only odd names but also things like:
    Anna Barnes-Keller vs. Anna Barnes Keller
    And if the person entered their own data (e.g., an online registration), they might put all 3 of their professional designations in the name field:

    Joe Smith PhD, CFP, CPA

    Ultimately, I’ve found it’s best to handle these in big chunks, look for patterns and move the completed entries out of the way.

  3. Hi Oz! Thanks for your comments and apologies for the delay in replying. You make some really good points.
    I notice a couple of the errors you list have initials. When dealing with the original query I had used a description from the poster that didn’t mention initials, so I had ignored them (at my peril). The formulas in the file could be amended to deal with single letters (or single letters followed by a full stop).
    I think maybe your text to columns method is simpler and maybe therefore better approach, depending on how well it works on your data. I might add a step at the beginning though, to take care of the titles. As there is a pretty established list of titles, they’re not hard to separate (some kind of =iferror(vlookup(left(find(” ” … formula, vlookuping to a list of titles, would probably do the trick. Then start with everything that’s leftover and do your Ttc exercise.
    Finally, I think that you hit the nail on the head when you mentioned “I typically have data that had been manually entered with very few restrictions”. Data validation is the majority of this battle. If you provide enough fields for the various name elements, with sufficient validation, you can greatly reduce the problems. The other tool in this is feedback. If the user is seeing their own data regular, say on a log-in or self-service screen, they’ll be able to identify if it looks wrong and feedback changes. Of course that all depends on what system separate to Excel you’re using if any…

  4. Yes. Data Validation is critical. It helps the data input remain as clean as possible. I find that a lot of Excel projects are light on the calculations, layout, charting, etc. Maybe nothing more complex than SUM. But it’s really really heavy on the validation–if the project is done right.

    Thanks for what you’re doing here. Lots of interesting activities.

Leave a Reply

Your email address will not be published. Required fields are marked *