5

I have data imported from a literature database (.txt format) into Excel. Unfortunately, in most cases, the authors are listed double in one cell, e.g.:

Smith, Miller, Patty, Smith, Patty, Miller

As you can see, the order of the names changes as well. I would like to remove the second repeats, to Smith, Miller, Patty

Does anyone have a super cool shortcut how to do this quickly? Currently, I'm going through each and every cell manually. (I know, fixing this already at the literature database step would be best. Unfortunately, this is not an option right now...).

Excellll
  • 12,627
  • 11
  • 51
  • 78
Simone Lerch
  • 51
  • 1
  • 1
  • 2
  • 1
    Check this out, specifically the second option. http://stackoverflow.com/questions/15805157/how-do-i-remove-duplicate-values-in-a-cell-seperated-with-a – Josh Sep 11 '13 at 15:32
  • If you don't know how to add VB code in Excel: 1) hit Alt + F11 to open VB editor. 2) Go to [Insert] - [Module] then paste the code. 3) hit Alt + F11 again to get back to Excel – Josh Sep 11 '13 at 15:33
  • can you give a sample screenshot to show how your data is laid out? –  Sep 12 '13 at 07:07

3 Answers3

4

The way Excel handles split functionality and unique or distinct functions is a little unwieldy: split is done via "text to columns", like in MS Word, and unique/distinct is done by advanced filtering "unique values only" into a new column.

Instead you could try fixing this in Google Sheets with the following formula:

Cell A1: Smith, Miller, Patty, Smith, Patty, Miller

Cell A2: =join(", ",unique(transpose(split(A1,", "))))

Result: Smith, Miller, Patty

Google Sheet: https://docs.google.com/spreadsheets/d/1A1l0qdnNSHlJB-5DARGKDeIsbuCCLGuoYWm8sR29UTA/edit?usp=sharing

SPLIT on commas to create a separate value in each adjacent cell, spread across several columns. TRANSPOSE will convert that from a row of many values to a single column of many values. I think that is necessary for UNIQUE to work. Finally, JOIN takes the unique (de-duplicated) values in this array and creates a new comma-separated string from them.

fixer1234
  • 27,064
  • 61
  • 75
  • 116
Dannid
  • 281
  • 2
  • 5
  • `SPLIT` on commas to create a separate value in each adjacent cell, spread across several columns. `TRANSPOSE` will convert that from a row of many values to a single column of many values. I think that is necessary for `UNIQUE` to work. Finally, `JOIN` takes the unique (de-duplicated) values in this array and creates a new comma-separated string from them. – Dannid Nov 25 '15 at 19:38
1

Your imported data probably has incorrect delimiters, such as the comma in between the names. You could maybe fix this by creating a new column and trim. Have Excel look for the 3rd comma, and then get rid of any characters after it.

  • 1
    Too many assumptions made here really - this is fine if you know there are only going to be 3 specific names and that no duplicates occur in the first 3 terms - I think Simone Lerch is looking for a more flexible method that excludes any duplicate in the string. – Andi Mohr Oct 01 '13 at 14:27
0

You can use the following VBA code to remove duplicates from a delimited list.

Public Function RemoveDuplicates(list As String, delimiter As String) As String
Dim arrSplit As Variant, i As Long, tmpDict As Object, tmpOutput As String
Set tmpDict = CreateObject("Scripting.Dictionary")
arrSplit = Split(list, delimiter)
For i = LBound(arrSplit) To UBound(arrSplit)
    If Not tmpDict.Exists(arrSplit(i)) Then
        tmpDict.Add arrSplit(i), arrSplit(i)
        tmpOutput = tmpOutput & arrSplit(i) & delimiter
    End If
Next i
If tmpOutput <> "" Then tmpOutput = Left(tmpOutput, Len(tmpOutput) - Len(delimiter))
RemoveDuplicates = tmpOutput
'housekeeping
Set tmpDict = Nothing
End Function

Sub ZapDuplicatesInPlace()
Dim r As Range, va() As Variant
Set r = Application.InputBox("Select range to remove duplicates cell by cell.", "Remove Duplicates From Lists", , , , , , 8)
va = r.Value
For i = LBound(va, 1) To UBound(va, 1)
    For j = LBound(va, 2) To UBound(va, 2)
        'This assumes delimiter is comma followed by space.
        va(i, j) = RemoveDuplicates(CStr(va(i, j)), ", ")
    Next j
Next i
'Print output to sheet
r.Value = va
End Sub

There are two ways you can use this code to achieve what you want.

  1. If you want to remove the duplicates in place, i.e., if you want to clean the data you have and delete the duplicates forever, you can run the ZapDuplicatesInPlace sub. It will prompt you to select a range that you want it to process. Each cell in the range will be stripped of duplicates.

  2. If you would rather use worksheet functions to leave your original data intact, you can use the function RemoveDuplicates in a formula. For example, if you have Smith, Miller, Patty, Smith, Patty, Miller in A1, you can use the formula below in another cell to return the list minus the duplicates.

    =RemoveDuplicates(A1,", ")

For instructions for using VBA in your workbook, see this post.

Excellll
  • 12,627
  • 11
  • 51
  • 78