Saturday, July 7, 2012

PowerShell (v3) - Find Duplicate lines in File

On occasion I need to track down duplicate entries in a file. Without going through a bunch of mechanics, I found this approach useful, and, most importantly, easy. First, we will create a dummy array and store the contents in a temp file:
# Create temp file with dummy data including duplicate lines
1,2,3,4,1,2,3,1,2,1 |
Out-File -FilePath ($tempfile = [IO.Path]::GetTempFileName()) -Encoding ASCII -Append
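The call to GetTempFileName() both creates the file and returns its full path, which we capture in $tempfile for the later steps. If you want a quick sanity check that the file landed where expected (assuming the assignment above succeeded):
# Show the temp file path and confirm it exists
$tempfile
Test-Path -Path $tempfile   # Should return True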
Next, we get the data into an array. Interestingly, Get-Content does this for you without any extra work:
# Get file contents into an array
$filecontents = Get-Content -Path $tempfile
We can verify that the result really is an array with this command:
$filecontents.GetType()

IsPublic IsSerial Name                                     BaseType                                                                     
-------- -------- ----                                     --------                                                                      
True     True     Object[]                                 System.Array
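As a further check, Get-Content returns one array element per line of the file, so with the dummy data above the count should be 10:
# One array element per line of the file
$filecontents.Count   # 10 for the dummy data above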
With an array in hand, we can use the Group-Object cmdlet (or its group alias) together with Where-Object (or the where alias) to find groupings that contain more than one entry. In essence, this is the set of lines (array entries) that appear more than once:
# Find duplicates
$filecontents |
Group |
Where {$_.count -gt 1}
Running this produces the following results:
Count Name                      Group                                                                                                   
----- ----                      -----                                                                                                   
    4 1                         {1, 1, 1, 1}                                                                                            
    3 2                         {2, 2, 2}                                                                                               
    2 3                         {3, 3} 
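If you only care about the duplicated values themselves rather than the full group objects, one option (an extra step beyond the original, using Select-Object) is to expand the Name property:
# List just the values that appear more than once
$filecontents |
Group |
Where {$_.count -gt 1} |
Select-Object -ExpandProperty Name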
To finalize this sample, remove the temp file:
# Clean up
Remove-Item -Path $tempfile
While such a simple example may seem artificial, I am working on a way to reference the actual lines where duplicates appear. That may "break" the simple Group cmdlet usage shown above, but if you are in a hurry, these steps can save you time with minimal effort.
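As a rough sketch of where that might go, here is one way to report line numbers for each duplicate, working against the $filecontents array from above; the loop and variable names are just illustrative, not part of the original steps:
# Sketch: report the 1-based line numbers where each duplicated value appears
$duplicates = $filecontents | Group | Where {$_.count -gt 1}
foreach ($dup in $duplicates) {
    # Walk the array and collect the positions that match this value
    $lineNumbers = for ($i = 0; $i -lt $filecontents.Count; $i++) {
        if ($filecontents[$i] -eq $dup.Name) { $i + 1 }
    }
    "Value {0} appears on lines: {1}" -f $dup.Name, ($lineNumbers -join ', ')
}
With the dummy data, this would report that 1 appears on lines 1, 5, 8, and 10, and so on for the other duplicated values.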