While building out a database for a customer, I needed to create a reference table that would back a selection menu. We had the existing data from the legacy system as a CSV, but we didn’t have the column constraints, and the CSV wasn’t sorted by string length. Now, I could have tried to determine the longest value by hand in my text editor, or opened the file in Google Sheets or Microsoft Excel. But I knew I had many more reference tables to build, so I wanted a lightweight solution that didn’t require a lot of uploading and mouse movement.
Here is a mock of the data that was provided. Obviously in this small dataset, we can easily see the longest name. The actual dataset we had was 100+ lines long.
name, weight, height
This is a name, 0.2, 4.0
This is a much much much longer name, 0.3, 1.5
Shorter name, 0.12, 2.0
Short name, 0.5, 0.5
Name, 0.6, 1.0
This is a much longer name, 0.3, 2.0
Medium length name, 0.4, 3.5
Save this to a file called sample-data.csv.
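If you’d rather not leave the terminal at all, a here-document can create the file in one shot. Note that the header row here (name, weight, height) is a mock I’m assuming for this example:

```shell
# Create the sample file without leaving the terminal.
# The header row is a mock assumption for this walkthrough.
cat > sample-data.csv <<'EOF'
name, weight, height
This is a name, 0.2, 4.0
This is a much much much longer name, 0.3, 1.5
Shorter name, 0.12, 2.0
Short name, 0.5, 0.5
Name, 0.6, 1.0
This is a much longer name, 0.3, 2.0
Medium length name, 0.4, 3.5
EOF
```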
When finding solutions, I always strive to use tools that are native to the host machine. It’s a clear winner when they are available across all operating systems. My local machine is a MacBook Pro, and the following solution will work on any macOS or Linux host. It will also work on Microsoft Windows via Windows Subsystem for Linux. The tools being used are:
- Terminal
- awk
- sort
- tail
First, let’s isolate the column of interest. For this data set, that is column 1.
[bitmvr:~] $ awk -F',' 'NR > 1 { print length($1) " " $1 }' sample-data.csv
14 This is a name
36 This is a much much much longer name
12 Shorter name
10 Short name
4 Name
26 This is a much longer name
18 Medium length name
Output from running awk against sample-data.csv
Before we proceed, let’s discuss what this awk command is doing.
- awk: Text processing command-line utility.
- -F',': Sets the field separator to a comma.
- NR > 1: Skips the first line (header) of the file, as we don’t want to consider it part of the data analysis.
- '{ }': Executes the enclosed block for each line of the file.
- length($1): Gets the length of the first field (column).
- " ": Adds a space after the length.
- $1: Prints the value of the first field.
- sample-data.csv: The input file.
A few things to consider. We skipped the header because it isn’t part of the data we’re analyzing. Also, if we did not provide $1 as an argument to the length() function, we would actually print the length of the entire line. That would inflate the value we need when determining the length of the column for the database.
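To see that inflation in action, here’s a quick comparison of length($1) against length($0) (the whole record) on a single sample row:

```shell
# length($1) measures only the first field; length($0) measures the entire line.
echo 'Short name, 0.5, 0.5' | awk -F',' '{ print length($1), length($0) }'
# Prints: 10 20
```

The first field is only 10 characters, but the whole line is 20 — twice the constraint we actually need.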
Alright, next we need to sort this list. To accomplish that, we pipe (|) the output from awk into sort and pass the --numeric-sort flag so that the output is sorted in ascending numeric order.
[bitmvr:~] $ awk -F',' 'NR > 1 { print length($1) " " $1 }' sample-data.csv | sort --numeric-sort
4 Name
10 Short name
12 Shorter name
14 This is a name
18 Medium length name
26 This is a much longer name
36 This is a much much much longer name
Code example updated to pipe output from awk into sort so that it can be sorted numerically.
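If you’re curious why --numeric-sort matters: without it, sort compares the lengths as text, character by character, so “10” lands before “4”. A quick demonstration with a few made-up lines (LC_ALL=C pins the locale so the byte order is predictable):

```shell
# Plain sort compares text: '1' < '3' < '4', so 10 and 36 come before 4.
printf '4 Name\n10 Short name\n36 Longest\n' | LC_ALL=C sort
# Prints: 10 Short name / 36 Longest / 4 Name

# --numeric-sort compares the leading number as a number.
printf '4 Name\n10 Short name\n36 Longest\n' | sort --numeric-sort
# Prints: 4 Name / 10 Short name / 36 Longest
```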
Perfect. Now we know that the longest value in the column is 36 characters long, and we can use that information when setting the constraints on the column. If your dataset is quite large, you may want to pipe the output of the sort command to tail and pass -1 (shorthand for -n 1) so that the only output is the last, largest row.
[bitmvr:~] $ awk -F',' 'NR > 1 { print length($1) " " $1 }' sample-data.csv | sort --numeric-sort | tail -1
36 This is a much much much longer name
Code Example: Displaying only the last line of the output by piping to tail.
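As an aside, awk can also track the maximum itself, which skips sort and tail entirely. This is just a sketch of an alternative approach, not the pipeline used above:

```shell
# Track the longest first field in a single awk pass, skipping the header,
# and print the maximum length and its value at the end.
awk -F',' 'NR > 1 && length($1) > max { max = length($1); name = $1 }
           END { print max " " name }' sample-data.csv
# Prints: 36 This is a much much much longer name
```

The three-command pipeline is arguably easier to read and debug step by step, but the single-pass version avoids sorting the whole list just to find one row.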
That’s it. We’ve isolated the longest value and can now use that data to properly determine the column constraint in our database. I hope you’ve enjoyed this brief article and are able to harness this approach for data processing in your daily life.
Cheers!
Jesse Riddle

