banner

For a full list of BASHing data blog posts, see the index page.     RSS


Pseudo-blank ("empty") records and fields

It's easy to find records in a data table that are entirely blank or only contain tabs or whitespace. Just ask AWK to print the line numbers of any records that have no fields:

awk 'NF==0 {print NR}' datatable
 
or
 
awk '!NF {print NR}' datatable

blank

Blank records are fairly rare in real-world datasets. You're much more likely to find records that contain one or more essential data items (like a unique record ID) and nothing else, or maybe a "?" or "-" or two. These are "pseudo-blank" records. Three examples are on lines 4, 9 and 12 in the tab-separated table below, "fileB":

RecordIDLast_nameFirst_nameCustomerIDModified
10001AlbeeMichaelmel141a2018-08-01
10002AllanRebeccawar206a2018-08-01
10003???2018-08-01
10004Allan?war206b2018-08-01
10005AllenGeorginamel097a2018-08-01
10006AmadoPaolinamel141b2018-08-01
10007AndrewarthaHerbertbri129a2018-08-01
100082018-08-01
10009AndrewsSamuelsyd322a2018-08-01
10010AntillJessicahob111a2018-08-01
10011---2018-08-01

Cases like this are messy, but a general strategy for finding pseudo-blank records might be to count the fields with unexpected content, like fields without alphanumeric characters, then tally up the records by their suspect field counts:

awk -F"\t" 'BEGIN {c=0} {for (i=1;i<=NF;i++) if ($i !~ /[[:alnum:]]/) {c++}} {print "lines with "c" suspect fields"; c=0}' datatable | sort -r | uniq -c

count1

In the first part of the command, AWK goes through each tab-separated field in turn, testing it to see if it lacks alphanumeric characters. If it does, the variable "c" is incremented from zero, which was the value initialized in the BEGIN statement. When all fields in the record have been tested, AWK prints the field count "c", then resets "c" to zero. The output of the AWK command is fed to sort -r and uniq -c for a field count tally.

The next step is to print the records with the highest suspect field counts:

count2
 

A data table might have a complete header line but only blank data items in some fields. These fields aren't completely empty, because they have a field name in the header, so they're "pseudo-blank". Finding these fields is fairly straightforward with AWK, and to do this I use the "empties" script explained below. To demo the script I'll use "fileC":

Field1Field2Field3Field4Field5
aaaaaa
aaaaaaaaa
aaaaaa
aaaaaaaaa
aaaaaa
aaaaaa

The core of the script is a command that finds any fields that are empty below the header, and lists their field numbers:

awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' datatable

script1

AWK here ignores the header line ("NR>1") and goes through the fields on each line one by one. For each field the "length" function calculates the number of characters and adds that number to an array "a" indexed by field number. An empty field will have a total character number of zero, which is what AWK looks for in the END statement when it walks through the array "a".

The complete "empties" script is shown below. The list of empty fields is stored in a temp file ("/tmp/flds"), as is a numbered list of all fields in the table ("/tmp/allflds"). If there are no empty fields in the table, the script reports that and exits. Otherwise, an AWK command uses an array to add a colon and a field name to each of the empty field numbers, saving the output as the file "[filename]_emptyfields". The number of empty fields is reported on screen, and the script asks if I'd like the data table with the empty fields removed. If I say yes, the new table is built with a name I choose.

#!/bin/bash
 
awk -F"\t" 'NR>1 {for (i=1;i<=NF;i++) a[i]+=length($i)} END {for (j in a) {if (a[j]==0) print j}}' "$1" > /tmp/flds
head -n1 "$1" | tr '\t' '\n' | nl -w1 > /tmp/allflds
 
if [ ! -s /tmp/flds ]; then
   echo
   echo "No empty fields in $1 table" && rm /tmp/flds /tmp/allflds && exit
else
   awk -F"\t" 'FNR==NR {b[$1]=$2; next} $1 in b {print $1":"b[$1]}' /tmp/allflds /tmp/flds > "$1"_emptyfields
   echo
   echo "$(wc -l < "$1"_emptyfields) empty field(s) in "$1" table"
   echo
   read -p "Build table with non-empty fields only? (y/n) " build
   if [[ "$build" == "y" ]]; then
      read -p "What name for the new table? " name
      cut --complement -f$(paste -d',' -s /tmp/flds) "$1" > "$name"
   else
      rm /tmp/flds /tmp/allflds && exit
   fi
fi
rm /tmp/flds /tmp/allflds
exit

script2

This script is much faster than the "empties" script I published in 2017 on the Linux Rain blog and previously in A Data Cleaner's Cookbook.


Last update: 2018-08-04