GUI ways to view and edit big text files
I do my data auditing on the command line, but it's sometimes nice to have the kind of overview of the data that you get with a spreadsheet. Unfortunately, this isn't possible with seriously large data tables. The current table size limits in Microsoft Excel are 1,048,576 rows, 16,384 columns and 32,767 characters per cell. LibreOffice Calc has the same row and character limits but tops out at 1,024 columns. Gnumeric (my favourite spreadsheet) allows 16,777,216 rows and 16,384 columns.
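Whether a particular table would fit inside those limits can be checked before trying to open it. The following is a minimal Python sketch, not a tested tool: it assumes a plain tab-separated file with no quoted fields containing tabs or line breaks, the function names (table_shape, fits, report) are made up for illustration, and the Gnumeric per-cell character limit is left as None because no figure is given above.

```python
# Limits as quoted above: (rows, columns, characters per cell).
# None means no figure is given in the text.
LIMITS = {
    "Excel": (1_048_576, 16_384, 32_767),
    "LibreOffice Calc": (1_048_576, 1_024, 32_767),
    "Gnumeric": (16_777_216, 16_384, None),
}


def table_shape(lines, sep="\t"):
    """Return (rows, widest row in fields, longest cell in characters)."""
    rows = cols = longest = 0
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        rows += 1
        cols = max(cols, len(fields))
        longest = max(longest, max(len(f) for f in fields))
    return rows, cols, longest


def fits(shape, limits):
    """True if every measured dimension is within the stated limit."""
    return all(lim is None or got <= lim for got, lim in zip(shape, limits))


def report(path, encoding="utf-8"):
    """Map each spreadsheet name to True/False for the file at path."""
    # newline="" stops Python from translating line endings while reading
    with open(path, encoding=encoding, newline="") as handle:
        shape = table_shape(handle)
    return {name: fits(shape, lims) for name, lims in LIMITS.items()}
```

Calling report("mytable.tsv") on a hypothetical file returns a dictionary of spreadsheet names and True/False values. Passing this check only means the table is within the hard limits; it says nothing about how quickly the spreadsheet will open or work with it.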
These limits, however, don't mean much in practice, because a spreadsheet bogs down long before a big dataset reaches them. How efficiently a spreadsheet works is determined by your system's memory resources and by the nature of your data items. Not only might it take an annoyingly long time for a big table to open as a spreadsheet, but operations like searching, moving around, global editing and file saving can then be painfully slo-o-o-o-o-ow.
GUI spreadsheets are impractical, in my experience, for viewing or editing tables with more than about 50,000 rows and 100 columns. Four viewing/editing alternatives for Linux are noted below; in the screenshots I've reduced their GUI window sizes to suit this blog's column width. Missing from the list is UltraEdit for Linux, a very capable GUI text editor which (last time I checked) cost US$100 per copy. The four below are free:
glogg is a text file viewer, not a text editor, but it opens large files very quickly and does fast grep-style searching. My tab-separated, UTF-8 test file "ver1" is about 160 MB and has 266,133 lines, 45 fields and 160,614,235 characters.
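Figures like the line, field and character counts for "ver1" can be gathered in one pass over a file. This is a hedged Python sketch of one way to do it, not necessarily how the numbers above were obtained. It assumes tab-separated text with no tabs or line breaks inside fields, counts characters the way `wc -m` would (tabs and line endings included), and uses a hypothetical function name, summarise.

```python
from collections import Counter


def summarise(lines, sep="\t"):
    """Return (line count, character count, {fields per line: number of lines})."""
    line_count = 0
    char_count = 0
    field_counts = Counter()
    for line in lines:
        line_count += 1
        char_count += len(line)  # includes tabs and the line ending
        field_counts[line.rstrip("\r\n").count(sep) + 1] += 1
    return line_count, char_count, dict(field_counts)


def summarise_file(path, encoding="utf-8"):
    # newline="" keeps line endings untranslated so the character count is honest
    with open(path, encoding=encoding, newline="") as handle:
        return summarise(handle)
```

A well-formed table gives a dictionary with a single key, for example {45: 266133}. More than one key means some lines have too many or too few fields, which is worth knowing before opening the file in any of the viewers below.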
gvim is a GUI version of the vim command-line editor (GUI enabled by the vim-gtk3 and vim-gui-common packages). gvim loads huge files in an instant and has display options that make data overview easy:
Editing is another story. Unless you're familiar with vim and its keyboard commands you'll be struggling to edit your data, and if you are an experienced vim user, you might not want a GUI. (Well-known joke: "I've been using vim for two years now...because I haven't figured out how to exit it".)
Geany is a feature-rich text editor that can also open very large files. You need to be patient, though. It took Geany about a minute and a half to open "ver1" on a system with 8 GB of RAM, although the wait was worth it. The table is displayed with line numbers and with arrows as tab markers (I've wrapped the lines):
Searching, moving around, editing and saving are fast in Geany and there are lots of keyboard shortcuts, including the easy-to-remember Ctrl+L (the letter L) for a "go to line" dialog. A handy feature is text replacement within a selection of lines, which can act like "copy down" in a spreadsheet. Block editing of columns (vertically selected data items) is also possible if line-wrap is turned off and data items are neatly aligned. Geany is both my everyday and my data-auditing text editor.
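The idea behind replacing text only within a selection of lines can be sketched outside the editor as well. The Python below is an illustration of the concept and not Geany's implementation; the function name replace_in_range and the sample values are hypothetical.

```python
def replace_in_range(lines, first, last, old, new):
    """Replace old with new only on lines first..last (1-based, inclusive)."""
    return [
        line.replace(old, new) if first <= number <= last else line
        for number, line in enumerate(lines, start=1)
    ]


# Sample rows, invented for this example
rows = ["1\tPhysicalObject", "2\tPhysicalObject", "3\tPhysicalObject"]
changed = replace_in_range(rows, 2, 3, "PhysicalObject", "HumanObservation")
```

Here only lines 2 and 3 are changed and line 1 is left alone, which is the "copy down" effect: one value pushed into a chosen block of rows without touching the rest of the table.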
csvpad is a sort of stripped-down spreadsheet that gives an excellent overview of a data table and allows for editing of individual data items. The screenshot below shows the top left of "ver1" after I've changed a couple of the "PhysicalObject" items to "HumanObservation":
There are a few drawbacks to csvpad. It's even slower than Geany in opening large files, it saves very slowly and you can only edit one data item at a time. csvpad will only open a file that has a filename suffix (I renamed "ver1" to "ver1.txt" for this demo) and when saving files it adds (alas) a Windows carriage return to every line.
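The added carriage returns are easy to remove after saving. Here is a minimal Python sketch, assuming the only unwanted bytes are the carriage returns in Windows-style line endings; the function names are made up for illustration and the file is handled as raw bytes, so the encoding is left untouched.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Convert Windows CRLF line endings to Unix LF, leaving other bytes alone."""
    return data.replace(b"\r\n", b"\n")


def fix_file(src, dst):
    """Write a copy of src to dst with CRLF line endings converted to LF."""
    with open(src, "rb") as infile, open(dst, "wb") as outfile:
        # Iterating a binary file splits on b"\n", so each CRLF stays in one chunk
        for line in infile:
            outfile.write(strip_carriage_returns(line))
```

For example, fix_file("ver1.txt", "ver1") would write a cleaned copy under the original name. Working line by line keeps memory use low even for a file the size of "ver1".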
Last update: 2018-07-31