Your CSV is too big to be opened with Excel, or you need to manipulate your CSV in command line ?
Using MILLER, you can format, manipulate, cut, create columns based on other columns… everything you need to master a CSV.
Tested Configuration:
MacOS: Sierra 10.12
Terminal: iTerm2 & Miller 5.3.0
1. What is Miller (MLR) ?
If you know JQ, Miller is the same tool but for CSV. Just read this for the comparison.
Otherwise, here is a Miller description:
Miller is like awk, sed, cut, join, and sort for CSV. You get to work with your data using named fields, without needing to count positional column indices.
2. Main purpose of MLR:
- handle very large datasets : to clean and prepare your data
- quick statistics on your data
3. Biggest pros:
- streaming: most operations need only a single record in memory at a time (unlike Excel for instance). You can operate files larger than your RAM !!!
- pipe-friendly: you can combine actions
- conversion
- sort keep header first
- performance: written in C, with zero runtime dependencies
4. Get started in 2 min:
- Install Miller
brew update && brew install miller
Now, use your own csv file, or download this small csv by clicking here and start using miller !
5. Useful commands
#see first 2 rows in the terminal:
mlr --csv head -n 2 sample.csv
Note: replace head
by tail
to see le last two lines instead
#save the first 2 rows in a new file:
mlr --csv head -n 2 sample.csv > sampleOutput.csv
Note: mlr -I --csv head -n 2 sample.csv
will replace the previous file
#get a nice view of your 2 first rows:
mlr --icsv --opprint head -n 2 sample.csv
#format the output as 1 json per row:
mlr --icsv --ojson head -n 2 sample.csv
Note: use --ojson --jvstack --jlistwrap
to have an array of json
#get the number of rows:
mlr --csv stats1 -a count -f name sample.csv
Note: stats1
can help you do much more statistics if necessary
#filter:
mlr --icsv --opprint filter '$day == 1 && $name == "charles"' sample.csv
#Sort primarily alphabetically on NAME, then secondarily numerically descending on month:
mlr --icsv --opprint sort -f name -nr month sample.csv
#select only two columns:
mlr --icsv --opprint cut -f name,week sample.csv
Note: Use -o
to follow your specified order name,week
Use -x
to delete only specified columns
Combine several command in one line
# pipe (chain in a single line) command, using then
mlr --icsv --opprint cut -f name,week then cut -f name sample.csv
Manipulate columns
Initilly we had:
And using this command line
#add/replace columns which are computed from other columns:
mlr --icsv --opprint put '$nextyear = $year + 1; $wtfcolumn = $day . "_" . $year . "_" . ($week + $year); $name_upperclass = toupper($name)' sample.csv
We now have
Some command I often use
# sort
mlr --csv sort -f colomn1 sample.csv > sampleSorted.csv
# Check how many unique values for one specific column
mlr --csv count-distinct -f colomn1 -n sample.csv
#same thing (Check how many unique values) but you don't know the name of the first column
mlr count-distinct -f 1 -n sample.csv
#Remove duplicates
mlr --csv uniq -c -g column1 sample.csv > sampleNoDuplicates.csv
#remove -c if you don't want the additional column 'count' that tells you how many rows where grouped by for this unique row