AWK and manipulating CSVs
The manual is available online. A reminder on AWK’s built-in variables.
Basics
Print the header row (first row) in a CSV file
$ head FuelConsumption.csv | awk 'NR==1 {print $0}'
MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
Print the data
$ head FuelConsumption.csv | awk 'NR>1 {print $0}'
2014,ACURA,ILX,COMPACT,2,4,AS5,Z,9.9,6.7,8.5,33,196
2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6,5.8,5.9,48,136
2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244
2014,ACURA,RLX,MID-SIZE,3.5,6,AS6,Z,11.9,7.7,10,28,230
2014,ACURA,TL,MID-SIZE,3.5,6,AS6,Z,11.8,8.1,10.1,28,232
2014,ACURA,TL AWD,MID-SIZE,3.7,6,AS6,Z,12.8,9,11.1,25,255
2014,ACURA,TL AWD,MID-SIZE,3.7,6,M6,Z,13.4,9.5,11.6,24,267
Quick reminders,
NR
row number.NF
number of fields in a row.
For regular expressions comparisons, use ~
and !~
. You can also use operators such as &&
and ||
:
$ head FuelConsumption.csv | awk -F, 'NR>1 && $12>30 {print $0}'
Data
The following data is the output of
$ kaggle competitions list > kagglcompetitions
$ cat kagglecompetitions
ref deadline category reward teamCount userHasEntered
--------------------------------------------- ------------------- --------------- --------- --------- --------------
contradictory-my-dear-watson 2030-07-01 23:59:00 Getting Started Prizes 96 False
gan-getting-started 2030-07-01 23:59:00 Getting Started Prizes 203 False
tpu-getting-started 2030-06-03 23:59:00 Getting Started Knowledge 462 False
digit-recognizer 2030-01-01 00:00:00 Getting Started Knowledge 3149 False
titanic 2030-01-01 00:00:00 Getting Started Knowledge 25172 False
house-prices-advanced-regression-techniques 2030-01-01 00:00:00 Getting Started Knowledge 6305 False
connectx 2030-01-01 00:00:00 Getting Started Knowledge 562 False
nlp-getting-started 2030-01-01 00:00:00 Getting Started Knowledge 1631 False
competitive-data-science-predict-future-sales 2022-12-31 23:59:00 Playground Kudos 10695 False
jane-street-market-prediction 2021-08-23 23:59:00 Featured $100,000 4245 False
hungry-geese 2021-07-26 23:59:00 Playground Prizes 395 False
bms-molecular-translation 2021-06-02 23:59:00 Featured $50,000 281 False
iwildcam2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 4 False
herbarium-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 23 False
plant-pathology-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 60 False
hotel-id-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 20 False
hashcode-2021-oqr-extension 2021-05-25 23:59:00 Playground Knowledge 90 False
indoor-location-navigation 2021-05-17 23:59:00 Research $10,000 650 False
hpa-single-cell-image-classification 2021-05-11 23:59:00 Featured $25,000 374 False
shopee-product-matching 2021-05-10 23:59:00 Featured $30,000 570 False
Filtering
Find only the competitions that have a monetary reward, and sort them.
$ cat kagglecompetitions | awk '$5~/^\$/{print $0}' | sort -k5 -r
jane-street-market-prediction 2021-08-23 23:59:00 Featured $100,000 4245 False
bms-molecular-translation 2021-06-02 23:59:00 Featured $50,000 281 False
shopee-product-matching 2021-05-10 23:59:00 Featured $30,000 570 False
hpa-single-cell-image-classification 2021-05-11 23:59:00 Featured $25,000 374 False
indoor-location-navigation 2021-05-17 23:59:00 Research $10,000 650 False
Sorting
You can find more information here.
$ cat kagglecompetitions | sed -n '3,$ p' | sort -k2n,2
bms-molecular-translation 2021-06-02 23:59:00 Featured $50,000 281 False
hashcode-2021-oqr-extension 2021-05-25 23:59:00 Playground Knowledge 90 False
herbarium-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 23 False
hotel-id-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 20 False
hpa-single-cell-image-classification 2021-05-11 23:59:00 Featured $25,000 374 False
hungry-geese 2021-07-26 23:59:00 Playground Prizes 395 False
indoor-location-navigation 2021-05-17 23:59:00 Research $10,000 650 False
iwildcam2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 4 False
jane-street-market-prediction 2021-08-23 23:59:00 Featured $100,000 4245 False
plant-pathology-2021-fgvc8 2021-05-26 23:59:00 Research Knowledge 60 False
shopee-product-matching 2021-05-10 23:59:00 Featured $30,000 570 False
competitive-data-science-predict-future-sales 2022-12-31 23:59:00 Playground Kudos 10695 False
connectx 2030-01-01 00:00:00 Getting Started Knowledge 562 False
contradictory-my-dear-watson 2030-07-01 23:59:00 Getting Started Prizes 96 False
digit-recognizer 2030-01-01 00:00:00 Getting Started Knowledge 3149 False
gan-getting-started 2030-07-01 23:59:00 Getting Started Prizes 203 False
house-prices-advanced-regression-techniques 2030-01-01 00:00:00 Getting Started Knowledge 6305 False
nlp-getting-started 2030-01-01 00:00:00 Getting Started Knowledge 1631 False
titanic 2030-01-01 00:00:00 Getting Started Knowledge 25172 False
tpu-getting-started 2030-06-03 23:59:00 Getting Started Knowledge 462 False
The following command does the following
- Read the
csv.csv
which contains some financial transactions in CSV format - print the 1st and 3rd fields as tab separated values (remember that you can use the
column -t -s,
command to do this). - truncate the 3rd column
$ awk -F',' 'BEGIN{ OFS="\t" } { print $1,$3 }' csv.csv | sed 's/\s\(...\).*/ \1***/g'
07/03/2021 PHR***
07/03/2021 Ama***
07/03/2021 PAY***
07/03/2021 PAY***
06/03/2021 PAY***
06/03/2021 PAY***
06/03/2021 PAY***
06/03/2021 PAY***
06/03/2021 PAY***
06/03/2021 PAY***
...
Summation
Sum the transactions in the 5th field
$ cat csv.csv |cut -d, -f5 | gawk '{ sum += $1 }; END { print sum }'
-1895.51