There are utilities that analyze access logs; one of them is Webalizer. But I have always believed that getting to know your access log files yourself is more appropriate when you want to do a more in-depth analysis of something specific.
Several commands are available, such as grep, cut and sed, that you can use for different scenarios depending on what kind of information you actually need. In this post I will focus on the awk command and working with access logs. awk is a pattern-directed scanning and processing language, and a very powerful one indeed.
awk manual can be found here
Here is an example of Apache access log entries:
192.168.0.92 - - [14/Apr/2015:18:42:12 -0400] "POST /reports/total_sales.php HTTP/1.1" 200 3063 "http://test/reports/total_sales.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.114 - - [14/Apr/2015:18:39:31 -0400] "POST /reports/sales_by_country.php HTTP/1.1" 200 18580 "http://test/reports/sales_by_country.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.92 - - [14/Apr/2015:18:45:04 -0400] "GET /reports/sales_by_product.php HTTP/1.1" 200 2356 "http://test/reports/sales_kpis.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:45:12 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:45:25 -0400] "GET /reports/sales_by_product.php HTTP/1.1" 200 2356 "http://test/reports/sales_kpis.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:46:21 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20540 "http://test/reports/sales_by_product.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:47:28 -0400] "GET /reports/sales_by_product.php HTTP/1.1" 200 2356 "http://test/reports/sales_kpis.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:47:36 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 2075 "http://test/reports/sales_by_product.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.92 - - [14/Apr/2015:18:47:40 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20387 "http://test/reports/sales_by_product.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
Each entry above has the following format. Columns are whitespace delimited:
column 1 = IP address of the client
column 2 = RFC 1413 identity of the client
column 3 = user ID, if available
column 4-5 = date, time and time zone at which the request was received
column 6-7 = HTTP method and requested resource
column 8 = HTTP protocol version
column 9 = HTTP status code
column 10 = size of the data returned to the client, in bytes
column 11 = referer
column 12 onward = user agent (columns 12-22 for the Chrome entries above; the count varies with the user agent string)
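Because the user agent itself contains spaces, the number of whitespace-delimited columns is not the same for every entry. Here is a minimal sketch that shows this using awk's built-in NF variable (the number of fields on the current line). The two entries are copied from the example log above; the temporary file and the variable names are only there to make the snippet self-contained.

```shell
#!/bin/sh
# Two sample entries copied from the example log above.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.168.0.92 - - [14/Apr/2015:18:42:12 -0400] "POST /reports/total_sales.php HTTP/1.1" 200 3063 "http://test/reports/total_sales.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
EOF

# Print the client IP followed by the number of whitespace-delimited fields.
field_counts=$(awk '{print $1, NF}' "$log")
printf '%s\n' "$field_counts"
# prints:
# 192.168.0.92 31
# 192.168.0.114 22
```

So anything past column 11 is better handled with a different delimiter than by counting columns.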
To extract the above columns individually, we can use the following commands:
awk '{print $1}' path/to/access_log # IP address of the client
awk '{print $2}' path/to/access_log # identity of the client
awk '{print $3}' path/to/access_log # user ID
awk '{print $4,$5}' path/to/access_log # date and time when the resource was requested
awk '{print $6,$7}' path/to/access_log # request method and requested resource
awk '{print $8}' path/to/access_log # HTTP protocol with version
awk '{print $9}' path/to/access_log # status code
awk '{print $10}' path/to/access_log # size of the data returned
... and so on
You will see that the commands above use whitespace as the delimiter. We can change this with the -F option, as shown below, where a double quote is the delimiter:
awk -F\" '{print $6}' path/to/access_log # returns the full user agent string
awk -F\" '{print $2}' path/to/access_log # returns the request line (method, requested resource and protocol)
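To make it clearer which piece of the entry lands in which field once the double quote is the delimiter, here is a small sketch that labels the even-numbered fields for one entry. The entry is copied from the example log above; the labels, the temporary file and the variable names are my own additions for illustration.

```shell
#!/bin/sh
# One sample entry copied from the example log above.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
EOF

# With " as the delimiter, the quoted parts of the entry are fields 2, 4 and 6.
quoted_parts=$(awk -F'"' '{
    print "request: " $2
    print "referer: " $4
    print "agent: " $6
}' "$log")
printf '%s\n' "$quoted_parts"
```

The odd-numbered fields hold what sits between the quoted parts: $1 is everything before the request line (IP, identity, user ID and timestamp) and $3 is the status code and size.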
Let’s work through some scenarios now. Say you want a list of the unique IP addresses in your logs; you can run this command to get it:
awk '{print $1}' path/to/access_log | sort | uniq
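If you would rather keep the IP addresses in the order they first appear, awk can do the de-duplication on its own with an associative array. This is just a sketch of that alternative, not a replacement for the command above; the entries are copied from the example log, and the array name `seen` is an arbitrary choice.

```shell
#!/bin/sh
# Three sample entries copied from the example log above.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.168.0.92 - - [14/Apr/2015:18:42:12 -0400] "POST /reports/total_sales.php HTTP/1.1" 200 3063 "http://test/reports/total_sales.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.114 - - [14/Apr/2015:18:39:31 -0400] "POST /reports/sales_by_country.php HTTP/1.1" 200 18580 "http://test/reports/sales_by_country.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
EOF

# seen[$1]++ is 0 (false) the first time an IP shows up, so !seen[$1]++
# is true only on the first occurrence of each address.
unique_ips=$(awk '!seen[$1]++ {print $1}' "$log")
printf '%s\n' "$unique_ips"
# prints:
# 192.168.0.92
# 192.168.0.114
```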
Or, if you want to see which IP addresses have been accessing a specific resource, you can use either of these commands:
awk '($7 ~ /revenue_by_product\.php/){print $1}' path/to/access_log
# OR with delimiter "
awk -F\" '($2 ~ /revenue_by_product\.php/){print $1}' path/to/access_log # this will also print date and time
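When the resource name comes from a shell variable, or you simply don't want to escape dots in a regular expression, one option is to pass it in with awk's -v option and compare column 7 for exact equality. A rough sketch, assuming you know the full path of the resource; the entries are copied from the example log above and the variable name `res` is made up for this example.

```shell
#!/bin/sh
# Three sample entries copied from the example log above.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.168.0.92 - - [14/Apr/2015:18:42:12 -0400] "POST /reports/total_sales.php HTTP/1.1" 200 3063 "http://test/reports/total_sales.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.92 - - [14/Apr/2015:18:45:04 -0400] "GET /reports/sales_by_product.php HTTP/1.1" 200 2356 "http://test/reports/sales_kpis.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
EOF

# -v hands the resource to awk as a variable; == is an exact string
# comparison, so nothing in the path is treated as a regex metacharacter.
resource_ips=$(awk -v res="/reports/sales_by_product.php" '$7 == res {print $1}' "$log" | sort -u)
printf '%s\n' "$resource_ips"
# prints:
# 192.168.0.114
# 192.168.0.92
```

Keep in mind that an exact comparison will not match the same resource requested with a query string, so the pattern match is still the better fit in that case.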
Check whether requests are coming from automated scripts
When checking for automated scripts, we look for an empty user agent value, because these scripts generally don’t send any user agent information.
Here is the command that you can use
awk -F\" '($6 ~ /^-?$/)' /var/log/httpd/access_log | awk '{print $1}' | sort | uniq
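The same check can also be done in a single awk call by splitting the first quote-delimited field to pull out the IP address. This is only a sketch: the second entry below is a hypothetical one I made up to have a request with a "-" user agent (it is not taken from a real log), and the array name `head` is arbitrary.

```shell
#!/bin/sh
log=$(mktemp)
trap 'rm -f "$log"' EXIT
# First entry: copied from the example log above.
# Second entry: HYPOTHETICAL, invented here to illustrate a missing user agent.
cat > "$log" <<'EOF'
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.200 - - [14/Apr/2015:18:50:00 -0400] "GET /reports/total_sales.php HTTP/1.1" 200 3063 "-" "-"
EOF

# $6 is the user agent when " is the delimiter; $1 is everything before the
# request line, so splitting it on whitespace puts the IP in head[1].
agentless_ips=$(awk -F'"' '$6 ~ /^-?$/ {split($1, head, " "); print head[1]}' "$log" | sort -u)
printf '%s\n' "$agentless_ips"
# prints:
# 192.168.0.200
```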
Check how many times a resource has been requested
You can use the following command
awk '{print $7}' /var/log/httpd/access_log | sort | uniq -c | sort -rn
# the command above produces output similar to this
7 /reports/revenue_by_product.php
2 /autopilot/stats.php
1 /reports/total_orders.php
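The counting can also happen inside awk, with an associative array keyed by the resource, leaving only the final ordering to sort. Treat this as a sketch of the idea rather than the one right way to do it; the entries are copied from the example log above and the array name `hits` is my own.

```shell
#!/bin/sh
# Three sample entries copied from the example log above.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
cat > "$log" <<'EOF'
192.168.0.92 - - [14/Apr/2015:18:42:12 -0400] "POST /reports/total_sales.php HTTP/1.1" 200 3063 "http://test/reports/total_sales.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.92 - - [14/Apr/2015:18:45:04 -0400] "GET /reports/sales_by_product.php HTTP/1.1" 200 2356 "http://test/reports/sales_kpis.php" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; InfoPath.2)"
EOF

# Tally requests per resource, print "count resource" once the whole file
# has been read, then sort numerically with the highest count first.
resource_hits=$(awk '{hits[$7]++} END {for (r in hits) print hits[r], r}' "$log" | sort -rn)
printf '%s\n' "$resource_hits"
# prints:
# 2 /reports/sales_by_product.php
# 1 /reports/total_sales.php
```

The order in which awk walks the array in the END block is unspecified, which is why the output still goes through sort.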
Identify issues with your web resources
Generally we will be looking for 404 errors. You can get this kind of report from Google Analytics as well, but Google won’t list errors for internally linked resources. You can also use your browser's developer tools to check whether 404 errors are being produced on a certain page. Let’s use awk to do that now:
awk '($9 ~ /404/)' /var/log/httpd/access_log | awk '{print $9,$7}' | sort
We check column 9 by pattern matching it against the string 404. We then pipe the output to a second awk command to print the required data, and pipe that to the sort command to order the output.
We can also use something like this:
awk -F\" '{split($3,myarray," "); if(myarray[1] == "404")print "Resource: ",$2," Referer: ",$4," Status code and data size: ",$3}' /var/log/httpd/access_log | sort | uniq
The above produces a similar result without using two awk commands. Note that sort comes before uniq, because uniq only removes duplicate lines that are adjacent.
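If you stay with the whitespace-delimited columns, a third variant is to compare column 9 with 404 directly and print the resource together with the page that linked to it. This is a sketch under assumptions: the second entry below is hypothetical (I invented it to have a 404 to show, it is not from a real log), and the "linked from" wording and the variable name `referer` are my own.

```shell
#!/bin/sh
log=$(mktemp)
trap 'rm -f "$log"' EXIT
# First entry: copied from the example log above.
# Second entry: HYPOTHETICAL, invented here to illustrate a 404 response.
cat > "$log" <<'EOF'
192.168.0.114 - - [14/Apr/2015:18:39:32 -0400] "POST /reports/sales_by_product.php HTTP/1.1" 200 20522 "http://test/reports/sales_by_product.php" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
192.168.0.92 - - [14/Apr/2015:18:48:02 -0400] "GET /reports/missing.php HTTP/1.1" 404 209 "http://test/reports/sales_kpis.php" "-"
EOF

# $9 == 404 compares the status code as a number; $11 is the referer,
# still wrapped in double quotes, so strip them before printing.
not_found=$(awk '$9 == 404 {
    referer = $11
    gsub(/"/, "", referer)
    print $7, "linked from", referer
}' "$log" | sort | uniq)
printf '%s\n' "$not_found"
# prints:
# /reports/missing.php linked from http://test/reports/sales_kpis.php
```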
There is so much more you can do with awk. All you have to do is understand how the command works, and you can combine it with other commands such as sed to customize your output.
If you are using awk to achieve other things, please do leave a comment.