Summary: is there a more efficient way to get the unique lines of a file, along with the number of occurrences of each, than sort | uniq -c | sort -n?
Details: When doing log analysis I often pipe through sort | uniq -c | sort -n to get a general trend of which log entries show up the most or least. This works well most of the time, except when I'm dealing with a very large log file that contains a very large number of duplicates, in which case sort | uniq -c ends up taking a long time.
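For concreteness, here is what that pipeline does on a small sample (the sample input is mine, not from the original logs): sort groups duplicate lines together, uniq -c prefixes each group with its count, and the final sort -n orders the results by frequency.

```shell
# Sample input: 'a' appears 3 times, 'b' twice, 'c' once.
printf 'a\nb\na\nc\na\nb\n' |
  sort | uniq -c | sort -n
# Prints, in ascending order of count: "1 c", "2 b", "3 a"
# (uniq -c right-aligns the counts with leading spaces).
```

The cost problem is the first sort: it must order the entire input, all duplicates included, before uniq can collapse anything.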
Example: The specific case I'm facing right now is getting a trend from an 'un-parametrized' MySQL bin log to find out which queries are run the most. For a file of a million entries, which I pass through a grep/sed combination to remove parameters, resulting in about 150 unique lines, I spend about 3 seconds on the grep/sed step and about 15 seconds on the sort/uniq step.
Currently I've settled on a simple C++ program that maintains a map of <line, count>, which does the job in under a second, but I was wondering whether an existing utility already does this.
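One existing-utility approach (my suggestion, not from the question) is awk, whose associative arrays give the same line-to-count map in a single pass over the input; only the handful of unique lines then needs sorting. A sketch, using sample input in place of the real log stream:

```shell
# Build a line -> count hash map in one pass, then sort only the
# unique lines (about 150 in the case above) by count.
printf 'a\nb\na\nc\na\nb\n' |
  awk '{count[$0]++} END {for (line in count) print count[line], line}' |
  sort -n
# Prints: "1 c", "2 b", "3 a"
```

This avoids sorting the full million-line input, so it should scale much like the C++ map program, at the cost of holding one map entry per unique line in memory.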