Consider the Hardware
It's a common opinion among developers, managers, and users alike that software that doesn't run fast enough just needs faster hardware. This line of thinking is not necessarily wrong, but like the misuse of antibiotics, it can grow over time into a big problem that is hard to solve. Unfortunately, software development has been abstracted so much that many developers do not consider the hardware at all. Furthermore, there is often a direct conflict between best programming practices and writing code that screams on the given hardware.
First let's look at something that doesn't conflict with most developer shops' accepted best practices: making the most of your CPU's instruction prefetching and branch prediction. The prediction logic and heuristics are not trivial, but there is something trivial you can do in your branching logic to make them work for you. Jump tables aside, a branch in your code can only go two ways: one way if a condition is met, another if it is not. Most CPUs fetch and begin processing code that hasn't been reached yet by "guessing" the path your program will follow before it gets there. If the CPU discovers that a guess was incorrect, all the work done on the "wrong branch" is useless and must be thrown away, which costs CPU cycles. So to befriend the CPU you need to know how to make this guessing work for you, and fortunately it's quite simple. If you code your branch logic so that the MOST FREQUENT RESULT is the condition that is tested for, you help the CPU guess "correctly" more often, resulting in fewer expensive mispredictions. This may sometimes read a little awkwardly, but systematically applying the technique over time will decrease your code's execution time.
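The branch-ordering advice above can be sketched in C roughly as follows. This is a minimal illustration, not a benchmark: the function name sum_valid, the LIKELY macro, and the assumption that negative values are the rare case are all hypothetical. The __builtin_expect hint is a GCC and Clang builtin; on other compilers the macro falls back to a plain test. Whether any of this helps depends on the compiler and the CPU, since many processors also learn branch behavior at run time, so measure before relying on it.

```c
#include <stddef.h>

/* Hint macro (an assumption for this sketch): GCC and Clang accept
   __builtin_expect; on other compilers the macro is just the condition. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

/* Hypothetical example: sum the non-negative values and count the
   negative ones, which we assume are rare in real input. */
long sum_valid(const int *values, size_t count, size_t *rejected)
{
    long total = 0;
    *rejected = 0;

    for (size_t i = 0; i < count; i++) {
        if (LIKELY(values[i] >= 0)) {
            /* Most frequent result: tested for first. */
            total += values[i];
        } else {
            /* Rare result: kept out of the common path. */
            (*rejected)++;
        }
    }
    return total;
}
```

The point of the layout is only that the condition being tested is the one expected to hold most of the time; the hint macro makes that expectation explicit to compilers that understand it.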
Now, let's look at some of the conflicts between writing code for the hardware and writing for mainstream best practices. It's common practice to write many small functions in favor of larger ones that are harder to maintain. This is perfectly sound from an abstract point of view, but unless the compiler inlines it, every function call requires moving data to and from the stack, both to prepare for the call and to return properly from it. Applications written in this style can end up spending more time getting ready for and recovering from work than doing actual work. Some developers might cringe at the mention of it, but the often-ridiculed GOTO is the most efficient way to get around a code block, rivaled only by the lightning-fast jump table, another hardware-savvy technique that many developers aren't even aware of.
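For readers who haven't met a jump table, one way to express the idea in C is an array of function pointers indexed by an operation code. The sketch below is illustrative only: the opcode names, the op_table array, and the dispatch function are hypothetical, and a compiler may well generate a similar table on its own from a dense switch statement.

```c
/* Each operation shares one signature so it can live in the table. */
typedef int (*op_fn)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }

/* Hypothetical opcodes; OP_COUNT doubles as the table size. */
enum opcode { OP_ADD, OP_SUB, OP_MUL, OP_COUNT };

/* The jump table: the opcode is used directly as an index. */
static const op_fn op_table[OP_COUNT] = {
    [OP_ADD] = op_add,
    [OP_SUB] = op_sub,
    [OP_MUL] = op_mul,
};

/* Dispatch with one bounds check and one indexed indirect call,
   instead of a chain of comparisons.
   Returns 0 on success, -1 if the opcode is out of range. */
int dispatch(unsigned op, int a, int b, int *result)
{
    if (op >= OP_COUNT) {
        return -1;
    }
    *result = op_table[op](a, b);
    return 0;
}
```

The cost of choosing an operation stays the same no matter how many entries the table holds, which is what makes the technique attractive when there are many cases.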