The HLT instruction implements the shallowest idle power state (C-State) available for an individual thread, whereas the MWAIT instruction allows you to request all available idle power states as well as sub-states.
At the hardware level, executing HLT is equivalent to executing MWAIT with a state hint of 0. This puts the processor in the C1 state, which is clock gating for the core. If you want to enter deeper C-States in order to power gate the core and potentially power gate the package, you must use MWAIT.
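To make the "state hint" concrete, here is a minimal sketch of how the hint value passed to MWAIT in EAX is laid out, going by the Intel SDM description: bits 7:4 hold the target C-State minus one and bits 3:0 hold the sub-state. The helper name `mwait_hint` is my own and not part of any real API. Which hints a given CPU actually accepts is model-specific (CPUID leaf 5 enumerates them), so treat the numbers as illustrative.

```c
#include <stdint.h>

/*
 * Build the EAX hint for MWAIT (hypothetical helper).
 *   bits 7:4 = target C-State minus 1 (0 -> C1, 1 -> C2, ...)
 *   bits 3:0 = sub-state within that C-State
 * Passing cstate == 0 wraps to 0xF in bits 7:4, which is the encoding for C0.
 */
static inline uint32_t mwait_hint(unsigned cstate, unsigned substate)
{
    return (((cstate - 1u) & 0xFu) << 4) | (substate & 0xFu);
}
```

With this layout, `mwait_hint(1, 0)` yields 0, which is the "state hint of 0" that matches what HLT gives you.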
There's always a tradeoff between power savings and exit latency for the various power states. The deeper the C-State, the greater the power savings, but the longer it takes to exit the C-State. You should also note that modern x86 processors will limit the depth of the power state based on the frequency of interrupts (e.g. if you're receiving break events every 1 µs, hardware will not attempt to enter a C-State with a 2 µs exit latency).
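The tradeoff can be modelled in a few lines. This is only a sketch of the selection rule (pick the deepest state whose exit latency still fits inside the expected idle time), not how any particular processor or OS governor implements it. The struct, the function name, and the latency numbers are all made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

struct idle_state {
    const char *name;
    uint32_t exit_latency_us;   /* illustrative values, not real hardware data */
};

/* Hypothetical table, ordered from shallowest to deepest. */
const struct idle_state example_states[] = {
    { "C1",   1 },
    { "C3",  20 },
    { "C6", 100 },
};

/*
 * Return the index of the deepest state whose exit latency does not exceed
 * the expected idle time. Assumes the table is sorted by increasing latency
 * and always falls back to the shallowest state (index 0).
 */
size_t pick_idle_state(const struct idle_state *states, size_t n,
                       uint32_t expected_idle_us)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (states[i].exit_latency_us <= expected_idle_us)
            best = i;
    }
    return best;
}
```

With break events arriving every 1 µs, this model never leaves index 0, which mirrors the behaviour described above.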
In addition to the hardware limiting which C-State is actually entered, some C-States may only be entered through coordination between threads. For instance, on an Intel x86 processor with Hyper-Threading, both threads in a core must request a power-gated C-State for power-gating to occur at the core level, and likewise all cores in a package must request a package-level power-gated C-State for power-gating to occur at the package level. The hardware generally abides by the shallowest request, so if one thread requests C1 and another requests C3, the processor enters C1.
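As a rough model of that coordination rule (a sketch only; the real resolution happens in hardware and the function name is my own), the state a core ends up in is simply the minimum of what its threads asked for:

```c
#include <stddef.h>

/*
 * Model of C-State coordination: the core enters the shallowest
 * (smallest-numbered) state requested by any of its threads.
 * Requires n_threads >= 1.
 */
unsigned resolve_core_cstate(const unsigned *thread_requests, size_t n_threads)
{
    unsigned resolved = thread_requests[0];
    for (size_t i = 1; i < n_threads; i++) {
        if (thread_requests[i] < resolved)
            resolved = thread_requests[i];
    }
    return resolved;
}
```

The same rule applies one level up: feed it the per-core results to model what the package would be allowed to enter.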
If you aren't controlling the operating system, then it's really a
moot point (since MWAIT is only available at CPL0). If you "own" the
operating system, then it will almost always make sense to use MWAIT
instead of HLT, since it results in much higher power savings in many
cases and provides access to the same idle power state that HLT does.
------
For performance, what matters most is the time it takes for the CPU to come out of its "waiting" state when whatever it is waiting for (an IRQ for HLT, or either an IRQ or a memory write for MWAIT) occurs. This affects latency - e.g. how long it will take before an interrupt handler is started or before a task switch actually occurs. The time taken for a CPU to come out of its waiting state is different for different CPUs, and may also be slightly different for HLT and MWAIT on the same CPU.
The same applies to power consumption - power consumed while waiting can vary a lot between different CPUs (especially when you start thinking about things like hyper-threading); and power consumption of HLT vs. MWAIT may also be slightly different on the same CPU.
For usage, they're intended for different situations. HLT is for waiting for an IRQ, while MWAIT is for waiting for a memory write to occur. Of course if you're waiting for a memory write to occur then you need to decide whether IRQs should interrupt your waiting or not (e.g. you can do CLI then MWAIT if you only want to wait for a memory write).
However, in multi-tasking systems they're mostly both used for the same thing - in the scheduler, when the CPU is idle. Before MONITOR/MWAIT was introduced, schedulers would use HLT while waiting for work to do (to reduce power consumption a little). This means that if another CPU unblocks a task, it can't just put that task into the scheduler's queue; it also has to send a (relatively expensive) "inter-processor interrupt" to the HLTed CPU to knock it out of its HLT state (otherwise that CPU will keep doing nothing when there's work it can/should do). With MWAIT, this "inter-processor interrupt" is (potentially) unnecessary - you can set MONITOR to watch for writes to the scheduler's queue, so that the act of putting the task onto the queue is enough to cause a waiting CPU to stop waiting.
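Here is a minimal sketch of that idle-loop pattern, assuming the scheduler exposes a single "work pending" word that other CPUs write when they queue a task. All names (`wait_ops`, `idle_until_work`, the `stub_*` helpers) are hypothetical. The MONITOR/MWAIT wrappers use GCC/Clang inline assembly and will only work at CPL0, so the loop takes them through function pointers; the stubs let the logic be exercised in user space by pretending another CPU wrote to the monitored word.

```c
#include <stdint.h>

/* Hooks for the wait primitives (hypothetical abstraction for this sketch). */
struct wait_ops {
    void (*monitor)(volatile void *addr);
    void (*mwait)(uint32_t hint);
};

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Real instructions: these fault unless executed at CPL0. */
static void hw_monitor(volatile void *addr)
{
    /* MONITOR: EAX/RAX = address to watch, ECX = extensions, EDX = hints */
    __asm__ volatile("monitor" :: "a"(addr), "c"(0), "d"(0) : "memory");
}

static void hw_mwait(uint32_t hint)
{
    /* MWAIT: EAX = C-State hint, ECX = extensions */
    __asm__ volatile("mwait" :: "a"(hint), "c"(0) : "memory");
}

const struct wait_ops hw_wait_ops = { hw_monitor, hw_mwait };
#endif

/* User-space stand-ins: the "wake-up" is simulated by writing the word. */
static volatile uint32_t *stub_armed;
static int stub_mwait_calls;

static void stub_monitor(volatile void *addr)
{
    stub_armed = (volatile uint32_t *)addr;
}

static void stub_mwait(uint32_t hint)
{
    (void)hint;
    stub_mwait_calls++;
    *stub_armed = 1;            /* pretend another CPU queued a task */
}

const struct wait_ops stub_wait_ops = { stub_monitor, stub_mwait };

/* Wait until *pending becomes nonzero. */
void idle_until_work(volatile uint32_t *pending,
                     const struct wait_ops *ops, uint32_t hint)
{
    while (*pending == 0) {
        ops->monitor(pending);  /* arm the monitor on the queue word */
        if (*pending != 0)      /* re-check: a write may have landed before arming */
            break;
        ops->mwait(hint);       /* sleep until a write, an IRQ, or another break event */
    }
}
```

The re-check between MONITOR and MWAIT is the important part: without it, a write that lands just before the monitor is armed would be missed. MWAIT can also wake for reasons other than the write, which is why the condition is tested again in a loop.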
There has also been some research into using MONITOR/MWAIT for things like spinlocks and synchronisation (e.g. waiting for a contended lock to be released). The end result of this research is that the time it takes for the CPU to come out of its "waiting" state is too high and using MONITOR/MWAIT like this causes too much performance loss (unless there are design flaws - e.g. using a spinlock when you should be using a mutex).
I can't think of any other reason (beyond schedulers and locking/synchronisation) to use HLT or MWAIT.