分类:
2010-11-02 17:43:12
Unlike the front-side bus, CSI is a cleanly defined, layered network fabric used to communicate between various agents. These ‘agents’ may be microprocessors, coprocessors, FPGAs, chipsets, or generally any device with a CSI port. There are five distinct layers in the CSI stack, from lowest to highest: Physical, Link, Routing, Transport and Protocol [27]. Table 1 below describes the different layers and responsibilities of each layer.
While all five layers are clearly defined, they are not all necessary. For example, the routing layer is optional in less complex systems, such as a desktop, where there are only two CSI agents (the MPU and chipset). Similarly, in situations where all CSI agents are directly connected, the transport layer is redundant, as end-to-end reliability is equivalent to link layer reliability.
CSI is defined as a variable width, point to point, packet-based interface implemented as two uni-directional links with low-voltage differential signaling. A full width CSI link is physically configured with 20 bit lanes in each direction; these bit lanes are divided into four quadrants of 5 bit lanes, as depicted in Figure 1 [25]. While most CSI links are full width, half width (2 quadrants) and quarter width (a single quadrant) options are also possible. Reduced width links will likely be used for connecting MPUs and chipset components. Additionally, some CSI ports can be bifurcated, so that they can connect to two different agents (for example, so that an I/O hub can directly connect to two different MPUs) [25]. The width of the link determines the physical unit of information transfer, or phit, which can be 5, 10 or 20 bits.
In order to accommodate various link widths (and hence phit sizes) and bit orderings, each nibble of output is muxed on-chip before being transmitted across the physical transmission pins and the inverse is done on the receive side [25]. The nibble muxing eliminates trace length mismatches which reduces skew and improves performance. To support port bifurcation efficiently, the bit lanes are swizzled to avoid excessive wire crossings which would require additional layers in motherboards. Together these two techniques permit a CSI port to reverse the pins (i.e. send output for pin 0 to pin 19, etc.), which is needed when the processor sockets are mounted on both sides of a motherboard.
CSI is largely defined in a way that does not require a particular clocking mechanism for the physical layer. This is essential to balance current latency requirements, which tend to favor parallel interfaces, against future scalability, which requires truly serial technology. Clock encoding and clock and data recovery are prerequisites for optical interconnects, which will eventually be used to overcome the limitations of copper. By specifying CSI in an expansive fashion, the architects created a protocol stack that can naturally be extended from a parallel implementation over copper to optical communication.
Initial implementations appear to use clock forwarding, probably with one clock lane per quadrant to reduce skew and enable certain power saving techniques [16] [19] [27]. While some documents reference a single clock lane for the entire link, this seems unlikely as it would require much tighter skew margins between different data lanes. This would result in more restrictive board design rules and more expensive motherboards.
When a CSI link first boots up, it goes through a handshake based physical layer calibration and training process [14] [15]. Initially, the link is treated like a collection of independent serial lanes. Then the transmitter sends several specially designed phit patterns that will determine and communicate back the intra-lane skew and detect any lane failures. This information is used to train the receiver circuitry to compensate for skew between the different lanes that may arise due to different trace lengths, and process, temperature and voltage variation. Once the link has been trained, it then begins to operate as a parallel interface, and the circuitry used for training is shut down to save power. The link and any de-skewing circuitry will also be periodically recalibrated, based on a timing counter; according to one patent this counter triggers every 1-10ms [13]. When the retraining occurs, all higher level functionality including flow control and data transmission is temporarily halted. This skew compensation enables motherboard designs with less restrictive design rules for trace length matching, which are less expensive as a result.
It appears some variants of CSI can designate data lanes as alternate clocking lanes, in case of a clock failure [16]. In that situation, the transmitter and receiver would then disable the alternate clock lane and probably that lane’s whole quadrant. The link would then re-initialize at reduced width, using the alternate clock lane for clock forwarding, albeit with reduced data bandwidth. The advantage is that clock failures are no longer fatal, and gracefully degrade service in the same manner as a data lane failure, which can be handled through virtualization techniques in the link layer.
Initial CSI implementations in Intel’s 65nm and 45nm high performance CMOS processes target 4.8-6.4GT/s operation, thus providing 12-16GB/s of bandwidth in each direction and 24-32GB/s for each link [30] [33]. Compared to the parallel P4 bus, CSI uses vastly fewer pins running at much higher data rates, which not only simplifies board routing, but also makes more CPU pins available for power and ground.