分类:
2009-03-26 18:21:47
Node Splitting
In order to add a new entry to a full node containing M entries, it is necessary to divide the collection of M+1 entries between two nodes. The division should be done in a way that makes it as unlikely as possible that both new nodes will need to be examined on subsequent searches. Since the decision whether to visit a node depends on whether its covering rectangle overlaps the search area, the total area of the two covering rectangles after a split should be minimized. Figure 3.1 illustrates this point. The area of the covering rectangles in the “bad split” case is much larger than in the “good split” case.
The same criterion was used in procedure ChooseLeaf to decide where to insert a new index entry at each level in the tree, the subtree chosen was the one whose covering rectangle would have to be enlarged least.
We now turn to algorithms for partitioning the set of M + 1 entries into two groups, one for each new node.
节点分裂:
为了向满节点插入新节点,所以需要将M+1个节点分开到两个节点对于节点分裂要尽量采取不会在接下来的查询时再将两个节点都查询一次。对于决定是否去访问一个节点取决与它的范围矩形是否覆盖到了要查询的区域,所以两个节点的覆盖矩形的覆盖区域要保证最小化,根据图3.1 说明,差的分裂方法比好的分裂方法形成的范围矩形浪费的空间多很多。
同样的规则也被用在ChooseLeaf过程中,在这个过程中要决定树的每一层中哪个地方可以插入新的索引记录,而对于子树的选择方法就是选择谁的范围矩形可以被最少的扩充。
我们将讨论几个算法来将M+1个记录分裂到两个组中。
3.5.2 A Quadratic-Cost Algorithm
This algorithm attempts to find a small-area split, but is not guaranteed to find one with the smallest area possible. The cost is quadratic in M and liner in the number of dimensions. The algorithm picks two of the M+1 entries to be the first elements of the two new groups by choosing the pair that would waste the most area if both were put in the same group, i.e. the area of a rectangle covering both entries, minus the areas of the entries themselves, would be greatest. The remaining entries are then assigned to groups one at a time. At each step the area expansion required to add each remaining entry to each group is calculated, and the entry assigned is the one showing the greatest difference between the two groups.
这个算法的目的是找一个比较小的,但不追求最小覆盖区域。而这个算法的花费就是M的平方大小以及维数的线性大小。从M+1个记录中选取两个记录放置到两个新组中,成为其首个元素,而对于这两个元素的选择方法是如果将这两个元素放到一组中将会浪费最大的空间。
Algorithm Quadratic Split. Divide a set of M+1 index entries into two groups.
(1) [Pick first entry for each group] Apply Algorithm PickSeeds to choose two entries to be the first elements for the groups. Assign each to a group.
(2) [Check if done] If all entries have been assigned, stop. If one group has so few entries that all the rest must be assigned to it in order for it to have the minimum number m, assign them and stop.
(3) [Select entry to assign] Invoke Algorithm PickNext to choose the next entry to assign. Add it to the group whose covering rectangle will have to be enlarged least to accommodate it. Resolve ties by adding the entry to the group with smaller area, then to the one with fewer entries, then to either. Repeat form (2)
平方分裂法:将M+1个索引记录分裂到两组中
(1) 【为每一组选择第一个记录】调用PickSeeds算法为两组选择两个记录
(2) 【检查是否完成】如果所有的记录都被处理完,则结束。如果一个组有很少的记录从而导致剩下的都需要被分配到这个组中,因为要符合m的最小值,分配后停止
(3) 【选择下次要分配的记录】调用PickNext算法选择下一个要分配的记录。将其加到一个组中(此组加进记录后范围矩形扩张的最小),通过不断选择区域可能增加比较少的组来满足R树所需要的关系,接着再选择较少记录的组来添加,不断重复,到(2)
Algorithm PickSeeds: Select two entries to be the first elements of the (two) groups
PickSeeds算法:为两组的首元素挑选两个索引记录
(1) [Calculate inefficiency of grouping entries together] For each pair of entries E1 and E2 compares a rectangle J including E1I and E2I. Calculate d = area(J) – area(E1I) – area(E2I)
(2) [Choose the most wasteful pair] Choose the pair with the largest d.
(1) 【计算无效性】对于一个包含E1I 和 E2I的矩形J,计算剩余面积d = area(J) – area(E1I) – area(E2I)
(2) 【选择最耗费的一对】选择d最大的一对
Algorithm PickNext: Select one remaining entry for classification in a group
算法PickNext:为一个组选择其他余下的记录
(1) [Determine cost of putting each entry in each group] For each entry E not yet in a group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E I. Calculate d2 similarly for Group 2.
(2) [Find entry with greatest preference for one group] Choose any entry with the maximum difference between d1 and d2。
(1) 【测算每一个记录放置到组中的面积】对于每个未放置到组中的记录,计算当每个组加入当前记录后的数值为d1,d2
(2) 【为每个组选择最适宜的记录】 选择在d1和d2之间的最大的记录
3.5.2 A Liner-Cost Algorithm
This algorithm is liner in M and in the number of dimensions Linear Split is identical to Quadratic Split, but uses a different version of PickSeeds. PickNext simply chooses any of the remaining entries.
线性耗费算法:
这个算法的耗费是在M的线性时间上,对于维度上的分裂时间与平方分裂相同,但是采用了一个不同版本的分裂,而PickNext只是简单的从余下的及记录选择
Algorithm LinerPickSeeds: Select two entries to be the first elements of the groups
LinerPickSeeds算法:为两组的首元素挑选两个索引记录
(1) [Find extreme rectangles along all dimensions] Along each dimension, find the entry whose rectangle has the highest low side, and the one with the lowest high side. Record the separation.
(2) [Adjust for shape of the rectangle cluster] Normalize the separations by dividing by the width of the entire set along the corresponding dimension
(3) [Select the most extreme pair] Choose the pair with the greatest normalized separation along any dimension.
(1) 【从所有维度上找寻不寻常的矩形】在每一维度上,找出在当前维度上每个记录的范围矩形中,谁的宽最长,谁的长最短。并记录这个差异
(2) 【为矩形簇的图形进行调整】对于整个集合的不同维度上,通过宽度来进行规范化
(3) 【选择最不同的一对记录】找出在同一纬度上区别最大的一对