分类: Mysql/postgreSQL
2008-07-10 14:01:37
Frequently when using SQL you want data from more than one table at a time. You do this by "joining" the tables in various ways.
For this section, we will be using two tables in our examples, called a and b. Here is what they look like:
a | b | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Note that in the below examples we use SELECT *
for
simplicity, but SELECT *should never be used in Bugzilla code..
The least common (but also least understood) type of JOIN is the "cross join," also called the "cross-product join."
A cross join means, "return all possible combinations of rows in these two tables."
In SQL, there are two ways to do a cross join of tables a and b:
SELECT * FROM a, b
or SELECT * FROM a CROSS JOIN b
The two pieces of code above are identical(等同的). The first one is called an "implicit(绝对的)" join (because the database just "figures out" that we mean CROSS JOIN, from the comma), and the second is called an "explicit(详述的,直言的)" join (because we specified exactly what we want).
If we run the SQL above with our a and b tables, the result looks like:
a.user_id | a.user_name | b.user_id | b.email |
---|---|---|---|
1 | jdoe | 1 | jdoe@example.com |
1 | jdoe | 2 | bsmith@example.com |
1 | jdoe | 2 | bsmith@e2.com |
2 | bsmith | 1 | jdoe@example.com |
2 | bsmith | 2 | bsmith@example.com |
2 | bsmith | 2 | bsmith@e2.com |
3 | mkanat | 1 | jdoe@example.com |
3 | mkanat | 2 | bsmith@example.com |
3 | mkanat | 2 | bsmith@e2.com |
In general, you want to avoid cross joins unless you are certain they are exactly what you need.
An INNER JOIN is one where you try to actually join two tables together based on a column that they have in common. This is the most common type of join.
For our a and b tables, we have two ways of writing an INNER JOIN:
SELECT * FROM a INNER JOIN b ON a.user_id = b.user_id
or
SELECT * FROM a, b WHERE a.user_id = b.user_id
Just like with the cross join, the first version is an "explcit" join, and the second one is an "implicit" join.
The inner join above looks like this:
user_id | user_name | |
---|---|---|
1 | jdoe | jdoe@example.com |
2 | bsmith | bsmith@example.com |
2 | bsmith | bsmith@e2.com |
Note that "mkanat" (with user_id 3) is left out, because he doesn't have an entry in both tables.
A LEFT JOIN is like an INNER JOIN, but it includes records that have an entry in a but no corresponding entry in b. That is, where "mkanat" was left out in our INNER JOIN above, he would be included in our LEFT JOIN.
A LEFT JOIN is normally only done one way:
SELECT * FROM a LEFT JOIN b ON a.user_id = b.user_id
The result of that code looks like this:
user_id | user_name | |
---|---|---|
1 | jdoe | jdoe@example.com |
2 | bsmith | bsmith@example.com |
2 | bsmith | bsmith@e2.com |
3 | mkanat | NULL |
There was no "email" for "mkanat," so the database
returns NULL
there.
It's called a "left" join because we include entries from the table on the left (table a, note that it's on the left side of the words "LEFT JOIN") that don't have a match in the table on the right side.
Warning: If we add a WHERE email = ?
to
the above SQL statement, the row containing mkanat will
never be returned. This is because NULL
is never equal to anything.
Indexes speed up SELECT statements.
Normally, if you have a table of 10,000 rows, and you want row number 8000, the database has to go through all 10,000 rows to find your row. That's called a "table scan," and it's usually slow.
With an index, you ask it for "row 8000" and it gives it to you instantly.
OK, so in reality you wouldn't be asking for "row 8000," you'd
be doing something like WHERE bugs.bug_id = 8000
in your SQL.
In order for the database to do that quickly, it needs an index on the
bugs
table, on the bug_id
column.
Generally, you should add an index any time you plan to use a column in a WHERE clause, or in the "ON" part of a JOIN statement.
Adding too many indexes to a table will slow down INSERT and UPDATE statements, so don't add indexes you don't need.
Many databases (MySQL in particular) will only use one index per table per query.
That is, imagine that we do: SELECT * FROM bugs WHERE priority = 'P1'
OR op_sys = 'Windows'
. We have two indexes on the bugs table, one
for "priority" and one for "op_sys." Only one
of them will be used. The one that's used is up to the database,
it will try to pick the "best" one to use, and then do a table
scan for the other one.
If we want that to be fast, we can create a single index that
contains the values for both priority
and
op_sys
. That is called a multi-column index.
Multi-column indexes are always in a certain order. For example, either "op_sys, priority" or "priority, op_sys." For your purposes, those two indexes are totally identical, with one exception: The first column of a multi-column index can be used like it's a single-column index. So, for example, having an index on both "priority, op_sys" and "priority" would be redundant. Don't do it.
When writing a query, give a brief thought to how it will use indexes. Don't try to solve performance problems before you know they exist (remember our rules from the top of the Developer's Guide), but do give a brief thought to it, particularly the fact that only one index will be used per table, per query, by common databases.
Different databases support different things, and different databases
use different syntax for similar features. Bugzilla->dbh
provides many functions that will generate correct SQL for the database that
Bugzilla is using. These functions all have names that start with
sql_
.
Here are the MySQL constructs that must be replaced with Bugzilla::DB
sql_
functions:
LOCK TABLES / UNLOCK TABLES
(use bz_lock_tables
and bz_unlock_tables
instead)BEGIN/START [TRANSACTION] / COMMIT [TRANSACTION]
REGEXP / NOT REGEXP
LIMIT
TO_DAYS
DATE_FORMAT
INTERVAL
CONCAT
INSTR / POSITION
GROUP BY
MATCH
or anything having to do with "fulltext"
indexes.Also, if you absolutely need to do a case-insensitive string comparison,
use sql_istrcmp
instead of =
or other
operators.
The following SQL constructs should never be used. Replace them with something else instead:
ENCRYPT()
- use Perl's
encrypt
function instead.AND
,
OR
and NOT
on
numbers)REPLACE INTO
- instead use a check if the
row is there, and then either an INSERT or an UPDATE.$dbh->quote()
.IFNULL
(use COALESCE
)ENUM
or TIMESTAMP
.SELECT (column IS NOT NULL) FROM b
will not work. (Use CASE WHEN column IS NOT NULL THEN 1 ELSE
0 END
instead.)WHERE
statements without an operator. For example,
SELECT column FROM b WHERE a
will not work.
That needs to have: WHERE a = 1
.IF
- use CASE WHEN
instead.INSERT
statements with multiple groups of parentheses
after VALUES. Instead, just call INSERT multiple times.The following query features are not currently available in all databases that Bugzilla supports, and so should be avoided:
START TRANSACTION
or COMMIT
)SELECT * FROM (SELECT * FROM b)
)In general, implicit INNER JOINs (FROM a, b WHERE
a.user_id = b.user_id
) are faster than explicit INNER JOINs
(FROM a INNER JOIN b ON a.user_id = b.user_id
). This is
because explicit INNER JOINs bypass the query optimizer on some
databases and specify that the database must use that table
order.
However, you can't mix explicit and implicit joins. It works on older MySQL versions (before 5.0), but that's it. So, if you have any LEFT JOIN statements in your SQL, you cannot have any "comma" joins.
In general, the performance penalty for explicit INNER JOINs should only happen on older databases (PostgreSQL before 7.4, for example), so don't worry about it too much. When in doubt, do an explicit join.
Bugzilla uses a capitalisation convention for SQL. SQL keywords such as
SELECT
, WHERE
and
AND
should be all upper case. Even
function names like NOW()
should be upper-case.
Database, table and field names should be all lower case, for example:
SELECT fieldname1 FROM tablename WHERE fieldname2 = 2
AND fieldname3 = 'HELLO' AND time = NOW()
SELECT DISTINCT groups.id, groups.name, groups.description
FROM groups
CROSS JOIN user_group_map
CROSS JOIN group_group_map AS ggm
WHERE user_group_map.user_id = ?
AND ((user_group_map.isbless = 1
AND groups.id = user_group_map.group_id)
OR (groups.id = ggm.grantor_id
AND ggm.grant_type = ?
AND ggm.member_id IN(1, 2, 3))))
ORDER BY groups.id
Note how all the major keywords are right-aligned. This makes it very easy to scan through the statement. Also, look at how the WHERE clause is aligned--in such a way as to show the logical groupings, but without a lot of extra spacing.
Also, because it was such a complex query, we put the table name at the beginning of each column name. This makes it much easier to see where data is coming from, even if it's not always necessary. (For simple queries that query only one table, don't do that. It's obvious where the data is coming from.)