Category: Servers and Storage

2014-01-02 08:43:23

Date: 	Thu, 22 Nov 2001 14:21:15 -0800 (PST)
From: Linus Torvalds 
Newsgroups: fa.linux.kernel

On Thu, 22 Nov 2001, Leif Sawyer wrote:
>
> adding the 'pci=biosirq' to my kernel boot command line causes an oops:

Well, you seem to have a buggered BIOS - the oops is actually in the BIOS
segment, and the BIOS appears to try to re-load the ES segment register
with some strange non-existing segment.

Your BIOS PCI irq routing routines probably only work in real-mode or
something like that.

This is the reason Linux avoids BIOS calls like the plague, and why you
have to ask for them explicitly - the likelihood of any random BIOS being
broken is actually rather high. That's probably because

 - the BIOS is written mostly in assembly
 - the BIOS is tested exclusively with DOS and Windows
 - most BIOS writers appear to simply be incompetent, or just not care.

Not a good combination, in short.

I'd love to just remove the support for BIOS calls entirely, but for every
ten broken machines there is one machine that actually works, so..

		Linus




From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] resource/x86: add sticky resource type
Date: Thu, 28 Aug 2008 17:21:01 UTC
Message-ID: <>

On Thu, 28 Aug 2008, Ingo Molnar wrote:
>
> Does anyone have any suggestions of how to improve this some more? (or
> do it differently)

As far as I can see, THIS IS TOTALLY BROKEN.

There's a reason why we add the broken resources LATE. There's a reason we
_have_ to add them late.

Trying to come up with these braindamaged schemes to avoid doing it right
is wrong. Don't do it.

The reason? Those bogus resources that the BIOS reports are simply NOT
TRUSTWORTHY. They may actually be in all the wrong places, including
covering a resource half-way, or crossing two real resources.

Yes, it's rare. But it happens. Which is why we should not do these broken
things that _incorrectly_ assume that the BIOS resources will only ever be
totally contained within a BAR, or will totally contain one. Think about
the partial overlap case.

Guys, until you learn that the BIOS resources are _crap_ and at most
random guesses, you shouldn't touch this code! And it has nothing to do
with writing "clean" code, or making things "simple", because quite
frankly, things simply ARE NOT simple!

The fact is, the only reliable way to handle these things has _always_
been to ask the hardware first. Add the broken resources from ACPI and
other BIOS tables _later_. If they conflict, it is the ACPI/BIOS tables
that should be removed.

Why do I have to tell this to people over and over again?

		Linus
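
Editor's sketch: the ordering rule in the message above (probe the hardware first, add BIOS/ACPI ranges late, drop the firmware entry on any conflict) can be illustrated with a flat list of ranges. This is a minimal sketch under stated assumptions, not the kernel's resource tree: `struct res_map`, `add_hardware()` and `add_firmware_late()` are hypothetical names invented for the illustration. The overlap test deliberately treats "contained", "contains" and "partial overlap" the same way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's resource tree. */
enum res_source { RES_HARDWARE, RES_FIRMWARE };

struct res {
    uint64_t start;          /* first address, inclusive */
    uint64_t end;            /* last address, inclusive */
    enum res_source source;  /* who told us about this range */
};

#define MAX_RES 32

struct res_map {
    struct res entries[MAX_RES];
    size_t count;
};

/* True if the two inclusive ranges share at least one address. This covers
 * containment in either direction as well as a partial overlap. */
static bool res_overlaps(const struct res *a, const struct res *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Ranges read back from the hardware go in first and are accepted as-is
 * (an assumption of this sketch; real hardware can be buggy too). */
static bool add_hardware(struct res_map *map, uint64_t start, uint64_t end)
{
    assert(start <= end);
    if (map->count == MAX_RES)
        return false;
    map->entries[map->count++] = (struct res){ start, end, RES_HARDWARE };
    return true;
}

/* Firmware-reported ranges are added late. If one conflicts with a
 * hardware range in any way, the firmware entry is the one dropped. */
static bool add_firmware_late(struct res_map *map, uint64_t start, uint64_t end)
{
    assert(start <= end);
    struct res candidate = { start, end, RES_FIRMWARE };

    if (map->count == MAX_RES)
        return false;
    for (size_t i = 0; i < map->count; i++) {
        if (map->entries[i].source == RES_HARDWARE &&
            res_overlaps(&map->entries[i], &candidate))
            return false;   /* distrust the firmware, keep the hardware */
    }
    map->entries[map->count++] = candidate;
    return true;
}
```

With one BAR registered at 0xe0000000-0xe0ffffff, a firmware range that straddles the end of the BAR is rejected just like one fully inside it, while a firmware range elsewhere is kept so that dynamic allocations still avoid it.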


From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] x86: split e820 reserved entries record to late
Date: Thu, 28 Aug 2008 20:06:34 UTC
Message-ID: <>

On Thu, 28 Aug 2008, Yinghai Lu wrote:
>
> so could let BAR res register at first, or even pnp?

Well, I'm not sure whether PnP or e820 should be first, as long as any
"real hardware" probing takes precedence over either. I _suspect_ that
e820 is more trustworthy, which implies that PnP should probably be added
last. It would be good to have some idea what Windows does, since usually
all the firmware bugs are essentially hidden by whatever that other OS
happens to do.

The basic rule really should be: "What do we trust most?" and probe things
in that order.

So e820 is fairly trustworthy, but we know that it will have various
random things marked as reserved because they are special in some way (but
we don't know _how_ they are special - they may well be real BAR's that
just have a fixed meaning to ACPI or whatever).

But we obviously trust _part_ of it (the RAM stuff) more than we trust
other parts. So it does make sense to consider that separately.

PnP I personally wouldn't trust at all, except as a way to keep dynamic
resources away from those things, which is why I'd put it last. But that's
just my personal gut feeling.

Hardware we generally trust more than any firmware, but even hardware can
have bugs. And some classes of hardware tend to be less buggy than others
(ie I'd trust some on-die APIC base pointer before I would trust a Cardbus
controller BAR, for example).

But yes, I think your patch looks like it is definitely moving in the
right direction. If this means that we can now do PCI probing without
having the BAR's move around because they also happened to be covered by
an e820 map, then that sounds like a good thing.

Of course, I bet there will be cases where this causes problems. It feels
like we have _never_ worked around some PCI BAR allocation problem without
hitting another unexpected one..

		Linus
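
Editor's sketch: the "What do we trust most?" rule amounts to sorting the information sources by trust and probing in that order. The snippet below is only an illustration; `struct probe_source`, `order_by_trust()` and the numeric trust values are assumptions made up here, chosen to match the ranking the message suggests (hardware, then e820 RAM, then e820 reserved, then PnP).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical description of one source of resource information. */
struct probe_source {
    const char *name;
    int trust;   /* higher means more trusted; the values are illustrative */
};

/* qsort comparator: most trusted source first. */
static int by_trust_desc(const void *a, const void *b)
{
    const struct probe_source *x = a;
    const struct probe_source *y = b;

    return (y->trust > x->trust) - (y->trust < x->trust);
}

/* Reorder the sources so that probing them front to back follows the
 * "most trusted first" rule. */
static void order_by_trust(struct probe_source *sources, size_t n)
{
    if (n == 0)
        return;
    assert(sources != NULL);
    qsort(sources, n, sizeof(*sources), by_trust_desc);
}
```

Because later sources are added on top of earlier ones, the least trusted source (PnP in this ranking) ends up only filling gaps that nothing more reliable has claimed.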


From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] x86: split e820 reserved entries record to late
Date: Thu, 28 Aug 2008 20:39:40 UTC
Message-ID: <>

On Thu, 28 Aug 2008, H. Peter Anvin wrote:
>
>     The sucky case, of course, would be an uninitialized BAR pointing
> into unusable address space which happens to be reserved in e820.  This
> seems very difficult to disambiguate from the above case through any
> algorithm that I can think of.

Yeah, well, the good news is that it should be fairly rare. Any sane PCI
device will come out of reset with IO and MEM disabled, and even if some
crazy BIOS enables IO/MEM on it and activates the BAR's with some random
content, I'm not seeing how that would work well with Windows either if it
really was overlapping with some critical real other piece of hardware.

So I'd _assume_ that something like that would break Windows too, and thus
not actually make it into a real product.

Maybe.

			Linus


From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.27-rc5: System boot regression caused by commit
Date: Sun, 31 Aug 2008 04:14:31 UTC
Message-ID: <>

On Sat, 30 Aug 2008, Linus Torvalds wrote:
>
> Short recap:
>
>  - we need to populate the resource map with as much possible information
>    about the system as we can..
>
>  - .. because when we assign _dynamic_ resources, we need to make sure
>    that they don't clash with random system resources that we don't really
>    otherwise have a lot of visibility into.
>
> So the resource tree is not just about resources we control, it's also
> about resources that others control(led) and we don't necessarily know a
> lot about.

Btw, this is a problem that we seldom actually have on most desktops,
because the BIOS will normally set up just about _all_ the resources, and
we seldom have to worry about anything but just enumerating them (and the
occasional buggy setup).

The problems with resource allocation mostly happen on laptops, and
especially with cardbus controllers. Now, that's obviously going away
(people mostly use USB for most things that Cardbus/PCMCIA was used for a
few years ago), but it still exists and with docking stations etc it can
actually be even worse (although that's mainly because access to docking
stations is much more limited, I suspect).

So what used to happen _all_ the time was that cardbus worked fine on 99%
of all machines, but then some machines would lock up when you inserted a
card in them, or the card just wouldn't work. And the reason was that some
stupid motherboard resource (like the ACPI sleeping registers or the LPC
control regs) were not done as a normal BAR, so the kernel wouldn't know
about them, and the BIOS didn't necessarily even list it because it never
mattered with Windows (since Windows has a different algorithm for laying
out the bus resources, and wouldn't hit the magic resource).

So this is why we populate the resources with everything we can _possibly_
try to find, including hardware-specific quirks (see things like
quirk_ali7101_acpi or all the quirk_ich4_lpc_acpi things etc) for finding
resources that aren't done by BAR's.

And the hardware quirks have generally worked pretty well. I'd love to add
some quirk for the RD790 chipset, but I'd like to know what the rules are
for it. I know we have some AMD contacts, I wonder if they could give docs
(I don't personally do NDA's, but I can do "gentleman's agreements" where
I just say I won't spread things further, as long as I can write code
based on them. I know other kernel developers do similar things).

Jordan?

			Linus
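
Editor's sketch: a hardware quirk of the kind mentioned above is essentially a table entry keyed by vendor and device ID, with a hook that finds a resource the device does not expose through a BAR and reserves it. The code below is a user-space illustration only. The IDs, the `acpi_base_reg` field, the mask and the region length are all made up; they are not the values used by `quirk_ali7101_acpi` or `quirk_ich4_lpc_acpi`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a PCI device with one non-BAR register. */
struct fake_dev {
    uint16_t vendor;
    uint16_t device;
    uint32_t acpi_base_reg;  /* made-up config register holding an I/O base */
};

struct io_range {
    uint32_t start;
    uint32_t len;
};

#define MAX_RESERVED 8
static struct io_range reserved[MAX_RESERVED];
static size_t reserved_count;

/* Record a range so that later dynamic allocation can steer around it. */
static void reserve_io(uint32_t start, uint32_t len)
{
    assert(reserved_count < MAX_RESERVED);
    reserved[reserved_count++] = (struct io_range){ start, len };
}

/* Example quirk: the low bits of the register are treated as flags, the
 * rest as the base of a 64-byte I/O window (both are assumptions). */
static void quirk_example_acpi(const struct fake_dev *dev)
{
    reserve_io(dev->acpi_base_reg & 0xff80u, 64);
}

struct quirk {
    uint16_t vendor;
    uint16_t device;
    void (*hook)(const struct fake_dev *dev);
};

static const struct quirk quirks[] = {
    { 0x1234, 0x5678, quirk_example_acpi },  /* made-up IDs */
};

/* Run every quirk that matches the device; returns how many ran. */
static unsigned run_quirks(const struct fake_dev *dev)
{
    unsigned applied = 0;

    for (size_t i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++) {
        if (quirks[i].vendor == dev->vendor &&
            quirks[i].device == dev->device) {
            quirks[i].hook(dev);
            applied++;
        }
    }
    return applied;
}
```

The point of the table is that the knowledge is per-chipset: a device that matches no entry simply gets no extra reservation, which is the situation described for the RD790 until documentation is available.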


From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.32-rc4: Reported regressions from 2.6.31
Date: Mon, 12 Oct 2009 15:28:25 UTC
Message-ID: <>

On Mon, 12 Oct 2009, David Woodhouse wrote:
>
> Well, according to the design, the IOMMU code is doing the right thing?.
>
> The theory is that the BIOS _tells_

There is no "theory". There is only crap BIOSes. Stop living in a dream
world, and stop making arguments that are only relevant in that dream
world.

> The only viable solution (short of an open source BIOS written by sober
> people)

Again, you're living in that dream world. Wake up, sheeple.

BIOS writers write crap, because it's a crap job. It's that simple. Yes,
they're probably drunk or drugged up, but they need it to deal with the
hand they have been dealt.

There are absolutely _zero_ incentives to write good firmware for any
particular device (and nothing else matters), because the projects are all
(a) largely "throw-over-the-fence-and-forget-about-it" for any particular
machine (ie any fixes are mostly relevant for the _next_ generation of
machines), and (b) they have no actual user incentives or feedback, since
nobody really "runs" the firmware anyway - it's supposed to be invisible
by design and just sets things up.

So outside of the testing that it gets (_before_ it ever hits any users
hands) there is never _ever_ going to be any QA.  Once it is in user
hands, it is what it is. Almost nobody upgrades their firmware in
practice. Yeah, geeks may do. Normal people? Not so much.

And that would _not_ change even with an open source BIOS. Why? Because
the _code_ generally isn't the problem. The problem tends to be all the
localization for any particular machine.

Even if the code was open source and perfect, we'd still have trouble with
firmware - because 99% of it is driven by tables that by design have to be
changed for each machine (yes, yes, a BIOS is easily a megabyte of
compressed code too, but a lot of it is the nice GUI for setup etc - the
actual code that runs on _bootup_ is rather small, and is all about those
tables).

We might have a few _fewer_ problems if the code was perfect, and maybe
we'd have missed this particular issue (but I doubt it, it's still a
localized table). But it wouldn't really change any of the fundamentals in
the issue.

BTW, in this case, it's not even necessarily the firmware people who are
to blame. It's a combination of (a) USB is crazy polled DMA and (b) IOMMU
is a whole new thing and (c) ACPI is too f*cking complex so nobody will
_ever_ get it right anyway, even if the firmware people didn't have
everything against them.

So your arguments about RMRR tables and "should do" are POINTLESS. Just
give it up.

The sane thing to do is to have a legacy IOMMU mapping until all devices
are initialized, so that things work _despite_ the inevitable BIOS
crapola. End of story.

So stop blaming the BIOS. We _know_ firmware is crap - there is no point
in blaming it. The response to "firmware bug" should be "oh, of course -
and our code was too fragile, since it didn't take that into account".

And stop saying these problems would magically go away with open-source
firmware. That just shows that you don't understand the realities of the
situation. Even an open-source bios would end up having buggy tables, and
even with an opensource bios, users generally wouldn't upgrade it.

"If wishes were horses, beggars would ride"

> was to quiesce the USB controllers before enabling the IOMMU.

The other solution would be to just _enable_ (and do all the setup) of the
IOMMU early. And then just leave a legacy mapping for the IOMMU, and then
_after_all_devices_are_set_up_ can you then remove the legacy mapping.

I don't much care. Just as long as the DMA works for the whole bus setup.

> The final PCI quirks are currently run from pci_init() which is a
> device_initcall(), which is too late -- in fact, it could actually be
> _after_ some of the real device drivers have taken control of the same
> hardware.

Now, this I can certainly agree 100% with. We can and should move the
final quirks up a bit. And yes, I absolutely think we should at least
_guarantee_ that those quirks are run before any other device_initcalls,
regardless of any IOMMU issues (ie now it looks like it depends on link
ordering whether a built-in driver might run before or after pci_init()).

So I agree that we may well be able to fix this by changing the ordering.
What I disagree with is your continual "I wish things were different".
I've seen you make that "open source firmware" argument before. It's
pointless. Stop doing it.

But your patch looks fine.

That said, I think your choice of initcall is odd, even if it is
understandable. Right now we have, at least on x86:

	fs_initcall(pcibios_assign_resources);

and I assume you picked fs_initcall_sync() so that it happens after that
one. No?

Which makes sense, but at the same time, it all looks just random. And
different architectures actually do it in different places (some seem to
do it inside pcibios_init() at subsys_initcall time). So I'm not even sure
fs_initcall_sync() will do it for other architectures, although it looks
like most others do their final PCI setup _earlier_ rather than later.

I'm wondering if we should not get _rid_ of that whole pci_init() (you
renamed it, and I agree - it's clearly not really "pci_init()"), and move
it to just be the last thing that pci_subsys_init() does, or perhaps all
the way into pcibios_resource_survey().

But I guess there could be random ACPI initcalls etc involved too, and
subtle ordering constraints with _those_. And we have way too many
arch-specific details here. So your patch may be the simplest one, but I
wish we could also make some of this be less of a jungle of different
initcalls.

Basically, it seems silly to have this kind of subtlety for just the final
quirk, when the _other_ quirks are all handled by generic code in very
well-defined places.

For example, one of the effects of this mess is that as far as I can tell,
PCI hotplug (or cardbus, which is a special case of PCI hotplug) will
never run the pci_fixup_late fixups at all, even though it _will_ run the
pci_fixup_header/early/enable ones (because those are done by generic
code: pci_setup_device(), pci_device_add() and pci_enable_device()).

			Linus
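
Editor's sketch: the ordering problem discussed above comes from how initcalls are run, level by level, with link order deciding the sequence inside one level. The model below is a user-space illustration with a hypothetical subset of levels (`LEVEL_SUBSYS`, `LEVEL_FS`, `LEVEL_FS_SYNC`, `LEVEL_DEVICE`); the function names are stand-ins, and array order plays the role of link order.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of initcall levels, lowest runs first (illustrative only). */
enum initcall_level {
    LEVEL_SUBSYS,
    LEVEL_FS,
    LEVEL_FS_SYNC,
    LEVEL_DEVICE,
    LEVEL_COUNT
};

/* Steps recorded in the trace so the resulting order can be inspected. */
enum step { STEP_ASSIGN_RESOURCES, STEP_FINAL_QUIRKS, STEP_DRIVER_PROBE };

#define MAX_TRACE 16
static enum step trace[MAX_TRACE];
static size_t trace_len;

static void record(enum step s)
{
    assert(trace_len < MAX_TRACE);
    trace[trace_len++] = s;
}

/* Stand-ins for resource assignment, the final PCI quirks and a built-in
 * driver's init function. */
static void assign_resources(void) { record(STEP_ASSIGN_RESOURCES); }
static void final_quirks(void)     { record(STEP_FINAL_QUIRKS); }
static void driver_probe(void)     { record(STEP_DRIVER_PROBE); }

struct initcall {
    enum initcall_level level;
    void (*fn)(void);
};

/* Run all calls level by level; within one level, array order (standing
 * in for link order) decides which call goes first. */
static void run_initcalls(const struct initcall *calls, size_t n)
{
    for (int level = 0; level < LEVEL_COUNT; level++) {
        for (size_t i = 0; i < n; i++) {
            if ((int)calls[i].level == level)
                calls[i].fn();
        }
    }
}
```

If `final_quirks` is registered at `LEVEL_DEVICE` after `driver_probe`, the driver runs first. Moving it to `LEVEL_FS_SYNC` makes it run after `assign_resources` and before every device-level call, regardless of array order.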


From: Linus Torvalds 
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.32-rc4: Reported regressions from 2.6.31
Date: Mon, 12 Oct 2009 17:37:32 UTC
Message-ID: <>

On Mon, 12 Oct 2009, David Woodhouse wrote:
>
> > The other solution would be to just _enable_ (and do all the setup) of the
> > IOMMU early. And then just leave a legacy mapping for the IOMMU, and then
> > _after_all_devices_are_set_up_ can you then remove the legacy mapping.
>
> That involves allocating a _shitload_ of page tables for a 1:1 mapping
> of all of physical memory.

I don't think that's true.

Nobody cares about "all physical memory". For one thing, we know that
anything that we consider to be normal memory (ie it's listed in the e820
tables as RAM) can't be all that interesting, since the BIOS definitely
released that to us.

That said, as long as the IOMMU is clearly enabled after the quirks have
run, for this particular case we don't much care.

But I could also imagine something similar happening for some BIOS-enabled
ethernet device being set up to listen to packets into some BIOS data
areas (left-overs from whatever bootp thing or other), which doesn't have
a quirk, and which ends up doing DMA until we actually load the driver.

Of course, we'd hope that the DMA just fails and nothing bad really
happens (hopefully the driver re-init will clear up any hung device). But
I can also imagine the hardware simply being really really unhappy, and
not recovering.

So in many ways it would be safest to leave memory that we don't know
about and we don't own as DMA'able in the IOMMU.

And no, I don't think it would be a shitload of pages. Quite the reverse.
It's probably not very many at all.

		Linus
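
Editor's sketch: the claim that a legacy mapping need not cover all of physical memory can be checked with simple arithmetic over a memory map. The code below counts the 4 KiB pages spanned by every non-RAM entry, which is the memory "we don't know about and we don't own". The types, the function names and the example map are assumptions for illustration; this is not the e820 or IOMMU code, and pages shared by two adjacent entries would be counted twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12

/* Hypothetical, simplified memory-map entry. */
enum region_type { REGION_RAM, REGION_RESERVED, REGION_ACPI };

struct region {
    uint64_t base;
    uint64_t length;
    enum region_type type;
};

/* Number of 4 KiB pages touched by [base, base + length). */
static uint64_t pages_spanned(uint64_t base, uint64_t length)
{
    if (length == 0)
        return 0;

    uint64_t first = base >> PAGE_SHIFT_4K;
    uint64_t last = (base + length - 1) >> PAGE_SHIFT_4K;

    return last - first + 1;
}

/* Pages a 1:1 legacy mapping would have to cover if RAM that the
 * firmware handed over to the OS is left out. */
static uint64_t legacy_map_pages(const struct region *map, size_t n)
{
    uint64_t total = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].type != REGION_RAM)
            total += pages_spanned(map[i].base, map[i].length);
    }
    return total;
}
```

For a made-up map with roughly 2 GiB of RAM, one reserved 4 KiB page, a 64 KiB reserved area and a 1 MiB ACPI area, the result is 273 pages, which is small compared with the roughly half a million pages of RAM in the same map.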