Why we have to wait for Android on the Neo 1973

ARMv4 vs. ARMv5

So, it turns out that my hello binary (and all the Android binaries) are compiled for an ARM926EJ-S chip. This is a problem because the Neo 1973 has an ARM920T core. Now you would think that ARM926 and ARM920 would be pretty close. But if you thought that you would, unfortunately, be wrong, wrong, wrong! The ARM926EJ-S implements the ARMv5TEJ instruction set, but the ARM920T implements the ARMv4T instruction set. So what happens in my hello program is that we hit an ARMv5 instruction, which is undefined in the earlier ARMv4 ISA. This generates an undefined instruction trap to the kernel, and the kernel responds by sending SIGILL to the running process. Assuming that the program hasn't installed any special signal handlers, this will kill the process. And this is what was happening to my hello program, and what I assumed was happening to init as well. (Of course, assumptions make an ass out of u and me, or in this case, mostly me.)

Now I really wasn't going to let a pesky little thing such as the CPU not implementing the instructions stand in my way! (Note: I could of course have compiled hello for the ARMv4 architecture, but that isn't an option for the rest of the stack, and I was only interested in getting hello running so I could get the rest of the stack running.) So, in an act of stupid defiance, I decided that if the CPU can't implement the instruction, I'll do it myself.

Luckily the kernel provides a neat infrastructure for managing undefined instructions, and even emulating them. So the first instruction to emulate was the ARM clz instruction. This is the instruction that counts the number of leading zero bits. The code below implements this. The only other thing to do is ensure that this hook is registered at startup using: register_undef_hook(&clz_hook);

static int clz_trap(struct pt_regs *regs, unsigned int instr)
{
  /* Extract the source register index */
  int src = instr & 0xf;
  /* Extract the destination (result) register index */
  int dst = (instr >> 12) & 0xf;
  /* Extract the conditional code */
  int cc = (instr >> 28) & 0xf;
  /* Test if the conditional code passes */
  if (handle_cc(regs, cc)) {
      /* Implement the instruction */
      regs->uregs[dst] = 32 - fls(regs->uregs[src]);
  }
  /* Print some stuff for debugging */
  printk("emulating clz: %x src=%d (%lx) dst=%d (%lx) @ %p\n", instr, 
	 src, regs->uregs[src], dst, regs->uregs[dst], (void*) regs->ARM_pc);
  /* Increment the PC register */
  regs->ARM_pc += 4;
  
  return 0;
}

static struct undef_hook clz_hook = {
	.instr_mask	= 0x0fff0ff0,
	.instr_val	= 0x016f0f10,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= clz_trap,
};
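To sanity-check the arithmetic and the mask/value pattern outside the kernel, here is a minimal user-space sketch. The names fls32, is_clz and emulate_clz are made up for this illustration (they are not kernel APIs), and fls32 is only a stand-in that mimics the kernel's fls(), which returns the 1-based position of the highest set bit, or 0 when the input is 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, or 0 when no bit is set. */
static int fls32(uint32_t x)
{
    int n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/* The same comparison the undef hook describes: mask out the condition,
 * Rd and Rm fields, then compare against the CLZ bit pattern. */
static int is_clz(uint32_t instr)
{
    return (instr & 0x0fff0ff0u) == 0x016f0f10u;
}

/* Emulate "clz Rd, Rm" on a plain array of 16 registers. Illustration
 * only: the condition field is assumed to have passed already. */
static void emulate_clz(uint32_t regs[16], uint32_t instr)
{
    unsigned rm = instr & 0xf;          /* source register index */
    unsigned rd = (instr >> 12) & 0xf;  /* destination register index */

    assert(is_clz(instr));
    regs[rd] = 32u - (uint32_t)fls32(regs[rm]);
}
```

For example, 0xe16f0f11 is "clz r0, r1"; with r1 holding 1 the sketch leaves 31 in r0, and with r1 holding 0 it leaves 32.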

One thing that may not be clear from the comments is that ARM supports conditionally executed instructions. The top 4 bits of the instruction are its condition field. Depending on the condition field, and the value of the N, Z, C and V flags (which are stored in the CPSR register), the instruction may or may not be executed. This is used to avoid having to branch for all if statements and the associated problems... but you didn't come here for an introduction to computer architecture. To correctly implement this, some code is needed, and I clag it here for posterity.

/* Return true if conditional code should be executed */
static int handle_cc(struct pt_regs *regs, int cc)
{
  int doit = 0;
  int cpsr = regs->ARM_cpsr;

  int n = (cpsr >> 31) & 1;
  int z = (cpsr >> 30) & 1;
  int c = (cpsr >> 29) & 1;
  int v = (cpsr >> 28) & 1;

  switch (cc) {
  case 0:
    doit = z;
    break;
  case 1:
    doit = !z;
    break;
  case 2:
    doit = c;
    break;
  case 3:
    doit = !c;
    break;
  case 4:
    doit = n;
    break;
  case 5:
    doit = !n;
    break;
  case 6:
    doit = v;
    break;
  case 7:
    doit = !v;
    break;
  case 8:
    doit = c && !z;
    break;
  case 9:
    doit = !c || z;
    break;
  case 10:
    doit = (n == v);
    break;
  case 11:
    doit = (n != v);
    break;
  case 12:
    doit = ((z == 0) && (n == v));
    break;
  case 13:
    doit = ((z == 1) || (n != v));
    break;
  case 14:
    doit = 1;
    break;
  case 15:
    doit = 0;
    break;
  default:
    printk("Error, should not get here!\n");
  }
  return doit;
}
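The same sixteen rules can also be written as a lookup table, which makes them easier to check by eye. This is only a sketch: cc_truth and cond_passes are names invented here, and the table values were worked out by hand from the switch statement above rather than taken from the original patch.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit truth table per condition code. Bit i of an entry says
 * whether the condition passes when the flags, packed as
 * i = N<<3 | Z<<2 | C<<1 | V, have that value. */
static const uint16_t cc_truth[16] = {
    0xf0f0, /* 0  EQ: Z            */
    0x0f0f, /* 1  NE: !Z           */
    0xcccc, /* 2  CS: C            */
    0x3333, /* 3  CC: !C           */
    0xff00, /* 4  MI: N            */
    0x00ff, /* 5  PL: !N           */
    0xaaaa, /* 6  VS: V            */
    0x5555, /* 7  VC: !V           */
    0x0c0c, /* 8  HI: C && !Z      */
    0xf3f3, /* 9  LS: !C || Z      */
    0xaa55, /* 10 GE: N == V       */
    0x55aa, /* 11 LT: N != V       */
    0x0a05, /* 12 GT: !Z && N == V */
    0xf5fa, /* 13 LE: Z || N != V  */
    0xffff, /* 14 AL: always       */
    0x0000, /* 15 NV: never        */
};

/* Return 1 if condition code cc passes for the given CPSR value. */
static int cond_passes(uint32_t cpsr, unsigned cc)
{
    unsigned nzcv = cpsr >> 28;  /* N, Z, C, V are CPSR bits 31..28 */

    assert(cc < 16);
    return (cc_truth[cc] >> nzcv) & 1;
}
```

With only the Z flag set (CPSR 0x40000000), cond_passes() accepts EQ and rejects NE; with no flags set it accepts GT, since Z is clear and N equals V.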

OK, one down. That wasn't so hard. The next one gets a little bit trickier. The compiler will use the BLX instruction if it is available. This is the Branch, Link and Exchange instruction. There are two versions of the instruction, and at this stage we only really care about version 2. In this version the address to branch to is stored in a register, and the low bit of that address indicates whether or not an exchange is required. (You can ignore exchange for now, more about that later.) This instruction is a little bit more effort to implement, but it is not too hard:

static int blxv2_trap(struct pt_regs *regs, unsigned int instr)
{
  int rm = instr & 0xf;
  int cc = (instr >> 28) & 0xf;
  /* Read the target first, in case Rm is the link register itself */
  unsigned long target = regs->uregs[rm];
  
  printk("emulate blxv2: %x rm=%d (%lx) @ %p CC: %d cc(%x) cpsr(%lx)\n", instr, rm,
	 target, (void*) regs->ARM_pc, handle_cc(regs, cc), cc, regs->ARM_cpsr);

  if (handle_cc(regs, cc)) {
    /* Update the link register with the return address */
    regs->ARM_lr = regs->ARM_pc + 4;
    /* Update the CPSR if this is an 'exchange' */
    regs->ARM_cpsr = (regs->ARM_cpsr & (~32))  | ((target & 1) << 5);
    /* Jump to the register value (bit 0 is a flag, not part of the address) */
    regs->ARM_pc = target & 0xfffffffe;
  } else {
    /* If the condition code fails, just go to the next instruction. */
    regs->ARM_pc += 4;
  }
  return 0;
}

static struct undef_hook blxv2_hook = {
	.instr_mask	= 0x0ffffff0,
	.instr_val	= 0x012fff30,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= blxv2_trap,
};
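Stripped of the kernel plumbing, the register form of blx is a three-field state update. The sketch below models it in user space; struct cpu_state and blx_reg are hypothetical names used only here, and the pc field is assumed to hold the address of the blx instruction itself.

```c
#include <assert.h>
#include <stdint.h>

#define T_BIT (1u << 5)  /* Thumb state bit in the CPSR */

/* Hypothetical miniature of the state the trap handler touches. */
struct cpu_state {
    uint32_t pc, lr, cpsr;
};

/* Apply "blx Rm" semantics. s->pc is the address of the blx itself and
 * rm_value is the value currently held in Rm. */
static void blx_reg(struct cpu_state *s, uint32_t rm_value)
{
    /* This encoding is only matched while executing ARM code. */
    assert((s->cpsr & T_BIT) == 0);

    s->lr = s->pc + 4;                                      /* next ARM instruction */
    s->cpsr = (s->cpsr & ~T_BIT) | ((rm_value & 1u) << 5);  /* low bit selects Thumb */
    s->pc = rm_value & ~1u;                                 /* drop the flag bit */
}
```

Branching from 0x8000 to a register holding 0x9001 leaves pc at 0x9000, lr at 0x8004 and the T bit set; a register holding 0x9000 gives the same pc and lr but stays in ARM mode.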

After this, success! Hello world ran correctly. Of course this emulation isn't going to be particularly fast, but it is still infinitely faster than not running at all. (Well, OK, not really, divide by zero is undefined, not infinite.) At this point we were feeling pretty good about ourselves. I must also acknowledge Carl and Matt for their assistance with this.

Thumb interworking

So now I really thought I was home free, but wrong once again. (A pattern emerging maybe?) So first a bit of a primer on ARM's Thumb mode (so punny!). ARM has two different instruction sets: the ARM instruction set and the Thumb instruction set. The Thumb instruction set is a 16-bit instruction set, which has a higher code density than the 32-bit ARM instruction set. Now the neat thing about this is that you can actually combine both ARM and Thumb instructions in the same program. So if your compiler is smart, it should be able to use both instruction sets for optimisation. The CPU knows whether code is executing in ARM or Thumb mode by a bit in the CPSR register (the T bit, bit 5). When the bit is set the instruction stream is assumed to be 16-bit Thumb instructions. Now if you are running in ARM mode, and want to enter Thumb mode, you need to do an exchange operation, which is part of the bx and blx instructions. Now it turns out that Android is compiled with Thumb mode, so this means it uses blx to switch from ARM to Thumb mode. So at this stage I ended up needing to implement the blx (version 1) instruction. This is shown below:

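static int blxv1_trap(struct pt_regs *regs, unsigned int instr)
{
  int h = (instr >> 24) & 1;
  /* Move the 24-bit immediate to the top of the word... */
  long imm = ((instr & 0xffffff) << 8);
  /* ...so that the arithmetic shift back down sign-extends it */
  imm = imm >> 8;
  imm = imm << 2;
  imm = imm | (h << 1);
  printk("emulate blxv1: %x imm=%lx @ %p\n", instr, imm, (void*) regs->ARM_pc);

  regs->ARM_lr = regs->ARM_pc + 4;
  /* This form of blx always switches to Thumb mode */
  regs->ARM_cpsr = regs->ARM_cpsr | (1 << 5);
  /* The offset is relative to the architectural PC, which reads as the
     address of this instruction plus 8 */
  regs->ARM_pc += imm + 8;
  return 0;
}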

static struct undef_hook blxv1_hook = {
	.instr_mask	= 0xfe000000,
	.instr_val	= 0xfa000000,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= 0,
	.fn		= blxv1_trap,
};
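The immediate decoding is the fiddly part, so here it is on its own as a user-space sketch. blx_imm_offset and blx_imm_target are illustrative names, not anything from the kernel. The one architectural detail to keep in mind is that the offset is added to the ARM-state PC value, which reads as the address of the instruction plus 8.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the byte offset carried by an ARM-state "blx <label>" word:
 * a signed 24-bit word offset in bits 23..0 plus a halfword bit H in
 * bit 24. */
static int32_t blx_imm_offset(uint32_t instr)
{
    int32_t imm24 = (int32_t)(instr & 0x00ffffffu);

    assert((instr & 0xfe000000u) == 0xfa000000u);  /* must be blx <label> */
    if (imm24 & 0x00800000)
        imm24 -= 0x01000000;                       /* sign-extend from 24 bits */
    return imm24 * 4 + (int32_t)((instr >> 24) & 1u) * 2;
}

/* Branch target, given the address of the blx instruction itself. */
static uint32_t blx_imm_target(uint32_t instr_addr, uint32_t instr)
{
    return instr_addr + 8u + (uint32_t)blx_imm_offset(instr);
}
```

As a check, an immediate of 0xfffffe (that is, -2 words) gives an offset of -8, so the instruction branches to itself, just as the familiar "b ." encoding 0xeafffffe does.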

Now, we get a bit further. But still no go. It turns out that Thumb also has a new BLX instruction in V5. So, we have to go through and emulate this instruction for Thumb as well. Below is the code for that.

static int blxv1_t_trap(struct pt_regs *regs, unsigned int instr)
{
  u16 offset_11 = instr & 0x7ff;
  u16 h = (instr >> 11) & 3;
  printk("blx thumb %x ofs: %x h: %d @ %p\n", 
	 instr, offset_11, h, (void*) regs->ARM_pc);
  
  if (h == 2) {
    /* First half of the pair: LR = PC + 4 + sign-extended (offset_11 << 12) */
    long imm = (long)((unsigned long)offset_11 << 21) >> 9;
    regs->ARM_lr = regs->ARM_pc + 4 + imm;
    regs->ARM_pc = regs->ARM_pc + 2;
  } else {
    /* Second half: the first half already left the high part in LR */
    long new_pc = regs->ARM_lr + (offset_11 << 1);
    /* Set the low bit to mark the return address as Thumb */
    regs->ARM_lr = (regs->ARM_pc + 2) | 1;
    regs->ARM_pc = new_pc;
    if (h == 1) {
      /* Mega hack: set the top bit, and stash bit 1 of the return
         address in bit 30 */
      regs->ARM_lr = (1 << 31) | (((regs->ARM_lr & 2) >> 1) << 30) | regs->ARM_lr;
      /* blx lands in ARM code: word-align the target and clear the T bit */
      regs->ARM_pc = regs->ARM_pc & 0xfffffffc;
      regs->ARM_cpsr = regs->ARM_cpsr & (~32);
    }
  }

  printk(" blx thumb after: pc: %lx lr: %lx cpsr: %lx\n", regs->ARM_pc, regs->ARM_lr, regs->ARM_cpsr);
  return 0;
}

static struct undef_hook blxv1_t_hook = {
	.instr_mask	= 0xe000,
	.instr_val	= 0xe000,
	.cpsr_mask	= PSR_T_BIT,
	.cpsr_val	= PSR_T_BIT,
	.fn		= blxv1_t_trap,
};
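It helps to see how the two Thumb halfwords fit together, since the trap handler only ever sees one of them at a time. The following user-space sketch combines both halves into a target address; thumb_bl_target is an invented name and the function is only an illustration of the encoding, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two halfwords of a Thumb BL/BLX pair into a branch target.
 *   first:  11110 + offset_11  (high part, sign-extended)
 *   second: 11111 + offset_11  (BL), or
 *           11101 + offset_11  (BLX, new in ARMv5T)
 * first_addr is the address of the first halfword. */
static uint32_t thumb_bl_target(uint32_t first_addr, uint16_t first,
                                uint16_t second)
{
    int32_t hi = first & 0x7ff;
    uint32_t lo = second & 0x7ffu;
    uint32_t h = (second >> 11) & 3u;
    uint32_t target;

    assert((first >> 11) == 0x1e);  /* first half must be the 11110 prefix */
    if (hi & 0x400)
        hi -= 0x800;                /* sign-extend from 11 bits */

    /* Thumb-state PC reads as the instruction address plus 4. */
    target = first_addr + 4u + (uint32_t)hi * 4096u + (lo << 1);
    if (h == 1)
        target &= ~3u;              /* BLX lands in ARM code: word-align */
    return target;
}
```

The pair 0xf7ff, 0xfffe is the usual encoding of a bl to itself, so the sketch returns the address it was given. Only the 11101 second half is undefined on ARMv4T, which is why that is the case the handler really has to get right.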

Now if you are still with me, and actually read the code, you might recognise some pretty interesting code. Spot it? No? OK, so the problem is the way in which ARM code returns to Thumb mode. The blx instruction updates the link register with the return address. In Thumb mode it also sets the lowest bit. This ensures that when bx is called from ARM mode it will jump back into Thumb mode. It turns out that having to use bx to return from functions is a bit of a pain, so in ARMv5, the architecture was updated so that if you popped values from the stack into the pc register, the CPU would also check the low bit and switch to Thumb mode if required. Unfortunately ARMv4 doesn't do this. Rather than checking the lower bit, it simply ignores it and masks it off, which means you jump back to the return address but remain in ARM mode, so you end up executing 16-bit instructions as though they were 32-bit instructions. It may not surprise you to learn that this generally doesn't work so well.

Which gets us to the truly evil code found above. As well as setting the low bit, we also go and set the top bit of the LR. When the ARM code returns from the function, rather than going to the correct location, it ends up at an unmapped location, which causes a pre-fetch abort. The prefetch abort handler was then updated to handle this error case.

asmlinkage void __exception
do_PrefetchAbort(unsigned long addr, struct pt_regs *regs)
{
  /* A set top bit marks one of our fake return addresses */
  if (addr >> 31) {
    printk("Magic prefetch abort happened on: %lx\n", addr);
    /* Strip the two tag bits and move bit 30 back down to bit 1 */
    regs->ARM_pc = (addr & 0x3ffffffe) | ((addr >> 29) & 2);
    printk("  jumping to : %lx\n", regs->ARM_pc);
    /* Enable thumb mode */
    regs->ARM_cpsr = (regs->ARM_cpsr | 32);
    return;
  }
  do_translation_fault(addr, 0, regs);
}
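To convince yourself the tagging survives the trip, the round trip can be replayed in user space. This is a sketch under two assumptions that come from the description above: real return addresses fit in the low 30 bits, and an ARMv4 pop into pc in ARM state drops the low two bits of the loaded value. All three function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Tag a Thumb return address so that returning to it faults. */
static uint32_t tag_thumb_return(uint32_t ret_addr)
{
    uint32_t lr;

    assert((ret_addr >> 30) == 0);   /* the two top bits must be free */
    lr = ret_addr | 1u;              /* usual Thumb marker */
    lr |= ((lr >> 1) & 1u) << 30;    /* save bit 1 in bit 30 */
    lr |= 1u << 31;                  /* point at an unmapped address */
    return lr;
}

/* Assumed ARMv4 behaviour: an ARM-state pop into pc keeps only a
 * word-aligned address. */
static uint32_t armv4_pop_pc(uint32_t value)
{
    return value & ~3u;
}

/* Recover the real Thumb address from the faulting address, using the
 * same expression as the abort handler. */
static uint32_t untag_fault_addr(uint32_t fault_addr)
{
    return (fault_addr & 0x3ffffffeu) | ((fault_addr >> 29) & 2u);
}
```

A return address of 0x8002 is tagged as 0xc0008003, faults at 0xc0008000 after the pop, and is recovered as 0x8002. Without the copy in bit 30, the halfword-aligned address would have come back as 0x8000.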

Now, at this stage, we have something pretty hacked up, but all these hacks are pretty solid. Unfortunately it still doesn't work. We have successfully ensured that ARM code returns correctly when called from Thumb mode; what we have failed to do is ensure that Thumb code returns correctly to ARM code. In ARMv4, this is only possible through the bx instruction, which correctly sets the Thumb bit in the CPSR. On ARMv5, the pop instruction was extended to also update the Thumb bit, and the Android code relies on that. But we aren't on an ARMv5, so the low bit is simply ignored. Which means we get stuck in Thumb mode and can't correctly return to ARM code.

The prefetch abort trick works to an extent the other way as well, i.e. for getting from Thumb back into ARM, but it relies on the ARM code using the blx instruction. Unfortunately this isn't always the case, and it is perfectly reasonable for code to use a bl followed by a bx. As none of these trap, it is not possible to put our magic fake value into the LR register.

The only other option left at this stage is some kind of code scanning technique. In this we scan the object code looking for the unsafe pop instructions, and replace them with undefined instructions so that we can safely emulate them with the ARMv5 behaviour. Unfortunately ARM makes this approach basically impossible. It is not possible to tell if any block of code is Thumb or ARM instructions. More importantly, it is impossible to determine if a random word in the text segment is actually an instruction, or is in fact a literal value. Simply scanning for pop could actually modify some constants, which would lead to potentially subtle bugs. If ARM had separate execute and read permissions we could use the MMU to distinguish between code and data, but unfortunately the ARM MMU can't really do this. Which means that this approach is basically a no-go, at least not without some pretty nasty heuristics, or some really awesome static analysis. Of course we could just emulate every instruction, but this isn't exactly appealing to me. (And the performance would really suck!)
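The ambiguity is easy to demonstrate. Below is a sketch of the kind of pattern match a naive scanner would use; the function names are invented here, and the masks are simply the standard encodings of "pop {..., pc}" in each instruction set. Nothing in the bits says whether a matching word is code or a constant that happens to have the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ARM-state "ldmia sp!, {..., pc}" (what "pop {..., pc}" assembles to),
 * with any condition code and any other registers in the list. */
static int looks_like_arm_pop_pc(uint32_t word)
{
    return (word & 0x0fff8000u) == 0x08bd8000u;
}

/* Thumb "pop {..., pc}": 1011 1101 followed by an 8-bit register list. */
static int looks_like_thumb_pop_pc(uint16_t half)
{
    return (half & 0xff00u) == 0xbd00u;
}

/* Count the words in a text segment that a naive ARM scanner would
 * rewrite. Literal pool entries are counted too: that is the problem. */
static size_t count_arm_pop_pc(const uint32_t *words, size_t n)
{
    size_t i, hits = 0;

    assert(words != NULL || n == 0);
    for (i = 0; i < n; i++)
        hits += (size_t)looks_like_arm_pop_pc(words[i]);
    return hits;
}
```

Given the three words 0xe92d4010 ("push {r4, lr}"), 0xe8bd8010 ("pop {r4, pc}") and a literal constant 0xe8bd8010, the scanner reports two hits and would happily corrupt the constant. The Thumb check has the same problem at halfword granularity, and on top of that you would need to know which instruction set each region uses before choosing a matcher.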

Conclusion

In summary, Android is compiled for ARMv5, and the Neo 1973 is ARMv4. These instruction sets are not compatible. Therefore Android will not run on the Neo 1973. Solutions to this problem would be either:

FIC releasing a version of the Neo based around an ARM926 core.
Google compiling for ARMv4 and making that available.
Google releasing the source and someone else compiling for ARMv4.

My guess is none of those three things is going to happen any time soon (although I'll be really happy to be proved wrong!), so it is better to focus on trying to get this running on an actual ARMv5 based chipset (e.g. PXA270, i.MX21).

Update: Thanks to andrzej for spotting the bug in my clz() emulation. It should of course be 32 - fls(), not fls(). This is now updated.
