
B2.2 Atomicity in the Arm architecture


B2.1 About the Arm memory model

B2.2.1 Requirements for single-copy atomicity

single-copy atomicity
• A read that is generated by a load instruction that loads a single general-purpose register and is aligned to the size of the read in the instruction is single-copy atomic. In other words, the contents of one address are loaded into a register, and the address is aligned to the size of the read.
• A write that is generated by a store instruction that stores a single general-purpose register and is aligned to the size of the write in the instruction is single-copy atomic. 
Reads that are generated by a Load Pair instruction that loads two general-purpose registers and are aligned to the size of the load to each register are treated as two single-copy atomic reads, one for each register being loaded.

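A single instruction that loads two registers is handled as two single-copy atomic accesses.

My guess at the reason: such an instruction is not interrupted by an exception on its own PE, but nothing guarantees that another processor is not accessing the same memory at the same time.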

• Writes that are generated by a Store pair instruction that stores two general-purpose registers and are aligned to the size of the store of each register are treated as two single-copy atomic writes, one for each register being stored. 
• Load-Exclusive Pair instructions of two 32-bit quantities and Store-Exclusive Pair instructions of 32-bit quantities are single-copy atomic.

A Load-Exclusive Pair that reads two 32-bit operands is single-copy atomic.

So why is a Load Pair that reads two 32-bit operands not single-copy atomic as well?

When the Store-Exclusive of a Load-Exclusive/Store-Exclusive pair instruction using two 64-bit quantities succeeds, it causes a single-copy atomic update of the entire memory location being updated.

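When a Store-Exclusive Pair on two 64-bit quantities succeeds, the update is treated as single-copy atomic.

My inference from this: single-copy atomic does not mean one physical memory access. It is enough that no other access can interfere while this instruction accesses memory.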

• Where translation table walks generate a read of a translation table entry, this read is single-copy atomic. 
For the atomicity of instruction fetches 
Reads to SIMD and floating-point registers of a single 64-bit or smaller quantity that is aligned to the size of the quantity being loaded are treated as single-copy atomic reads. (SIMD reads of 64-bit data.)
Writes from SIMD and floating-point registers of a single 64-bit or smaller quantity that is aligned to the size of the quantity being stored are treated as single-copy atomic writes. (SIMD writes of 64-bit data.)
Element or Structure Reads to SIMD and floating-point registers of 64-bit or smaller elements, where each
element is aligned to the size of the element being loaded, have each element treated as a single-copy atomic
read.
 
Element or Structure Writes from SIMD and floating-point registers of 64-bit or smaller elements, where
each element is aligned to the size of the element being stored, have each element treated as a single-copy
atomic store.
 
Reads to SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated as
a pair of single-copy atomic 64-bit reads.
 
Writes from SIMD and floating-point registers of a 128-bit value that is 64-bit aligned in memory are treated
as a pair of single-copy atomic 64-bit writes.
 

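The rules above use two notions, "single-copy atomic" and "two single-copy atomic" accesses. My guess:

None of the instructions above is interrupted by an exception, so each one always executes in full. That distinguishes them from a sequence of accesses, which can be interrupted by an exception and then re-executed. The difference between the two notions is that a single-copy atomic access is also atomic across processors, whereas with two single-copy atomic accesses another processor's operation may slip in between the two.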
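The distinction between one single-copy atomic access and two of them can be illustrated with a small interleaving model. The sketch below is a toy, not the Arm model: the helper names (`interleavings`, `outcomes`, `store`, `load`) and the two-field memory `lo`/`hi` are assumptions made for illustration, and each single-copy atomic access is modelled as one indivisible step.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(writer, reader):
    """Run every schedule and collect the (lo, hi) values the reader saw."""
    seen_values = set()
    for schedule in interleavings(writer, reader):
        mem = {"lo": 0, "hi": 0}
        regs = {}
        for step in schedule:
            step(mem, regs)
        seen_values.add((regs["lo"], regs["hi"]))
    return seen_values


def store(**fields):
    """One indivisible write of the given memory fields."""
    return lambda mem, regs: mem.update(fields)


def load(*names):
    """One indivisible read of the given memory fields into registers."""
    return lambda mem, regs: regs.update({n: mem[n] for n in names})


# One single-copy atomic access: both halves move in one step.
single = outcomes([store(lo=1, hi=1)], [load("lo", "hi")])
# A pair treated as two single-copy atomic accesses: two separate steps.
paired = outcomes([store(lo=1), store(hi=1)], [load("lo"), load("hi")])

print(sorted(single))  # [(0, 0), (1, 1)]
print(sorted(paired))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In this model the mixed results (0, 1) and (1, 0) appear only for the paired case, which matches the idea that each register's access is atomic on its own while the pair as a whole is not.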

  
All other memory accesses are regarded as streams of accesses to bytes, and no atomicity between accesses to different bytes is ensured by the architecture. That is, every other access is a stream of byte accesses, and there is no atomicity across different bytes.
If, according to these rules, an instruction is executed as a sequence of accesses, exceptions, including interrupts, can be taken during that sequence, regardless of the memory type being accessed. If any of these exceptions are returned from using their preferred return address, the instruction that generated the sequence of accesses is re-executed, and so any access performed before the exception was taken is repeated. That is, an instruction executed as a sequence of accesses can be interrupted by exceptions (including interrupts); on return from the exception the instruction is executed again.

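Open questions:

1. For "two single-copy atomic" instructions such as Load Pair and Store Pair, can an exception interrupt them, or do they always run to completion once started? My current guess is the latter.

2. For atomic instructions such as LDADD: can they be interrupted by an exception, and can they be used on Device memory? Or do they, like a Load-Exclusive/Store-Exclusive retry loop, only guarantee atomicity between processors? My current guess is the latter.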

B2.2.2 Properties of single-copy atomic accesses

what is single copy
1. For a pair of overlapping single-copy atomic store instructions, all of the overlapping writes generated by one of the stores are Coherence-after the corresponding overlapping writes generated by the other store. That is, two writes to overlapping addresses are related by Coherence-after and cannot interleave: once both writes have completed, the overlapping bytes hold the value of whichever write came last, never a mix of the two.
2. For a single-copy atomic load instruction L1 that overlaps a single-copy atomic store instruction S2, if one of the overlapping reads generated by L1 Reads-from one of the overlapping writes generated by S2, then none of the overlapping writes generated by S2 are Coherence-after the corresponding overlapping reads generated by L1.

B2.2.3 Multi-copy atomicity

Writes to a memory location are multi-copy atomic if the following conditions are both true:
• All writes to the same location are serialized, meaning they are observed in the same order by all observers, although some observers might not observe all of the writes. That is, all writes are serialized and every observer sees them in the same order.
• A read of a location does not return the value of a write until all observers observe that write. That is, at any moment all observers see a consistent value.
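The first condition (serialization) can be phrased as a check: each observer's view of one location must be a subsequence of a single common order of writes. A minimal sketch covering only that first condition; the function names and the write labels `w1` to `w3` are hypothetical.

```python
def is_subsequence(observed, total_order):
    """True if `observed` keeps the relative order of `total_order` (gaps allowed)."""
    remaining = iter(total_order)
    return all(write in remaining for write in observed)


def writes_serialized(total_order, views):
    """Check every observer's view of one location against a single write order."""
    return all(is_subsequence(view, total_order) for view in views.values())


order = ["w1", "w2", "w3"]
ok_views = {"P0": ["w1", "w2", "w3"], "P1": ["w1", "w3"]}  # P1 missed w2: allowed
bad_views = {"P0": ["w1", "w2"], "P1": ["w2", "w1"]}       # P1 disagrees on order

print(writes_serialized(order, ok_views))   # True
print(writes_serialized(order, bad_views))  # False
```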
  
  
  

B2.2.4 Requirements for multi-copy atomicity

For Normal memory, writes are not required to be multi-copy atomic 
For Device memory, writes are not required to be multi-copy atomic. Why is Device memory not multi-copy atomic? Shouldn't Device-nGnRnE memory be multi-copy atomic? Or does the text only mean that it is not required (because Device types with G, R, or E also exist)?

 

B2.2.5 Concurrent modification and execution of instructions

Concurrent modification and execution of instructions can lead to the resulting instruction performing any behavior that can be achieved by executing any sequence of instructions that can be executed from the same Exception level, except where each of the instruction before modification and the instruction after modification is one of a B, BL, BRK, HVC, ISB, NOP, SMC, or SVC instruction. If the instruction before and after the modification is each one of B, BL, BRK, HVC, ISB, NOP, SMC, or SVC, the manual says no synchronization is needed. That does not guarantee that the modified instruction is the one executed; the PE may still execute the original instruction.

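This section covers what to do when one core modifies instructions that another core executes.

Conclusion: synchronization is required.

Why must other instructions always be synchronized?

The reason is dependencies and out-of-order execution. Suppose we have:

mov r1, #0

mov r2, r3

and the first instruction is modified to

mov r3, #0

Then the following mov r2, r3 depends on the modified instruction. Instructions with a dependency have an execution-order requirement, so synchronization is mandatory.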

B2.2.6 Possible implementation restrictions on using atomic instructions

In some implementations, and for some memory types, the properties of atomicity can be met only by functionality
outside the PE. Some system implementations might not support atomic instructions for all regions of the memory.
In particular, this can apply to:
• Any type of memory in the system that does not support hardware cache coherency.
• Device, Non-cacheable memory, or memory that is treated as Non-cacheable, in an implementation that does
support hardware cache coherency.
In such implementations, it is defined by the system:
• Whether the atomic instructions are atomic in regard to other agents that access memory.
• If the atomic instructions are atomic in regard to other agents that access memory, which address ranges or
memory types this applies to.
An implementation can choose which memory type is treated as Non-cacheable.
The memory types for which it is architecturally guaranteed that the atomic instructions will be atomic are:
• Inner Shareable, Inner Write-Back, Outer Write-Back Normal memory with Read allocation hints and Write
allocation hints and not transient.
• Outer Shareable, Inner Write-Back, Outer Write-Back Normal memory with Read allocation hints and Write
allocation hints and not transient.
If the atomic instructions are not atomic in regard to other agents that access memory, then performing an atomic
instruction to such a location can have one or more of the following effects:
• The instruction generates a synchronous External abort.
• The instruction generates a System Error interrupt.

• The instruction generates an IMPLEMENTATION DEFINED MMU fault reported using the Data Abort Fault
status code of ESR_ELx.DFSC = 110101.
For the EL1&0 translation regime, if the atomic instruction is not supported because of the memory type that
is defined in the first stage of translation, or the second stage of translation is not enabled, then this exception
is a first stage abort and is taken to EL1. Otherwise, the exception is a second stage abort and is taken to EL2.
• The instruction is treated as a NOP.
• The instructions are performed, but there is no guarantee that the memory accesses were performed
atomically in regard to other agents that access memory. In this case, the instruction might also generate a
System Error interrupt.

I did not understand this part at all; leaving it here for now.

B2.3 Definition of the ARMv8 memory model

B2.3.1 Locations

Locations
A Location refers to a single byte in memory. This is what we normally call an address.
Memory effect
The Memory effects of an instruction are the read, write, or barrier effects of that instruction. For an instruction that accesses memory:
• A read effect is generated for each Location that is read by the instruction.
• A write effect is generated for each Location that is written by the instruction.
An instruction can generate both read and write effects.
In short: every memory-related operation, including reads, writes, and memory barriers.
Observer
An Observer refers to either a processing element, or some other memory accessing agent that can generate reads from or writes to memory. Anything that can access memory is an observer, for example a CPU, GPU, or ISP.
Common Shareability Domain
A Common Shareability Domain for a program is the smallest Shareability domain that contains all of the active Observers of the Memory effects generated by a program. My reading: the memory that all active observers can see (this is wrong, or at least imprecise).

B2.3.2 Ordering and observability

Register value dependencies
Register dependency
A Register dependency from a first data value V1 to a second data value V2 exists within a PE if and only if either:
• The register, excluding the AArch64 zero register (XZR or WZR), that is used to hold V1 is used in the calculation of V2.
• There is a Register dependency from V1 to a third data value V3 and there is a register dependency from V3 to V2.

A register dependency, for example:

mov r1, #5

add r2, r1, #4

Computing r2 uses r1, so r2 has a register dependency on r1.

 

Register data dependency
A Register data dependency from a first data value V1 to a second data value V2 exists within a PE if and only if either:
• The register, excluding the AArch64 zero register (XZR or WZR) and the AArch32 PC, that is used to hold V1 is used in the calculation of V2, and the calculation between V1 and V2 does not consist of either:
— A conditional branch whose condition is determined by V1.
— A conditional selection, move, or computation whose condition is determined by V1,
where the input data values for the selection, move, or computation do not have a data
dependency on V1.
• There is a Register data dependency from V1 to a third data value V3, and there is a Register
data dependency from V3 to V2.
No matter how I read this, I cannot see how it differs from Register dependency.
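One way to look for a difference between the two definitions is a conditional select whose condition comes from V1 but whose data inputs do not. Under my reading of the wording above, that path would count as a Register dependency but not as a Register data dependency. The sketch below is a toy under stated assumptions: the tuple encoding, the `flags` pseudo-register standing in for PSTATE condition flags, and the function name `depends_on` are made up for illustration and are not part of the architecture.

```python
def depends_on(program, source, target, follow_conditions):
    """Propagate 'depends on source' through a straight-line program.

    Each instruction is (dest, data_inputs, condition_inputs).
    follow_conditions=True  approximates a Register dependency.
    follow_conditions=False approximates a Register data dependency.
    """
    tainted = {source}
    for dest, data_inputs, condition_inputs in program:
        used = set(data_inputs)
        if follow_conditions:
            used |= set(condition_inputs)
        if used & tainted:
            tainted.add(dest)
        else:
            tainted.discard(dest)  # dest now holds an unrelated value
    return target in tainted


program = [
    ("flags", ["x1"], []),            # cmp  x1, #0
    ("x2", ["x3", "x4"], ["flags"]),  # csel x2, x3, x4, eq
    ("x5", ["x1"], []),               # add  x5, x1, #4
]

print(depends_on(program, "x1", "x2", follow_conditions=True))   # True
print(depends_on(program, "x1", "x2", follow_conditions=False))  # False
print(depends_on(program, "x1", "x5", follow_conditions=False))  # True
```

Here x2 is reached from x1 only through the condition of the select, so it shows up when conditions are followed and disappears when they are not, while x5 is computed directly from x1 and is reached either way.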
Address dependency
An Address dependency from a read R1 to a subsequent read R2 exists if and only if there is a Register data dependency from the data value that is returned by R1 to the address used by R2.
An Address dependency from a read R1 to a subsequent write W2 exists if and only if there is a Register dependency from the data value that is returned by R1 to the address used by W2.

R1 is a read, and the address used by a later instruction depends on R1's result.

ldr r1, [r3]

str r0, [r1, #2]

 

Data dependency
A Data dependency from a read R1 to a subsequent write W2 exists if and only if there is a Register dependency from the data value returned by R1 to the data value written by W2.

The difference from an Address dependency is that the value to be written depends on the result of the earlier read:

ldr r1, [r3]

str r1, [r0, #2]

Control dependency
A Control dependency from a read R1 to a subsequent instruction I2 exists if and only if either:
• There is a Register dependency from the data value returned by R1 to the data value used in the evaluation of a conditional branch, and I2 is only executed as a result of one of the possible outcomes of that conditional branch.
• There is a Register dependency from the data value returned by R1 to the data value used in the determination of a synchronous exception on an instruction I3, and I2 appears in program order after I3. (I did not understand this second case.)

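The result of R1 feeds the evaluation of a branch, so the instructions inside the branch have a Control dependency on R1.

ldr r1, [r0]

if (r1 == 0) {

I2

} else {

I3

}

Here both I2 and I3 have a Control dependency on the read R1.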

 

In every kind of dependency the first instruction is a read; a later instruction that needs the value read forms the dependency.

 

Ordering and observability at a Location
Reads-from
A Reads-from relation that couples reads and writes to the same Location such that each read is paired with a single write in the program. A read R2 of a Location Reads-from a write W1 to the same Location if and only if R2 takes its data from W1.
Note
The Reads-from relation represents a read being satisfied by a write and then returning the written data.
Only a read paired with a write forms a Reads-from relation: the write comes first, the read comes later, and the read returns the value that the write stored.
Coherence order
A Coherence order relation for each Location in the program that provides a total order on all writes from all coherent Observers to that Location, starting with a notional write of the initial value.
Note
The Coherence order of a Location represents the order in which writes to the Location arrive at memory.

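I did not fully follow the English. My current understanding:

Coherence order is a relation among multiple writes: the writes are sequenced, and that sequence is what all Observers observe.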

Coherence-after
A write W2 to a Location is Coherence-after another write W1 to the same Location if and only if W2 is sequenced after W1 in the Coherence order of the Location.
A write W2 to a Location is Coherence-after a read R1 of the same location if and only if R1 Reads-from a write W3 to the same Location and W2 is Coherence-after W3.
W2 being Coherence-after RW1 means that W2 must take place after RW1.
Overlapping accesses
Two Memory effects overlap if and only if they access the same Location. Two instructions overlap if and only if one or more of their generated Memory effects overlap. That is, the memory ranges the two instructions access share some part.
Observed-by

A read or a write RW1 from an Observer is Observed-by a write W2 from a different Observer if and only if W2 is coherence-after RW1.
A write W1 from an Observer is Observed-by a read R2 from a different Observer if and only if R2 Reads-from W1.
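Note: The Observed-by relation only relates accesses generated by different Observers.

Two different observers: how does this show up concretely in a program?

An access by one Observer is seen by another Observer. Specifically:

A read or write by Observer A is observed by a write from Observer B, meaning A's read or write must take place before B's write.

A write by Observer A is observed by a read from Observer B, meaning the value B reads must be the value A wrote.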
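Reads-from, Coherence-after and Observed-by can be written down as small predicates over one execution. A sketch under stated assumptions: `co` is the coherence order of a single Location as a list of write names, `rf` maps each read to the write it Reads-from, and `owner` maps each access to its Observer. All of these names, and the sample execution, are hypothetical.

```python
def coherence_after(co, rf, w2, rw1):
    """True if write w2 is Coherence-after rw1 (a write, or a read via its rf write)."""
    anchor = rf.get(rw1, rw1)  # a read is positioned at the write it Reads-from
    return co.index(w2) > co.index(anchor)


def observed_by(co, rf, owner, rw1, rw2):
    """True if rw1 is Observed-by rw2; only accesses of different Observers relate."""
    if owner[rw1] == owner[rw2]:
        return False
    if rw2 in rf:  # rw2 is a read: it must Reads-from rw1
        return rf[rw2] == rw1
    return coherence_after(co, rf, rw2, rw1)  # rw2 is a write


co = ["init", "W1", "W2"]  # writes to one Location, oldest first
rf = {"R1": "W1"}          # R1 returned the value written by W1
owner = {"W1": "P0", "R1": "P1", "W2": "P2"}

print(observed_by(co, rf, owner, "W1", "R1"))  # True: R1 Reads-from W1
print(observed_by(co, rf, owner, "R1", "W2"))  # True: W2 is Coherence-after R1
print(observed_by(co, rf, owner, "W2", "R1"))  # False: R1 did not read W2
```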

DMB FULL

A DMB FULL is a DMB with neither the LD nor the ST qualifier. Where this section refers to DMB without any qualification, then it is referring to all types of DMB.
Unless a specific shareability domain is defined, a DMB applies to the Common Shareability Domain.

All properties that apply to DMB also apply to the corresponding DSB.

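Memory barriers.
Everything above expresses a relation: Reads-from relates a read to a write, Coherence order is an ordering, Coherence-after is a before/after relation, and Observed-by says whether an access must be made visible to other Observers.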
Ordering relations

Dependency-ordered-before

 

 


This is a relation between a read and a later read or write.
A dependency creates externally-visible order between a read and another Memory effect generated by the same Observer. A read R1 is Dependency-ordered-before a read or write RW2 from the same Observer if and only if R1 appears in program order before RW2 and any of the following cases apply: In program order, a read R1 comes before another read or write RW2; if any one of the conditions below holds, R1 is Dependency-ordered-before RW2.
• There is an Address dependency or a Data dependency from R1 to RW2.

RW2 has an address or data dependency on R1, for example:

ldr r1, [r0]

str r1, [r2]

• RW2 is a write W2 and there is a Control dependency from R1 to W2.

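RW2 is a write W2 with a Control dependency on R1. Note that this case covers only writes, not reads.

ldr r1, [r0]

if (r1 == 1) {

ldr r2, [r5]

str r3, [r6]

}

In this branch, only str r3, [r6] satisfies the condition with respect to the read into r1.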

• RW2 is a read R2 generated by an instruction appearing in program order after an instruction I3 that generates a Context synchronization event, and there is a Control dependency from R1 to I3.

I3 has a Control dependency on R1, I3 generates a Context synchronization event, and R2 comes after I3.

The normal requirement for a Context synchronization event is: "No instructions appearing in program order after an instruction that causes a Context synchronization event will have performed any part of their functionality until the Context synchronization event has occurred." In other words, every instruction after it must wait until it has completed. So for a sequence like:

1.   ldr x1, [x2]

2.   context sync instruction

3.   ldr x3, [x4]

instruction 2 by itself already guarantees that 1 executes before 3. So my guess is that this rule targets the following case:

1.   ldr x1, [x2]

if (x1 == 0) {

context sync instruction

}

2.   ldr x3, [x4]

It guarantees that 1 executes before 2. This is only a guess.

• RW2 is a write W2 appearing in program order after a read or a write RW3 and there is an Address dependency from R1 to RW3. That is, there is an Address dependency from R1 to RW3, and W2 comes after RW3. So R1 is Dependency-ordered-before every write that follows an access with an Address dependency on it; this rule has a wide reach.
• RW2 is a write W2 that is Coherence-after a write W3 and there is a Control dependency or a Data dependency from R1 to W3. That is, W3 has a Control dependency or a Data dependency on R1, and W2 is Coherence-after W3.
• RW2 is a read R2 that Reads-from a write W3 and there is an Address dependency or a Data dependency from R1 to W3. That is, W3 has an Address dependency or a Data dependency on R1, and R2 Reads-from W3.
Atomic-ordered-before
Load-Exclusive and Store-Exclusive instructions provide some ordering guarantees, even in the absence of dependencies. A read or a write RW1 is Atomic-ordered-before a read or a write RW2 from the same Observer if and only if RW1 appears in program order before RW2 and either of the following cases apply:

Even without dependencies, Load-Exclusive and Store-Exclusive provide some ordering guarantees when RW1 appears before RW2 in program order.

If that were not the case, the program's result would simply be wrong, so why does it need to be stated explicitly?

• RW1 is a read R1 and RW2 is a write W2 such that R1 and W2 are generated by an atomic instruction or a successful Load-Exclusive/Store-Exclusive instruction pair to the same Location.

This covers the read and the write generated by a single instruction (is the same address required?),

or a Load-Exclusive/Store-Exclusive pair to the same address whose Store-Exclusive succeeds.

This rule is already implied by the Internal visibility requirement, so why is it listed separately?

• RW1 is a write W1 generated by an atomic instruction or a successful Store-Exclusive
instruction and RW2 is a read R2 generated by an instruction with Acquire or AcquirePC
semantics such that R2 Reads-from W1.

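R2 comes from an instruction with Acquire semantics, and W1 comes from an atomic instruction or a successful Store-Exclusive.

Doesn't Reads-from already guarantee this relation by itself? Why is it listed separately, and with the additional Acquire requirement?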

Barrier-ordered-before
Barrier instructions order prior Memory effects before subsequent Memory effects generated by the same Observer. A read or a write RW1 is Barrier-ordered-before a read or a write RW2 from the same Observer if and only if RW1 appears in program order before RW2 and any of the following cases apply:
 
• RW1 appears in program order before a DMB FULL or an atomic instruction with both Acquire and Release semantics that appears in program order before RW2. That is, a DMB FULL, or an atomic instruction with both Acquire and Release semantics, sits between RW1 and RW2.
• RW1 is a write W1 generated by an instruction with Release semantics and RW2 is a read R2
generated by an instruction with Acquire semantics.
W1 has Release semantics and R2 has Acquire semantics.
• RW1 is a read R1 and either:
— R1 appears in program order before a DMB LD that appears in program order before RW2.
— R1 is generated by an instruction with Acquire or AcquirePC semantics.

1. Of the form:

R1

DMB LD

RW2

2. Of the form:

R1_ACQ

RW2

• RW2 is a write W2 and either:
— RW1 is a write W1 appearing in program order before a DMB ST that appears in program order before W2.

— W2 is generated by an instruction with Release semantics.
— RW1 appears in program order before a write W3 generated by an instruction with Release semantics and W2 is Coherence-after W3.

1. 

W1

DMB ST

W2

2.

RW1

W2_RLS

3.

RW1

W3_RLS location_a

W2 location_a

W2 and W3 access the same address.
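The patterns listed above can be turned into a small checker. The version below is a partial sketch that covers only the DMB FULL, DMB LD, DMB ST, Acquire and Release cases; it leaves out atomic instructions with both Acquire and Release semantics, AcquirePC, and the Coherence-after case. The event encoding (`kind`, `sem`) and the function name are assumptions for illustration.

```python
def barrier_ordered_before(program, i, j):
    """Partial check of Barrier-ordered-before for program[i] and program[j].

    Each event is (kind, sem): kind is "R", "W", "DMB", "DMB LD" or "DMB ST"
    ("DMB" stands for DMB FULL); sem is "acq", "rel" or None.
    """
    if i >= j:
        return False
    (kind1, sem1), (kind2, sem2) = program[i], program[j]
    between = {kind for kind, _ in program[i + 1:j]}
    if "DMB" in between:  # full barrier in between
        return True
    if kind1 == "W" and sem1 == "rel" and kind2 == "R" and sem2 == "acq":
        return True
    if kind1 == "R" and ("DMB LD" in between or sem1 == "acq"):
        return True
    if kind2 == "W" and (sem2 == "rel" or (kind1 == "W" and "DMB ST" in between)):
        return True
    return False


load_barrier = [("R", None), ("DMB LD", None), ("W", None)]
store_store = [("W", None), ("DMB LD", None), ("W", None)]
release = [("R", None), ("W", "rel")]

print(barrier_ordered_before(load_barrier, 0, 2))  # True: R1, DMB LD, RW2
print(barrier_ordered_before(store_store, 0, 2))   # False: DMB LD does not order two writes
print(barrier_ordered_before(release, 0, 1))       # True: W2 has Release semantics
```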

 

Ordered-before
An arbitrary pair of Memory effects is ordered if it can be linked by a chain of ordered accesses consistent with external observation. A read or a write RW1 is Ordered-before a read or a write RW2 if and only if any of the following cases apply:
• RW1 is Observed-by RW2.
• RW1 is Dependency-ordered-before RW2.
• RW1 is Atomic-ordered-before RW2.
• RW1 is Barrier-ordered-before RW2.
• RW1 is Ordered-before a read or a write that is Ordered-before RW2.
 
   

 

 

B2.3.3 Ordering constraints

 

 

 

Other-multi-copy atomic
In an Other-multi-copy atomic system, it is required that a write from an Observer, if observed by a different Observer, is then observed by all other Observers that access the Location coherently. It is, however, permitted for an Observer to observe its own writes prior to making them visible to other observers in the system.
An observer may see its own write first, and other observers may see it later. But the other observers must see it at the same time: either none of them sees it or all of them do.
Architecturally well-formed
Internal visibility requirement
For a read or a write RW1 that appears in program order before a read or a write RW2 to the same Location, the internal visibility requirement requires that exactly one of the following statements is true:
• RW2 is a write W2 that is Coherence-after RW1.
• RW1 is a write W1 and RW2 is a read R2 such that either:
— R2 Reads-from W1.
— R2 Reads-from another write that is Coherence-after W1.
• RW1 and RW2 are both reads R1 and R2 such that R1 Reads-from a write W3 and either:
— R2 Reads-from W3.
— R2 Reads-from another write that is Coherence-after W3.
Note
If a Memory effect M1 from an Observer appears in program order before a Memory effect M2 from the same Observer, then M1 will be seen to occur before M2 by that Observer.
For a single observer: accesses to the same address stay in order; no order is required between different addresses.
External visibility requirement
For a read or a write RW1 from an Observer that is Ordered-before a read or a write RW2 from a different Observer, the external visibility constraint requires that RW2 is not Observed-by RW1. This means that an Architecturally well-formed execution must not exhibit a cycle in the Ordered-before relation.
Note
If a Memory effect M1 from an Observer is Ordered-before another Memory effect M2, from a different Observer, then M1 will be seen to occur before M2 by all Observers in the system.

Across different observers: if RW1 is Ordered-before RW2, then every observer sees RW1 take place before RW2.
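The "no cycle in the Ordered-before relation" reading can be tried on the classic message-passing shape. In the sketch below the edge lists are written out by hand for one candidate outcome (the flag read sees the new flag, the data read still sees the old data). The access names and the helper `has_cycle` are hypothetical, and deriving the edges automatically is out of scope.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "active" while on the DFS stack, "done" afterwards

    def visit(node):
        if state.get(node) == "active":
            return True
        if state.get(node) == "done":
            return False
        state[node] = "active"
        if any(visit(nxt) for nxt in graph[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in graph)


# P0: W_data ; DMB ; W_flag        P1: R_flag ; DMB ; R_data
with_barriers = [
    ("W_data", "W_flag"),  # Barrier-ordered-before (DMB on P0)
    ("W_flag", "R_flag"),  # Observed-by: R_flag Reads-from W_flag
    ("R_flag", "R_data"),  # Barrier-ordered-before (DMB on P1)
    ("R_data", "W_data"),  # Observed-by: W_data is Coherence-after R_data
]
# Same outcome, but P1 has no barrier and no dependency between its reads.
reader_unordered = [e for e in with_barriers if e != ("R_flag", "R_data")]

print(has_cycle(with_barriers))     # True: the stale-data outcome is forbidden
print(has_cycle(reader_unordered))  # False: the outcome is allowed
```

With both barriers in place the candidate outcome closes a cycle and is therefore not Architecturally well-formed; dropping the ordering on the reader side removes one edge and the cycle with it.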

 

   

 

B2.3.4 Completion and endpoint ordering

Normal Memory Completion
read
• A read R1 to a Location is complete for a shareability domain when all of the following are true:
— Any write to the same Location by an Observer within the shareability domain will be Coherence-after R1.
— Any translation table walks associated with R1 are complete for that shareability domain.
-- The read will not return a value that is written later.
write
• A write W1 to a Location is complete for a shareability domain when all of the following are true:
— Any write to the same Location by an Observer within the shareability domain will be Coherence-after W1.
— Any read to the same Location by an Observer within the shareability domain will either Reads-from W1 or Reads-from a write that is Coherence-after W1.
— Any translation table walks associated with the write are complete for that shareability domain.

-- Any later write is Coherence-after this write.

-- Any later read returns the value of this write, or the value of a write later than it.

translation table walk
A translation table walk is complete for a shareability domain when the memory accesses, including the updates to translation table entries, associated with the translation table walk are complete for that shareability domain, and the TLB is updated.
cache maintenance
A cache maintenance instruction is complete for a shareability domain when the memory effects of the instruction are complete for that shareability domain, and any translation table walks that arise from the instruction are complete for that shareability domain.
TLB invalidate
A TLB invalidate instruction is complete when all memory accesses using the TLB entries that have been invalidated are complete.
These completion rules mean that, for example, a cache maintenance instruction that operates by VA to the PoC completes only after memory at the PoC has been updated.

 

 

 

 

Device-nGnRnE memory Completion
Additionally, for Device-nGnRnE memory, a read or write of a Location in a Memory-mapped peripheral that exhibits side-effects is complete only when the read or write both:
• Can begin to affect the state of the Memory-mapped peripheral.
• Can trigger all associated side-effects, whether they affect other peripheral devices, PEs, or memory.
For Device-nGnRnE memory, an access counts as complete only once all of its side-effects have been triggered (not necessarily finished; is being triggered enough?).

 

Peripherals
Memory-mapped peripheral
A Memory-mapped peripheral occupies a memory region of IMPLEMENTATION DEFINED size and can be accessed using load and store instructions. Memory effects to a Memory-mapped peripheral can have side-effects, such as causing the peripheral to perform an action. Values that are read from addresses within a Memory-mapped peripheral might not correspond to the last data value written to those addresses. As such, Memory effects to a Memory-mapped peripheral might not appear in the Reads-from or Coherence order relations. In short: accesses have side-effects (for example, causing the peripheral to perform an action, or a read that does not return the last value written), and so the Reads-from and Coherence order relations do not necessarily apply.
Peripheral coherence order
For a read or a write RW1 and a read or a write RW2 to the same peripheral, then RW1 will appear in the Peripheral coherence order for the peripheral before RW2 if either of the following cases apply: If either of the two conditions below holds, RW1 precedes RW2 in the Peripheral coherence order.
• RW1 and RW2 are accesses using Non-cacheable or Device attributes and RW1 is
Ordered-before RW2.
That is, RW1 and RW2 use Non-cacheable or Device attributes, and RW1 is Ordered-before RW2.
• RW1 and RW2 are accesses using Device-nGnRE or Device-nGnRnE attributes and RW1 appears in program order before RW2. That is, both use Device-nGnRE or Device-nGnRnE attributes, and RW1 appears in program order before RW2.
Out-of-band-ordered-before

A read or a write RW1 is Out-of-band-ordered-before a read or a write RW2 if and only if either of the following cases apply:
• RW1 appears in program order before a DSB instruction that begins an IMPLEMENTATION DEFINED instruction sequence indirectly leading to the generation of RW2.
• RW1 is Ordered-before a read or a write RW3 and RW3 is Out-of-band-ordered-before RW2.

If a Memory effect M1 is Out-of-band-ordered-before a read or a write M2, then M1 is seen to occur before M2 by all Observers.

 

 

B2.3.5 Memory barriers

Memory barriers
Instruction Synchronization Barrier (ISB)
An ISB instruction ensures that all instructions that come after the ISB instruction in program order are fetched from the cache or memory after the ISB instruction has completed. (Note that fetching from the cache is allowed here. When PE A modifies PE B's instructions, the caches must be synchronized; even when PE B modifies its own instructions, synchronization is still needed because the I-cache and D-cache are separate.) Using an ISB ensures that the effects of context-changing operations executed before the ISB are visible to the instructions fetched after the ISB instruction. Examples of context-changing operations that require the insertion of an ISB instruction to ensure the effects of the operation are visible to instructions fetched after the ISB instruction are:
• Completed cache and TLB maintenance instructions.
• Changes to System registers.
Any context-changing operations appearing in program order after the ISB instruction only take effect after the ISB has been executed.

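1. Instructions after the ISB are refetched, and every change made before the ISB is visible after it.

From material found online: the instruction synchronization barrier is the strictest one. It flushes the pipeline so that all instructions before it have finished executing before any instruction after it executes.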

Data Memory Barrier (DMB)
The DMB instruction is a memory barrier instruction that ensures the relative order of memory accesses before the barrier with memory accesses after the barrier. The DMB instruction does not ensure the completion of any of the memory accesses for which it ensures relative order.
The full definition of the DMB is covered formally in the Definition of the ARMv8 memory model on page B2-97 and this introduction to the DMB instruction is not intended to contradict that section.
The basic principle of a DMB instruction is to introduce order between memory accesses that are specified to be affected by the DMB options supplied as arguments to the DMB instruction. The DMB instruction ensures that all affected memory accesses by the PE executing the DMB that appear in program order before the DMB and those which originate from a different PE, to the extent required by the DMB options, which have been Observed-by the PE before the DMB is executed, are Observed-by each PE, to the extent required by the DMB options, before any affected memory accesses that appear in program order after the DMB are Observed-by that PE.
The use of a DMB creates order between the Memory effects of instructions as described in the definition of Barrier-ordered-before.
DMB only affects memory accesses and the operation of data cache and unified cache maintenance instructions, see A64 Cache maintenance instructions on page D4-2364. It has no effect on the ordering of any other instructions executing on the PE. A DMB instruction intended to ensure the completion of cache maintenance instructions must have an access type of both loads and stores.

Any access that this PE has observed before the DMB executes must be observed by all PEs (to the extent set by the DMB options).
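Why the order between a store and a later load can need a barrier is easier to see on a toy machine with per-thread store buffers. This is a deliberately simplified, hypothetical model that is far less detailed than the real Armv8 memory model: stores enter a FIFO buffer, a buffered store can drain to memory at any time, a load checks its own buffer first, and `dmb` may execute only when the thread's buffer is empty. The instruction encoding and the functions `run` and `store_buffering` are assumptions.

```python
def run(programs):
    """Explore every schedule of the toy machine; return the set of (r0, r1) results."""
    results = set()

    def explore(pcs, buffers, memory, regs):
        progressed = False
        for t, program in enumerate(programs):
            if buffers[t]:  # drain the oldest buffered store of thread t
                (addr, value), rest = buffers[t][0], buffers[t][1:]
                drained = buffers[:t] + (rest,) + buffers[t + 1:]
                explore(pcs, drained, {**memory, addr: value}, regs)
                progressed = True
            if pcs[t] == len(program):
                continue
            op = program[pcs[t]]
            next_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if op[0] == "store":
                queued = buffers[t] + ((op[1], op[2]),)
                explore(next_pcs, buffers[:t] + (queued,) + buffers[t + 1:], memory, regs)
                progressed = True
            elif op[0] == "load":
                own = [v for a, v in buffers[t] if a == op[1]]
                value = own[-1] if own else memory[op[1]]
                explore(next_pcs, buffers, memory, {**regs, op[2]: value})
                progressed = True
            elif op[0] == "dmb" and not buffers[t]:
                explore(next_pcs, buffers, memory, regs)
                progressed = True
        if not progressed:  # every thread finished and every buffer drained
            results.add((regs["r0"], regs["r1"]))

    explore((0,) * len(programs), ((),) * len(programs), {"x": 0, "y": 0}, {})
    return results


def store_buffering(with_dmb):
    """Each thread stores to one variable, then loads the other one."""
    barrier = [("dmb",)] if with_dmb else []
    p0 = [("store", "x", 1)] + barrier + [("load", "y", "r0")]
    p1 = [("store", "y", 1)] + barrier + [("load", "x", "r1")]
    return run([p0, p1])


print((0, 0) in store_buffering(with_dmb=False))  # True
print((0, 0) in store_buffering(with_dmb=True))   # False
```

In this toy machine the barrier only forces the earlier store out of the buffer before the later load runs, which is enough to rule out the result where both loads return 0.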

 

Consumption of Speculative Data Barrier (CSDB)
The CSDB instruction is a memory barrier instruction that controls speculative execution and data value prediction.
This includes:
• Data value predictions of any instructions.
• PSTATE.{N,Z,C,V} predictions of any instructions other than conditional branch instructions appearing in
program order before the CSDB that have not been architecturally resolved.
• Predictions of SVE predication state for any SVE instructions.
For purposes of the definition of CSDB, PSTATE.{N,Z,C,V} is not considered a data value. This definition permits:
• Control flow speculation before and after the CSDB.
• Speculative execution of conditional data processing instructions after the CSDB, unless they use the results
of data value or PSTATE.{N,Z,C,V} predictions of instructions appearing in program order before the CSDB
that have not been architecturally resolved.
 
Speculative Store Bypass Barrier (SSBB)
The SSBB is a memory barrier that prevents speculative loads from bypassing earlier stores to the same virtual address under certain conditions.
The semantics of the Speculative Store Bypass Barrier are:
• When a load to a location appears in program order after the SSBB, then the load does not speculatively read
an entry earlier in the coherence order for that location than the entry generated by the latest store satisfying
all of the following conditions:
— The store is to the same location as the load.
— The store uses the same virtual address as the load.
— The store appears in program order before the SSBB.
• When a load to a location appears in program order before the SSBB, then the load does not speculatively read
data from any store satisfying all of the following conditions:
— The store is to the same location as the load.
— The store uses the same virtual address as the load.
— The store appears in program order before the SSBB.
 
Physical Speculative Store Bypass Barrier (PSSBB)
The PSSBB is a memory barrier that prevents speculative loads from bypassing earlier stores to the same physical address under certain conditions.
The semantics of the Physical Speculative Store Bypass Barrier are:
• When a load to a location appears in program order after the PSSBB, then the load does not speculatively read
an entry earlier in the coherence order for that location than the entry generated by the latest store satisfying
all of the following conditions:
— The store is to the same location as the load.
— The store appears in program order before the PSSBB.
• When a load to a location appears in program order before the PSSBB, then the load does not speculatively read
data from any store satisfying all of the following conditions:
— The store is to the same location as the load.
— The store appears in program order before the PSSBB.
 
Trace Synchronization Barrier (TSB CSYNC)
The TSB CSYNC is a memory barrier instruction that preserves the relative order of memory accesses to System registers due to trace operations and other memory accesses to the same registers.
A trace operation is an operation of the PE Trace Unit generating trace for an instruction when ARMv8.4-Trace is
implemented and enabled.
A TSB CSYNC is not required to execute in program order with respect to other instructions. This includes being
reordered with respect to other trace instructions. One or more context synchronization events are required to ensure
that TSB CSYNC is executed in the necessary order.
If trace is generated between a context synchronization event and a TSB CSYNC operation, these trace operations may
be reordered with respect to the TSB CSYNC operation, and therefore may not be synchronized.
The following situations are synchronized using a TSB CSYNC:
• A direct write B to a System register is ordered after an indirect read or indirect write of the same register by
a trace operation A, if all of the following are true:
— A is executed in program order before a context synchronization event C.
— C is in program order before a TSB CSYNC operation T.
— B is executed in program order after T.
• A direct read B of a System register is ordered after an indirect write to the same register by a trace operation
A, if all of the following are true:
— A is executed in program order before a context synchronization event C1.
— C1 is in program order before TSB CSYNC operation T.
— T is executed in program order before a second context synchronization event C2.
— B is executed in program order after C2.
A TSB CSYNC operation is not needed to ensure a direct write B to a System register is ordered before an indirect read
or indirect write of the same register by a trace operation A, if all the following are true:
• A is executed in program order after a context synchronization event C.
• B is executed in program order before C.
The pseudocode function for the operation of a TSB CSYNC is TraceSynchronizationBarrier().
 
Data Synchronization Barrier (DSB)

A DSB is a memory barrier that ensures that memory accesses that occur before the DSB have completed before the completion of the DSB instruction. In doing this, it acts as a stronger barrier than a DMB and all ordering that is created by a DMB with specific options is also generated by a DSB with the same options.
Execution of a DSB:
• At EL2 ensures that any memory accesses caused by Speculative translation table walks from the EL1&0 translation regime have been observed.
• At EL3 ensures that any memory accesses caused by speculative translation table walks from the EL2 or EL2&0 translation regime have been observed.
For more information, see Use of out-of-context translation regimes on page D5-2406.
A DSB executed by a PE, PEe, completes when all of the following apply:
• All explicit memory accesses of the required access types appearing in program order before the DSB are complete for the set of observers in the required shareability domain.
• If the required access types of the DSB is reads and writes, then all cache maintenance instructions and all TLB maintenance instructions issued by PEe before the DSB are complete for the required shareability domain. 

In addition, no instruction that appears in program order after the DSB instruction can alter any state of the system or perform any part of its functionality until the DSB completes other than:
• Being fetched from memory and decoded.
• Reading the general-purpose, SIMD and floating-point, Special-purpose, or System registers that are directly or indirectly read without causing side-effects.

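From material found online: DMB is useful when working with dual-port RAM and in multi-core designs. If accesses to the RAM are buffered and a read immediately follows a write, the write has to be given "a moment to breathe": a DMB separates the two so that the buffered data has actually reached the RAM. DSB is safer than DMB (at a cost in execution time). It errs on the side of caution: it drains the write buffer, so every instruction after it waits for the earlier accesses to complete, whether or not that instruction uses their results. Experienced programmers can use DMB when they are fully confident it is enough; beginners are safer with DSB.
Compared with DMB and DSB, ISB looks like the most forceful of the three, but it is blunt and acts without discrimination. It does have another use: it is very helpful for an advanced low-level technique, self-modifying code. For example, if a program rewrites itself starting at the next instruction to be executed, but the old instructions have already been prefetched into the pipeline, the pipeline must be flushed so that the old instructions are discarded and the new ones are fetched. An ISB must therefore be executed before the updated code runs, so that the old code is flushed from the pipeline and can no longer execute. (The translator considers this technique overly clever and somewhat showy; it should be very rare in real-world programming, so readers need not dwell on it.)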
Shareability and access limitations on the data barrier operations
The DMB and DSB instructions take an argument that specifies:
• The shareability domain over which the instruction must operate. This is one of:
— Full system.
— Outer Shareable.
— Inner Shareable.
— Non-shareable.
• The accesses for which the instruction operates. This is one of:
— Read and write accesses, both before and after the barrier instruction.
— Write accesses only, before and after the barrier instruction.
— Read accesses before the barrier instruction, and read and write accesses after the barrier instruction.
Note
This form of a DMB or DSB instruction can be described as a Load-Load/Store barrier.
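In short, the DMB and DSB instructions take options that select the shareability domain and the access types the barrier applies to.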
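The two arguments described above map onto a single 4-bit option field. The sketch below is illustrative only: the enum and function names are hypothetical, and the encodings are the A64 `DMB`/`DSB` option values as I recall them, so they should be checked against the instruction descriptions before being relied on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names. Values are assumed to match CRm bits [3:2]. */
enum barrier_domain {
    DOMAIN_OUTER_SHAREABLE = 0,
    DOMAIN_NON_SHAREABLE   = 1,
    DOMAIN_INNER_SHAREABLE = 2,
    DOMAIN_FULL_SYSTEM     = 3
};

/* Hypothetical names. Values are assumed to match CRm bits [1:0]. */
enum barrier_access {
    ACCESS_LOADS  = 1,  /* reads before; reads and writes after */
    ACCESS_STORES = 2,  /* writes only, before and after */
    ACCESS_ALL    = 3   /* reads and writes, before and after */
};

/* Combine domain and access type into the 4-bit option field. */
static inline uint8_t barrier_crm(enum barrier_domain d, enum barrier_access a)
{
    return (uint8_t)(((unsigned)d << 2) | (unsigned)a);
}

/* Assembler option name for a field value, or NULL if not listed here. */
static inline const char *barrier_option_name(uint8_t crm)
{
    static const char *const names[16] = {
        [0x1] = "OSHLD", [0x2] = "OSHST", [0x3] = "OSH",
        [0x5] = "NSHLD", [0x6] = "NSHST", [0x7] = "NSH",
        [0x9] = "ISHLD", [0xA] = "ISHST", [0xB] = "ISH",
        [0xD] = "LD",    [0xE] = "ST",    [0xF] = "SY"
    };
    return crm < 16 ? names[crm] : NULL;
}
```

For example, an Inner Shareable barrier on all accesses yields the option `ISH`, and the "Load-Load/Store" form for the full system yields `LD`.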
Load-Acquire
The basic principle of both Load-Acquire and Load-AcquirePC instructions is to introduce order between the memory access generated by the Load-Acquire or Load-AcquirePC instruction and the memory accesses appearing in program order after the Load-Acquire or Load-AcquirePC instruction, such that the memory access generated by the Load-Acquire or Load-AcquirePC instruction is Observed-by each PE, to the extent that the PE is required to observe the access coherently, before any of the memory accesses appearing in program order after the Load-Acquire or Load-AcquirePC instruction are Observed-by that PE, to the extent that the PE is required to observe the accesses coherently.

In other words, a Load-Acquire requires that its own memory access is observed by every observer before any memory access that follows it in program order.

 

Load-AcquirePC
Store-Release
The basic principle of a Store-Release instruction is to introduce order between the memory access generated by the Store-Release and a set of earlier memory accesses: those generated by the PE executing the Store-Release instruction that appear in program order before it, together with those which originate from a different PE, to the extent that the PE is required to observe them coherently, Observed-by the PE before executing the Store-Release.
In other words, a Store-Release requires that the memory accesses before it are observed by every observer before the store itself is observed.
LoadLOAcquire  
StoreLORelease  
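    1. Load-Acquire (a read with Acquire semantics)
        1. Acts as a one-way barrier for later accesses (half a DMB)
         It guarantees only that memory accesses after this instruction start after this instruction has completed
    2. Store-Release (a write with Release semantics)
        1. Acts as a one-way barrier for earlier accesses (half a DMB)
         It guarantees only that this instruction executes after all memory accesses before it have completed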
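The one-way behaviour summarized above is what the C11 acquire and release memory orders express. The following is a minimal sketch of the usual message-passing pattern; the names `payload`, `ready`, `publish` and `try_consume` are made up for illustration. Compilers targeting AArch64 typically implement these operations with Load-Acquire and Store-Release instructions, but that lowering is up to the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* ordinary, non-atomic data */
static atomic_bool ready = false;   /* flag that publishes the data */

/* Producer: write the data, then store the flag with release semantics.
 * The write to payload cannot be observed after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: load the flag with acquire semantics. If the flag is set,
 * the read of payload cannot be observed before the flag load. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

A consumer thread that sees `try_consume` return true is guaranteed to read the value written by the matching `publish` call, without a full barrier on either side.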

 

B2.4 Caches and memory hierarchy

B2.4.2 Memory hierarchy

 

cacheability and shareability
Cacheability
This attribute defines whether memory locations are allowed to be allocated into a cache or not. Cacheability is defined independently for Inner and Outer Cacheability locations.
It defines whether a memory location may be cached. Cacheability is unrelated to read-ahead: memory that cannot be cached can still be read ahead.
Shareability
This attribute defines whether memory locations are shareable between different agents in a system. Marking a memory location as shareable for a particular domain requires hardware to ensure that the location is coherent for all agents in that domain. Shareability is defined independently for Inner and Outer Shareability domains.
Shareability describes whether a memory location is accessed by more than one observer.
   
Point of Coherency (PoC)
The point at which all agents that can access memory are guaranteed to see the same copy of a memory location for accesses of any memory type or cacheability attribute. In many cases this is effectively the main system memory, although the architecture does not prohibit the implementation of caches beyond the PoC that have no effect on the coherency between memory system agents.
In short, the PoC is the point at which all agents in the system see the same memory contents.
Point of Unification (PoU)
The PoU for a PE is the point by which the instruction and data caches and the translation table walks of that PE are guaranteed to see the same copy of a memory location. In many cases, the Point of Unification is the point in a uniprocessor memory system by which the instruction and data caches and the translation table walks have merged. The PoU for an Inner Shareable shareability domain is the point by which the instruction and data caches and the translation table walks of all the PEs in that Inner Shareable shareability domain are guaranteed to see the same copy of a memory location. Defining this point permits self-modifying software to ensure future instruction fetches are associated with the modified version of the software by using the standard correctness policy of:
1. Clean data cache entry by address.
2. Invalidate instruction cache entry by address.
In short, the PoU is defined per PE: at the PoU, that PE's caches (data, instruction, TLB) all see the same memory contents.
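The two-step policy above is rarely written by hand in C. A minimal sketch, assuming a GCC- or Clang-compatible compiler: the builtin `__builtin___clear_cache` asks the toolchain to make a range of freshly written bytes visible to instruction fetch. On AArch64 this is expected to perform the clean and invalidate steps together with the barriers they need, and on targets with coherent instruction caches it may do nothing. The function name `install_code` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy new instruction bytes into a code buffer and make them visible
 * to instruction fetch. The buffer is assumed to be writable here and
 * executable when it is later called.
 */
void *install_code(void *dst, const void *src, size_t len)
{
    char *begin = (char *)dst;
    char *end = begin + len;

    memcpy(begin, src, len);              /* write through the data side */
    __builtin___clear_cache(begin, end);  /* sync data and instruction views */
    return end;                           /* first byte after the new code */
}
```

The builtin takes the start and one-past-the-end addresses of the range, which is why the helper returns `end` for chaining further writes.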
Point of Persistence (PoP)
The point in a memory system, if it exists, at or beyond the Point of Coherency, where a write to memory is maintained when system power is removed, and reliably recovered when power is restored to the affected locations in memory.
In short, memory at this point is persistent.

Xingguang Feng, 5 years ago

Hi Larry,

The concepts of PoU and PoC can be illustrated with the figures below:

PoU: (Note: the observers for the PoU also include the TLB.)

1.png

PoC:

2.png

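> 1. It seems that PoC and PoU differ only in who is looking at memory: PoC means that the individual agents should see consistent data for the same memory,

> while PoU means that the L1 caches should see consistent data for the same memory. Is this understanding correct?

Yes, you can understand it that way.

> 2. Also, where are the concepts of PoC and PoU actually used?

They are used in cache maintenance operations. You will see instructions of this form: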

<operation> to PoU

<operation> to PoC

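The location of the PoC and PoU differs from processor to processor. For the Cortex-A7, assuming the integrated L2 cache is configured:

- The PoU is at the L2 cache

- The PoC is at the ACE master interface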

3.png

BR,

Xingguang Feng

   

B2.4.3 Application level access to functionality related to caches

B2.4.4 Implication of caches for the application programmer

Context synchronization event
  

The following cause a Context synchronization event:

One of:
• Performing an ISB operation. An ISB operation is performed when an ISB instruction is executed and does not fail its condition code check.
• Taking an exception.
• Returning from an exception.
• Exit from Debug state.
• Executing a DCPS instruction.
• Executing a DRPS instruction.

Of these, the last two (DCPS and DRPS) are both related to debug.

The effects of a Context synchronization event are:
• All unmasked interrupts that are pending at the time of the Context synchronization event are taken before the first instruction after the Context synchronization event.
• If halting is allowed, all Halting debug events that are pending at the time of the Context synchronization event are taken before the first instruction after the Context synchronization event.
• No instructions appearing in program order after an instruction that causes a Context synchronization event will have performed any part of their functionality until the Context synchronization event has occurred.
• All direct and indirect writes to System registers that are made before the Context synchronization event affect any instruction, including a direct read, that appears in program order after the instruction causing the Context synchronization event.
• All completed changes to the translation tables for entries that, before the change, were not permitted to be cached in a TLB, affect all instruction fetches that appear in program order after the instruction causing the Context synchronization event.
• All invalidations of TLBs, instruction caches, and, in AArch32 state, branch predictors, that are completed before the Context synchronization event affect all instructions that appear in program order after an instruction causing a Context synchronization event.
• In AArch32 state, all Non-cacheable writes that are completed before the Context synchronization event affect all instructions that appear in program order after an instruction causing a Context synchronization event.

• Changes to the Debug external authentication interfaces that are made before the Context synchronization event affect any instruction that appears in program order after the instruction causing the Context synchronization event.

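The effects, in short:

1. Pending unmasked interrupts are taken.

2. Pending Halting debug events are taken.

3. No instruction that appears in program order after the Context synchronization event is executed before the event has occurred.

4. All changes to System registers made before the Context synchronization event are visible to the instructions after it; that is, a later instruction cannot see a stale register value.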

 

 

Note:

• The architecture requires that instructions that generate Context synchronization events do not appear to be executed speculatively, except that the performance monitor counters are permitted to reveal such speculation.

 

 

Some important points
The PE might have fetched the instructions from memory at any time since the last Context synchronization event on that PE.
In other words, an instruction fetch can happen at any time.
Any instructions fetched in this way might be executed multiple times, if this is required by the execution of the program, without being refetched from memory. In the absence of a Context synchronization event, there is no limit on the number of times such an instruction might be executed without being refetched from memory.
An instruction that has already been fetched may be executed many times, for example on each iteration of a loop, without the PE reading it from memory again.
The Arm architecture does not require the hardware to ensure coherency between instruction caches and memory, even for locations of shared memory.
The architecture does not keep the instruction cache coherent with memory. Presumably this means that when a PE writes a value through its data cache to memory, the instruction cache is neither invalidated nor updated.
If software requires coherency between instruction execution and memory, it must manage this coherency using Context synchronization events and cache maintenance instructions.
If a program needs the instructions it executes (which may have been modified) to be coherent with memory, it must use cache maintenance instructions and Context synchronization events: first clean the new contents to memory with a cache maintenance instruction, then invalidate the instruction cache, and finally use a Context synchronization event to flush the pipeline.
In AArch64 state, instruction accesses to Non-cacheable Normal memory can be held in instruction caches.
Non-cacheable Normal memory can therefore also end up in the instruction cache.
How far ahead of the current point of execution instructions are fetched from is IMPLEMENTATION DEFINED. Such prefetching can be either a fixed or a dynamically varying number of instructions, and can follow any or all possible future execution paths. For all types of memory:
• The PE might have fetched the instructions from memory at any time since the last Context synchronization event on that PE.
• Any instructions fetched in this way might be executed multiple times, if this is required by the execution of the program, without being refetched from memory. In the absence of a Context synchronization event, there is no limit on the number of times such an instruction might be executed without being refetched from memory.

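How far ahead instructions are prefetched is not fixed.

An instruction may be prefetched more than once.

A prefetched instruction may be executed more than once.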

 

 

B2.5 Alignment support

B2.5.1 Instruction alignment

PC alignment checking
PC alignment checking generates a PC alignment fault exception associated with the instruction fetch if, in AArch64 state, there is an attempt to architecturally execute an instruction that was fetched with a misaligned PC. A misaligned PC is when bits[1:0] of the PC are not 0b00.
Instructions are 32-bit aligned; attempting to execute an instruction at a misaligned PC generates a PC alignment fault.
As with Instruction Aborts, speculative fetching of an instruction does not generate an exception. An exception occurs only on an attempt to architecturally execute the instruction.
Speculative fetching does not generate an exception; the exception is generated only when execution of the instruction is actually attempted.
AArch64.CheckPCAlignment()
    bits(64) pc = ThisInstrAddr();
    if pc<1:0> != '00' then
        AArch64.PCAlignmentFault();
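The same test can be written in C. This is only a sketch of the bit check in the pseudocode above; the function name is hypothetical and raising the fault is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bits[1:0] of the PC are not 0b00, i.e. the PC is misaligned. */
static inline bool pc_is_misaligned(uint64_t pc)
{
    return (pc & 0x3u) != 0;
}
```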
 

 

SP alignment checking
A misaligned stack pointer is where bits[3:0] of the stack pointer are not 0b0000, when the stack pointer is used as the base address of the calculation, regardless of any offset applied by the instruction.
The stack pointer must be 16-byte aligned; the check can be enabled or disabled.
As with Data Aborts, a speculative data access to memory using the stack pointer does not generate the exception. The exception occurs only on an attempt to architecturally execute the instruction.
A speculative access does not cause the fault; the fault is generated only when the instruction is actually executed.
Prefetch memory abort instructions do not cause synchronous exceptions.
Prefetches do not cause exceptions.
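As a small illustration of the bits[3:0] rule above, the helpers below test a stack pointer value and round it down to the nearest 16-byte boundary, which is the usual direction for a descending stack. Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when bits[3:0] of the stack pointer are not 0b0000. */
static inline bool sp_is_misaligned(uint64_t sp)
{
    return (sp & 0xFu) != 0;
}

/* Round a stack pointer value down to a 16-byte boundary. */
static inline uint64_t sp_align_down(uint64_t sp)
{
    return sp & ~(uint64_t)0xF;
}
```

Note that the check applies to the base register value only: an aligned stack pointer with an unaligned offset does not count as a misaligned stack pointer.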

B2.5.2 Alignment of data accesses

Alignment of data accesses
An unaligned access to any type of Device memory causes an Alignment fault.
Accesses to Device memory must be aligned.
The alignment requirements for accesses to Normal memory are as follows:
• For all instructions that load or store a single or multiple registers, other than Load-Exclusive/Store-Exclusive and Load-Acquire/Store-Release, if the address that is accessed is not aligned to the size of the data element being accessed, then one of the following occurs:
— An Alignment fault is generated.
— An unaligned access is performed.

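An unaligned access therefore has two possible outcomes (selectable through a register setting):

1. A fault is generated.

2. The access is performed normally as an unaligned access.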

All Load-Exclusive/Store-Exclusive, Load-Acquire/Store-Release, and Compare and Swap memory accesses that access a single element or a pair of elements generate an Alignment fault if the address being accessed is not aligned to the size of the data structure being accessed.
That is, Load-Exclusive/Store-Exclusive, Load-Acquire/Store-Release, and Compare and Swap accesses must be aligned; otherwise they fault.
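The rules quoted in this subsection can be summarized as a small decision function. This is a sketch that models only the text above: every type and function name is hypothetical, `alignment_check_enabled` stands in for whatever register control selects between the two outcomes, and sizes are assumed to be powers of two.

```c
#include <stdbool.h>
#include <stdint.h>

enum mem_type   { MEM_NORMAL, MEM_DEVICE };
enum insn_class { INSN_PLAIN_LDST, INSN_EXCL_ACQ_REL_CAS };
enum outcome    { OUTCOME_PERFORMED, OUTCOME_ALIGNMENT_FAULT };

/* True when addr is a multiple of size (size must be a power of two). */
static inline bool is_aligned(uint64_t addr, uint64_t size)
{
    return (addr & (size - 1)) == 0;
}

/* Classify one data access according to the rules quoted above. */
enum outcome classify_access(uint64_t addr, uint64_t size,
                             enum mem_type mem, enum insn_class insn,
                             bool alignment_check_enabled)
{
    if (is_aligned(addr, size))
        return OUTCOME_PERFORMED;
    if (mem == MEM_DEVICE)                 /* Device: always faults */
        return OUTCOME_ALIGNMENT_FAULT;
    if (insn == INSN_EXCL_ACQ_REL_CAS)     /* exclusive/ordered/CAS: faults */
        return OUTCOME_ALIGNMENT_FAULT;
    /* Plain load/store to Normal memory: depends on the configuration. */
    return alignment_check_enabled ? OUTCOME_ALIGNMENT_FAULT
                                   : OUTCOME_PERFORMED;
}
```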

B2.5.3 Unaligned data access restrictions

Unaligned data access restrictions
Accesses are not guaranteed to be single-copy atomic except at the byte access level. Single-copy atomicity of the whole access cannot be relied on.
Unaligned accesses typically take a number of additional cycles to complete compared to a naturally-aligned access. They cost extra clock cycles.
An operation that performs an unaligned access can abort on any memory access that it makes, and can abort on more than one access. This means that an unaligned access that occurs across a page boundary can generate an abort on either side of the boundary.

An abort can therefore occur on any of the accesses that make up the unaligned access.
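Whether an access can abort on both sides of a boundary depends on whether it straddles two pages. A minimal sketch of that test follows; the function name is hypothetical, `size` is assumed to be at least 1, and the page size is a parameter because it depends on the translation configuration (4096 bytes is used in the example below only as an assumption).

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the bytes [addr, addr + size) span more than one page. */
static inline bool crosses_page(uint64_t addr, uint64_t size,
                                uint64_t page_size)
{
    uint64_t first_page = addr / page_size;
    uint64_t last_page  = (addr + size - 1) / page_size;
    return first_page != last_page;
}
```

With 4096-byte pages, a 4-byte access at 0xFFE touches the last two bytes of one page and the first two bytes of the next, so it could abort on either page.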

 

 

B2.6 Endian support

In ARMv8-A, A64 instructions have a fixed length of 32 bits and are always little-endian. Instructions are always little-endian, whatever the data endianness.
All memory-mapped peripherals defined in the Arm architecture must be little-endian. Peripheral registers are always little-endian.
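Because A64 instructions are always stored little-endian, a tool that reads code bytes should assemble each 32-bit word from the least significant byte upward, regardless of the host's own byte order. A small sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Assemble one 32-bit A64 instruction word from four little-endian bytes. */
static inline uint32_t read_a64_instruction(const uint8_t bytes[4])
{
    return (uint32_t)bytes[0]
         | ((uint32_t)bytes[1] << 8)
         | ((uint32_t)bytes[2] << 16)
         | ((uint32_t)bytes[3] << 24);
}
```

For example, the A64 `NOP` instruction word 0xD503201F appears in memory as the byte sequence 1F 20 03 D5.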
  
  
  
  
  
  
  
  

 

 

 

B2.7 Memory types and attributes

B2.7.1 Normal memory

Normal memory
It indicates that the hardware is permitted by the architecture to perform Speculative data read accesses to these locations, regardless of the access permissions for these locations.
In short, Normal memory can be read speculatively.
   
   
Inner Shareability domain
Each Inner Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Inner Shareable attribute made by any member of that set.
That is, the domain contains a set of observers, and data accesses with the Inner Shareable attribute made by any member of the set are guaranteed to be data coherent across the set.
Outer Shareability domain
Each Outer Shareability domain contains a set of observers that are data coherent for each member of that set for data accesses with the Outer Shareable attribute made by any member of that set.
The same as above, except that it applies to data accesses with the Outer Shareable attribute.
Non-shareable Normal memory

For Normal memory locations, the Non-shareable attribute identifies Normal memory that is likely to be accessed only by a single PE.

A location in Normal memory with the Non-shareable attribute does not require the hardware to make data accesses by different observers coherent, unless the memory is Non-cacheable. For a Non-shareable location, if other observers share the memory system, software must use cache maintenance instructions, if the presence of caches might lead to coherency issues when communicating between the observers. This cache maintenance requirement is in addition to the barrier operations that are required to ensure memory ordering.
For Non-shareable Normal memory, it is IMPLEMENTATION DEFINED whether the Load-Exclusive and Store-Exclusive synchronization primitives take account of the possibility of accesses by more than one observer.

Put simply, Non-shareable Normal memory is memory that is expected to be accessed by only one PE.
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   

 

 

 

 

 

 

 

 

 

 

 
