Performance Optimization Ideas


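Memory optimization

The general ideas are:

Allocate fewer objects, which reduces the number of objects the GC has to scan.
Avoid frequently creating and destroying temporary objects, which puts pressure on the GC.
Preallocate to reduce the number of times a container has to grow.
Where possible, allocate one contiguous, sufficiently large memory buffer for data processing.
Use built-in functions or methods that reduce memory copying.
Where it matters, deliberately check escape analysis and keep objects on the stack as far as possible.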

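Allocate contiguous memory

When converting a []*A into a []*B, first allocate one contiguous block with make([]B, len(sliceA)) and let the pointers in the result refer to its elements.

The benefits are: 1. The memory is contiguous, so iterating over the elements is faster. 2. It saves len(sliceA)-1 memory allocations compared with allocating each B separately.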

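package main

// A is the source element type.
type A struct {
    A1 int32
    A2 int32
}

// B is the destination element type.
type B struct {
    B1 int
    B2 int
}

// conv converts []*A to []*B. All B values live in one contiguous
// backing array (tempSliceB) instead of being allocated one by one.
func conv(sliceA []*A) []*B {
    var (
        tempSliceB = make([]B, len(sliceA))  // one allocation for every B
        sliceB     = make([]*B, len(sliceA)) // pointers into tempSliceB
    )
    for i := 0; i < len(sliceA); i++ {
        tempSliceB[i].B1 = int(sliceA[i].A1)
        tempSliceB[i].B2 = int(sliceA[i].A2)
        sliceB[i] = &tempSliceB[i]
    }
    return sliceB
}

func main() {
    var sliceA = []*A{{A1: 0}, {A1: 1}, {A1: 2}}
    conv(sliceA)
}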

Memory alignment

Order struct fields sensibly

package main

import (
    "fmt"
    "unsafe"
)

type A struct {
    a int32
    b int32
    c int64
}

type B struct {
    a int32
    c int64
    b int32
}

func main() {
    fmt.Printf("size of A is %d\n", unsafe.Sizeof(A{})) // size of A is 16
    fmt.Printf("size of B is %d\n", unsafe.Sizeof(B{})) // size of B is 24
}

Group fields by type. For example, a map bucket stores key1, key2, key3, value1, value2, value3 packed contiguously, which avoids unnecessary padding.

// https://github.com/golang/go/blob/e9e0d1ef704c4bba3927522be86937164a61100c/src/runtime/map.go#L150-L150
// A bucket for a Go map.
type bmap struct {
  // tophash generally contains the top byte of the hash value
  // for each key in this bucket. If tophash[0] < minTopHash,
  // tophash[0] is a bucket evacuation state instead.
  tophash [bucketCnt]uint8
  // Followed by bucketCnt keys and then bucketCnt elems.
  // NOTE: packing all the keys together and then all the elems together makes the
  // code a bit more complicated than alternating key/elem/key/elem/... but it allows
  // us to eliminate padding which would be needed for, e.g., map[int64]int8.
  // Followed by an overflow pointer.
}

Use explicit padding to avoid false sharing:

// https://github.com/golang/go/blob/e9e0d1ef704c4bba3927522be86937164a61100c/src/sync/pool.go#L70-L70
type poolLocal struct {
  poolLocalInternal

  // Prevents false sharing on widespread platforms with
  // 128 mod (cache line size) = 0 .
  pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}
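
The same padding idea can be sketched outside the runtime. In the minimal example below, paddedCounter, cacheLinePad, paddedSize and countInParallel are hypothetical names introduced only for illustration, and 128 is borrowed from poolLocal rather than measured on any particular CPU. Each worker updates only its own counter, and the padding keeps neighbouring counters at least 128 bytes apart.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// cacheLinePad is an assumed padding unit; 128 mirrors the value used by
// sync.Pool's poolLocal and is not a measured cache line size.
const cacheLinePad = 128

// paddedCounter is a hypothetical per-worker counter. The blank field
// rounds the struct size up to cacheLinePad bytes, so the n fields of
// adjacent slice elements sit cacheLinePad bytes apart.
type paddedCounter struct {
	n int64
	_ [cacheLinePad - unsafe.Sizeof(int64(0))%cacheLinePad]byte
}

// paddedSize reports the size of one paddedCounter in bytes.
func paddedSize() uintptr {
	return unsafe.Sizeof(paddedCounter{})
}

// countInParallel starts `workers` goroutines, lets each increment only
// its own counter perWorker times, and returns the sum of all counters.
func countInParallel(workers, perWorker int) int64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(c *paddedCounter) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&c.n, 1)
			}
		}(&counters[w])
	}
	wg.Wait()

	var total int64
	for i := range counters {
		total += atomic.LoadInt64(&counters[i].n)
	}
	return total
}

func main() {
	fmt.Println(paddedSize())
	fmt.Println(countInParallel(4, 1000))
}
```

Whether the padding pays off depends on the hardware and the access pattern, so it is worth benchmarking with and without the blank field before keeping it.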

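Reduce the number of objects where it makes sense:

Merge small structs

As in the composition pattern below, a small object B can be embedded in A by value instead of through a pointer, so new(A) creates only one object. This lowers the object count and therefore the number of objects the GC has to scan.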

package main

type A struct {
    a1 int32
    a2 int32
    B
}

type B struct {
    b1 int32
    b2 int32
}

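Concatenate strings strategically

Concatenating directly with + creates extra temporary objects [performs well for up to 5 elements].

strings.Join() creates fewer temporary objects but has the overhead of building a string slice [when you already have a slice of strings, strings.Join() performs well].

strings.Builder or bytes.Buffer concatenates through a buffer [performs well for more than 5 elements].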
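
As a minimal sketch of the buffer-based approach, the helper below (joinWithBuilder is a hypothetical name, not part of the standard library) computes the final length first and reserves it with strings.Builder.Grow, so the loop appends into memory that is already allocated.

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithBuilder concatenates parts with sep between them, using a
// strings.Builder whose capacity is reserved before the loop.
func joinWithBuilder(parts []string, sep string) string {
	if len(parts) == 0 {
		return ""
	}
	// Total size: all parts plus one separator between each pair.
	size := len(sep) * (len(parts) - 1)
	for _, p := range parts {
		size += len(p)
	}

	var sb strings.Builder
	sb.Grow(size)
	for i, p := range parts {
		if i > 0 {
			sb.WriteString(sep)
		}
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"a", "b", "c", "d", "e", "f"}
	fmt.Println(joinWithBuilder(parts, "-"))
	fmt.Println(strings.Join(parts, "-")) // same result from the standard library
}
```

When the input is already a []string, strings.Join does the same job in one call; the builder is more useful when the pieces are produced one at a time.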

Check bounds up front

// binary.LittleEndian.Uint64()
// https://github.com/golang/go/blob/e9e0d1ef704c4bba3927522be86937164a61100c/src/encoding/binary/binary.go#L77-L77
func (littleEndian) Uint64(b []byte) uint64 {
  _ = b[7] // bounds check hint to compiler; see golang.org/issue/14808
  return uint64(b[0]) | uint64(b[1])<<8 | uint64(b[2])<<16 | uint64(b[3])<<24 |
    uint64(b[4])<<32 | uint64(b[5])<<40 | uint64(b[6])<<48 | uint64(b[7])<<56
}

Benchmark

//BenchmarkBoundLow
//BenchmarkBoundLow-12        868246250             1.34 ns/op           0 B/op           0 allocs/op
//BenchmarkBoundTop
//BenchmarkBoundTop-12        1000000000             0.511 ns/op           0 B/op           0 allocs/op

package demo

import (
    "testing"
)

var list = []int64{0, 1, 2, 3, 4, 5, 6, 7, 8}

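// BenchmarkBoundLow indexes from the lowest element upwards, so an
// earlier access cannot prove that the later ones are in range.
func BenchmarkBoundLow(b *testing.B) {
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = list[0]
        _ = list[1]
        _ = list[2]
        _ = list[3]
        _ = list[4]
        _ = list[5]
        _ = list[6]
        _ = list[7]
        _ = list[8]
    }
}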

func BenchmarkBoundTop(b *testing.B) {
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = list[8]
        _ = list[7]
        _ = list[6]
        _ = list[5]
        _ = list[4]
        _ = list[3]
        _ = list[2]
        _ = list[1]
        _ = list[0]
    }
}

Avoid frequently creating temporary objects

Cache objects with sync.Pool
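
A minimal sketch of reusing temporary buffers through sync.Pool follows. The names bufPool and render are assumptions made for this example. The buffer is reset after Get because a pooled object may still hold data from its previous use.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values. New is only called
// when the pool has nothing to offer.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render builds a greeting in a pooled buffer instead of allocating a
// fresh buffer on every call.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // drop whatever the previous user left behind
	defer bufPool.Put(buf) // return the buffer for reuse

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so reuse is safe
}

func main() {
	fmt.Println(render("gopher"))
	fmt.Println(render("world"))
}
```

Objects in a sync.Pool can be dropped by the runtime at any time, so the pool suits short-lived scratch objects rather than anything that must persist.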

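Reduce long call stacks

A goroutine's call stack is 2 KB by default (in Go 1.7 and later). It uses contiguous stacks, and when the stack runs out of space the Go runtime keeps growing it:

When the stack is too small, it doubles in size; the variables on the old stack are copied directly to the new stack, and pointers to them are updated to the new addresses.
Returning from calls frees stack usage; when the GC finds that less than 1/4 of the stack is in use, it shrinks the stack by half.

For example, if the stack ends up at 2 MB, then in the extreme case it has been grown 10 times, which costs performance.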
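
The figure of 10 follows from repeated doubling: 2 KB × 2^10 = 2 MB. A small sketch of that arithmetic, with doublings as a hypothetical helper name:

```go
package main

import "fmt"

// doublings returns how many times size must be doubled to reach at
// least target. It returns 0 for a non-positive size to avoid looping
// forever.
func doublings(size, target int) int {
	if size <= 0 {
		return 0
	}
	n := 0
	for size < target {
		size *= 2
		n++
	}
	return n
}

func main() {
	const startStack = 2 << 10 // 2 KB initial stack
	const finalStack = 2 << 20 // 2 MB final stack
	fmt.Println(doublings(startStack, finalStack)) // 10
}
```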

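Recommendations:

Keep call stacks and function complexity under control; do not do all the work in a single goroutine.
If a long call stack really is needed, consider pooling goroutines to avoid the stack growth that comes with frequently creating new goroutines.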

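Estimate capacity to reduce growth operations

bytes.Buffer

bytes.Buffer allocates one contiguous block of memory; when using it, you can reserve a sufficiently large capacity up front.

It is worth reading the source deliberately to confirm whether running out of capacity triggers grow, which causes a second allocation and a memory copy.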
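
A minimal sketch of reserving capacity with bytes.Buffer.Grow is shown below. buildPayload is a hypothetical helper; it also returns how much the capacity changed during the writes, which should be zero when the reservation was large enough.

```go
package main

import (
	"bytes"
	"fmt"
)

// buildPayload writes n copies of chunk into a buffer whose capacity is
// reserved before the loop. It returns the result and the capacity
// growth that happened after the reservation. n must not be negative.
func buildPayload(chunk string, n int) (string, int) {
	var buf bytes.Buffer
	buf.Grow(len(chunk) * n) // reserve room for everything we will write
	capBefore := buf.Cap()

	for i := 0; i < n; i++ {
		buf.WriteString(chunk)
	}
	return buf.String(), buf.Cap() - capBefore
}

func main() {
	s, grew := buildPayload("ab", 3)
	fmt.Println(s, grew)
}
```

An alternative with the same effect is bytes.NewBuffer(make([]byte, 0, size)), which hands the buffer a preallocated slice.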

Preallocate slices and maps
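
As an illustration, the sketch below sizes both a slice and a map with make before filling them. The function name squares is an assumption for this example. For the slice, append stays within the reserved capacity; for the map, the size argument is a hint that lets the runtime allocate enough buckets up front.

```go
package main

import "fmt"

// squares returns the squares of 0..n-1 both as a slice and as a map
// keyed by the input value. Both containers are sized before the loop.
func squares(n int) ([]int, map[int]int) {
	s := make([]int, 0, n)    // length 0, capacity n: append never reallocates
	m := make(map[int]int, n) // capacity hint for n entries
	for i := 0; i < n; i++ {
		s = append(s, i*i)
		m[i] = i * i
	}
	return s, m
}

func main() {
	s, m := squares(4)
	fmt.Println(s, len(s), cap(s))
	fmt.Println(m[2])
}
```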

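Reduce unnecessary memory copies

For example, copy data with io.Copy and similar operations instead of allocating an extra buffer as an intermediate stop.

For example, use Readv and Writev to read or write non-contiguous memory in one call, which avoids merging everything into an intermediate buffer first.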
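
Both ideas can be sketched with the standard library. The helpers copyString and writeChunks are hypothetical names. io.Copy uses the source's WriteTo or the destination's ReadFrom when one is available, which skips its own intermediate buffer. net.Buffers writes a list of separate byte slices; on some platforms and connection types the standard library turns this into a batched write such as writev, while for a plain writer like the bytes.Buffer used here it falls back to one Write per slice.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
	"strings"
)

// copyString streams src into a buffer with io.Copy. strings.Reader
// implements io.WriterTo, so io.Copy hands the data over directly
// instead of staging it in a temporary buffer of its own.
func copyString(src string) (string, error) {
	var dst bytes.Buffer
	if _, err := io.Copy(&dst, strings.NewReader(src)); err != nil {
		return "", err
	}
	return dst.String(), nil
}

// writeChunks writes several non-contiguous chunks without first merging
// them into one slice. Note that WriteTo consumes the buffers it writes.
func writeChunks(chunks ...[]byte) (string, error) {
	var dst bytes.Buffer // stand-in for a network connection
	bufs := net.Buffers(chunks)
	if _, err := bufs.WriteTo(&dst); err != nil {
		return "", err
	}
	return dst.String(), nil
}

func main() {
	s, _ := copyString("copied without a staging buffer")
	fmt.Println(s)

	out, _ := writeChunks([]byte("hello, "), []byte("world"))
	fmt.Println(out)
}
```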

Object escape analysis
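
A minimal sketch of what to look for: the two hypothetical constructors below differ only in whether they return a pointer or a value. Building with go build -gcflags=-m prints the compiler's escape decisions; for a function like newPointHeap it typically reports that p is moved to the heap, though the exact output depends on the Go version and on inlining.

```go
package main

import "fmt"

type point struct{ x, y int }

// newPointHeap returns the address of a local variable, so the variable
// has to outlive the call and is a candidate for heap allocation.
func newPointHeap(x, y int) *point {
	p := point{x, y}
	return &p
}

// newPointValue returns the struct by value, so the result can live in
// the caller's stack frame.
func newPointValue(x, y int) point {
	return point{x, y}
}

func main() {
	a := newPointHeap(1, 2)
	b := newPointValue(3, 4)
	fmt.Println(a.x+a.y, b.x+b.y)
}
```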

Concurrency optimization

Use a goroutine pool for high-concurrency task processing
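
A minimal sketch of a fixed-size worker pool, assuming the jobs are independent. runPool is a hypothetical helper, not a library API. A fixed number of goroutines pull job indexes from a channel, so the number of live goroutines stays bounded regardless of how many jobs arrive.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool applies fn to every job using at most `workers` goroutines and
// returns the results in the same order as the jobs.
func runPool(workers int, jobs []int, fn func(int) int) []int {
	if workers < 1 {
		workers = 1
	}
	results := make([]int, len(jobs))
	indexes := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is received by exactly one worker, so no two
			// goroutines write the same results slot.
			for i := range indexes {
				results[i] = fn(jobs[i])
			}
		}()
	}

	for i := range jobs {
		indexes <- i
	}
	close(indexes)
	wg.Wait()
	return results
}

func main() {
	square := func(x int) int { return x * x }
	fmt.Println(runPool(3, []int{1, 2, 3, 4, 5}, square))
}
```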

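Avoid calling synchronous (blocking) system interfaces under high concurrency

Under high concurrency, make locks on shared objects finer-grained, or avoid sharing the objects altogether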
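
One way to make a lock finer-grained is to split the shared object into shards, each with its own mutex. The sketch below is an assumption-laden illustration: shardedCounter, shardCount and incConcurrently are hypothetical names, and 16 shards is an arbitrary choice. Keys that hash to different shards no longer contend for the same lock.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardCount is an arbitrary number of shards chosen for this sketch.
const shardCount = 16

type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// shardedCounter spreads its keys over several independently locked maps.
type shardedCounter struct {
	shards [shardCount]shard
}

func newShardedCounter() *shardedCounter {
	c := &shardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

// shardFor picks the shard for a key from its FNV-1a hash.
func (c *shardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%shardCount]
}

// Inc adds one to key while holding only that key's shard lock.
func (c *shardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

// Get returns the current count for key.
func (c *shardedCounter) Get(key string) int {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

// incConcurrently calls Inc(key) perG times from each of g goroutines.
func incConcurrently(c *shardedCounter, key string, g, perG int) {
	var wg sync.WaitGroup
	for i := 0; i < g; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perG; j++ {
				c.Inc(key)
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := newShardedCounter()
	incConcurrently(c, "requests", 8, 100)
	fmt.Println(c.Get("requests"))
}
```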

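Inlining optimization

At compile time, the Go compiler automatically inlines eligible functions into their callers, which removes call overhead such as passing arguments and pushing and popping stack frames.

When the called function is long, split it so that the frequently hit branch can be inlined into the caller.

For example, this code from sync.Once: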

// https://github.com/golang/go/blob/e9e0d1ef704c4bba3927522be86937164a61100c/src/sync/once.go#L58-L58
func (o *Once) Do(f func()) {
  // Note: Here is an incorrect implementation of Do:
  //
  //    if atomic.CompareAndSwapUint32(&o.done, 0, 1) {
  //        f()
  //    }
  //
  // Do guarantees that when it returns, f has finished.
  // This implementation would not implement that guarantee:
  // given two simultaneous calls, the winner of the cas would
  // call f, and the second would return immediately, without
  // waiting for the first's call to f to complete.
  // This is why the slow path falls back to a mutex, and why
  // the atomic.StoreUint32 must be delayed until after f returns.

  if atomic.LoadUint32(&o.done) == 0 {
    // Outlined slow-path to allow inlining of the fast-path.
    o.doSlow(f)
  }
}

func (o *Once) doSlow(f func()) {
  o.m.Lock()
  defer o.m.Unlock()
  if o.done == 0 {
    defer atomic.StoreUint32(&o.done, 1)
    f()
  }
}

Use bit operations instead of branches
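
A small sketch of the idea, using absolute value as the example. absBranch and absBits are hypothetical names. The bit version builds a mask from the sign bit and uses it in place of the if statement. The compiler may already generate branch-free code for simple conditionals, so any gain should be confirmed with a benchmark.

```go
package main

import "fmt"

// absBranch computes |x| with an ordinary conditional.
func absBranch(x int64) int64 {
	if x < 0 {
		return -x
	}
	return x
}

// absBits computes |x| without a conditional. The arithmetic shift
// yields 0 for x >= 0 and -1 (all bits set) for x < 0. Like absBranch,
// it overflows for the minimum int64 value.
func absBits(x int64) int64 {
	mask := x >> 63
	return (x ^ mask) - mask
}

func main() {
	for _, x := range []int64{-5, 0, 7} {
		fmt.Println(absBranch(x), absBits(x))
	}
}
```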

Last updated 3 years ago
