Slow performance populating tree #48

phemmer · 2023-02-27T05:53:39Z

I was looking at various modules that can provide a fast lookup for whether an IP is in a list, and while testing this module, it turned out to have significantly slower load time compared to the other alternatives I looked at. The lookup is indeed faster, but not enough to justify the increased load time, which for me is prohibitively slow (20x slower than the fastest alternative, and takes over a minute to load a list of about 5 million IPs). I just wanted to create this issue so you're aware, and in case there's something which can be done to increase performance.

Benchmark code

package main

import (
	"encoding/binary"
	"math/rand"
	"net"
	"strconv"
	"testing"

	"github.com/asergeyev/nradix"
	"github.com/infobloxopen/go-trees/iptree"
	"github.com/yl2chen/cidranger"
)


var rng = rand.New(rand.NewSource(0))

func randIP(bits int) net.IP {
	var ipa [4]byte
	ip := net.IP(ipa[:])
	binary.BigEndian.PutUint32(ip[:], uint32(rng.Intn(1<<bits)<<(32-bits)))
	return ip
}

var IPs []string
func init() {
	for len(IPs) < 100000 {
		ip := randIP(24)
		mask := strconv.Itoa(rand.Intn(25) + 8)
		IPs = append(IPs, ip.String() + "/" + mask)
	}
}

func BenchmarkRangerLoad(b *testing.B) {
	for n := 0; n < b.N; n++ {
		ranger := cidranger.NewPCTrieRanger()
		for _, ipStr := range IPs {
			_, ipnet, _ := net.ParseCIDR(ipStr)
			ranger.Insert(cidranger.NewBasicRangerEntry(*ipnet))
		}
	}
}

func BenchmarkInfobloxLoad(b *testing.B) {
	for n := 0; n < b.N; n++ {
		ipt := iptree.NewTree()
		for _, ipStr := range IPs {
			_, ipnet, _ := net.ParseCIDR(ipStr)
			ipt.InplaceInsertNet(ipnet, nil)
		}
	}
}

func BenchmarkNRadixLoad(b *testing.B) {
	for n := 0; n < b.N; n++ {
		tree := nradix.NewTree(0)
		for _, ipStr := range IPs {
			tree.AddCIDR(ipStr, nil)
		}
	}
}

func BenchmarkRangerRead(b *testing.B) {
	b.StopTimer()
	ranger := cidranger.NewPCTrieRanger()
	for _, ipStr := range IPs {
		_, ipnet, _ := net.ParseCIDR(ipStr)
		ranger.Insert(cidranger.NewBasicRangerEntry(*ipnet))
	}
	b.StartTimer()

	for n := 0; n < b.N; n++ {
		ranger.Contains(randIP(32))
	}
}

func BenchmarkInfobloxRead(b *testing.B) {
	b.StopTimer()
	ipt := iptree.NewTree()
	for _, ipStr := range IPs {
		_, ipnet, _ := net.ParseCIDR(ipStr)
		ipt.InplaceInsertNet(ipnet, nil)
	}
	b.StartTimer()

	for n := 0; n < b.N; n++ {
		ipt.GetByIP(randIP(32))
	}
}

func BenchmarkNRadixRead(b *testing.B) {
	b.StopTimer()
	tree := nradix.NewTree(0)
	for _, ipStr := range IPs {
		tree.AddCIDR(ipStr, nil)
	}
	b.StartTimer()

	for n := 0; n < b.N; n++ {
		ip := randIP(32)
		tree.FindCIDR(ip.String() + "/32")
	}
}

BenchmarkRangerLoad-16      	       2	 797361462 ns/op
BenchmarkInfobloxLoad-16    	      14	  88521051 ns/op
BenchmarkNRadixLoad-16      	      27	  40973037 ns/op
BenchmarkRangerRead-16      	 8325171	       156.6 ns/op
BenchmarkInfobloxRead-16    	 2999112	       450.3 ns/op
BenchmarkNRadixRead-16      	 3116860	       362.2 ns/op

Edit: Update: In the end I ended up with a private fork that strips out a ton of code from cidranger to get the performance up. Mostly I removed support for all the different address types and used only netip types, as well as adding a loader which uses the last insert point as the starting point to search for a location for the new node. This got load times over 20x faster, and lookups 1.5x faster.
https://gist.github.com/phemmer/6231b12d5207ea93a1690ddc44a2c811

The text was updated successfully, but these errors were encountered:

AmitThakkar · 2024-03-06T10:25:39Z

@phemmer What are the other alternatives you found?

phemmer · 2024-03-07T17:34:44Z

They're listed in the benchmark code on the issue description.

ldkingvivi · 2024-03-23T19:52:23Z

@phemmer Find func may not seem do what you want, it doesn't return most specific result. Try two entry in trie like 8.8.8.0/24 and 8.8.8.0/25 each with some value, and Find "8.8.8.8", it will show up /24 value instead of /25 value.

I think the original fn just Contain, which return bool and doesn't matter if that's more specific or not, and your Find fn try to return most specific entry(smallest prefix), which require follow the children path

maybe change the find fn to

func (p *Trie) find(number netip.Addr) any {
	if !netContains(p.network, number) {
		return nil
	}

	if p.network.Bits() == 128 {
		return nil
	}
	bit := p.discriminatorBitFromIP(number)
	child := p.children[bit]
	if child != nil {
           if r := child.find(number); r != nil{
		return r
           }
        }
    
        if p.value != nil {
		return p.value
	}

	return nil
}

Use netip from yl2chen#48

gaby · 2024-03-24T21:51:31Z

@phemmer I'm guessing your gist is under MIT License, since it's based on cidranger. I can develop/publish a go module with this code, tests, benchmarks, docs and give you credit in the README. Let me know if that is fine with you.

gaby · 2024-03-25T03:56:09Z

@ldkingvivi Seems like you integrated those changes already and added even more functionality. The Contains() bool function is very useful!

phemmer · 2024-03-25T14:03:53Z

Yes, you may do what you want with it.

Also as a side note, I'd encourage discussion on the gist itself, as it's linked to from places other than here.

I also updated it to address the issue regarding find(). Solution was basically the same as what @ldkingvivi proposed.
I also included another fix when matching against v4 /32 or v6 /128 addresses.

phemmer · 2024-03-26T14:01:57Z

Since there seems to be moderate interest in my implementation, I've published it as a full package here: https://github.com/phemmer/go-iptrie/

gaby · 2024-03-26T14:06:06Z

@phemmer Awesome, from my local benchmarks it seems the changes made by @ldkingvivi make it even faster.

See here: https://github.com/ldkingvivi/cidranger/blob/master/iptire/trie.go

ldkingvivi · 2024-03-26T17:46:19Z

@phemmer my change was mainly adding Entry with generic, so containing and cover networking will all return something instead of pure prefix, and of course some tests and benchmark using existing way how cidrange tested/benchmakred. Do you want me create a PR to your repo? I will likely have time during weekend.

ldkingvivi@d5fcd23

phemmer · 2024-03-26T18:24:45Z

@phemmer my change was mainly adding Entry with generic

I had considered using generics, but stuck with any types as the usage is simpler (in some scenarios). I think using generics is a valid design that some might prefer, but I think it should likely be a new package entirely. It's possible one might be able to implement a design using any types on top of a generics implementation. But I didn't feel the effort was justified. If you do have a solid design that still allows both to be used, then it might be worth considering.

containing and cover networking will all return something instead of pure prefix

The Contains method stops searching on the first (largest) network it finds. Thus it's not going to be have access to the data from the most specific network (largest prefix) entry. If you do want the data from the largest network that matches, you can use the FindLargest method instead. I don't want to change Contains to be anything other than a boolean.

Having CoveredNetworks return the data, or something which allows obtaining both the data and the network might be valid. However I feel like a good design that allows retrieving the data along with the network itself would be rather complicated. Though I could be wrong, as I may not have considered all possibilities. I'm instead considering ripping out CoveredNetworks entirely. It's there because it was in cidranger. However I don't know of any valid use cases for it (if you have one, I'd love to hear it). So rather than expand its capability, I'm more inclined to remove it until someone actually needs it. I would rather provide minimal functionality, and expand it in the future, than have to maintain useless functionality in perpetuity.

and of course some tests and benchmark using existing way how cidrange tested/benchmakred

I omitted tests from my gist because gists are meant to be concise, not because they didn't exist. They exist, and were published in the package.

As far as benchmarks, the package also contains a benchmarking suite. If you don't feel the benchmarks are meaningful, I'd be willing to consider additional scenarios. I do know that the random networks aren't likely representative of real-world scenarios, as it creates too many networks which are too large. I've considered solving that using a curve to make smaller prefixes more rare. But I don't think it'd change the overall results significantly, so I haven't bothered.

Do you want me create a PR to your repo? I will likely have time during weekend.

If you like, though I'd recommend proposing specific changes for discussion first. The proposals should be in that repo as well, as this one should be for discussions about cidranger only, and I feel we've hijacked it enough.

ldkingvivi · 2024-03-26T20:35:13Z

I didn't referring to Contains, but ContainingNetworks. Also for the comments about tests and benchmark, I just simply mentioned the fact what I added, nothing criticize your gist. At this point, I will just continue use my branch directly, since I do have use case to fetch data other than just prefix, as well as generic for my use case.

One thing I agree, we probably hijacked this place enough, I will stop doing that now.

ldkingvivi added a commit to ldkingvivi/cidranger that referenced this issue Mar 23, 2024

Merge pull request #1 from ldkingvivi/use_netip

d5fcd23

Use netip from yl2chen#48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow performance populating tree #48

Slow performance populating tree #48

phemmer commented Feb 27, 2023 •

edited

AmitThakkar commented Mar 6, 2024

phemmer commented Mar 7, 2024

ldkingvivi commented Mar 23, 2024 •

edited

gaby commented Mar 24, 2024

gaby commented Mar 25, 2024

phemmer commented Mar 25, 2024

phemmer commented Mar 26, 2024 •

edited

gaby commented Mar 26, 2024

ldkingvivi commented Mar 26, 2024 •

edited

phemmer commented Mar 26, 2024 •

edited

ldkingvivi commented Mar 26, 2024 •

edited

Slow performance populating tree #48

Slow performance populating tree #48

Comments

phemmer commented Feb 27, 2023 • edited

AmitThakkar commented Mar 6, 2024

phemmer commented Mar 7, 2024

ldkingvivi commented Mar 23, 2024 • edited

gaby commented Mar 24, 2024

gaby commented Mar 25, 2024

phemmer commented Mar 25, 2024

phemmer commented Mar 26, 2024 • edited

gaby commented Mar 26, 2024

ldkingvivi commented Mar 26, 2024 • edited

phemmer commented Mar 26, 2024 • edited

ldkingvivi commented Mar 26, 2024 • edited

phemmer commented Feb 27, 2023 •

edited

ldkingvivi commented Mar 23, 2024 •

edited

phemmer commented Mar 26, 2024 •

edited

ldkingvivi commented Mar 26, 2024 •

edited

phemmer commented Mar 26, 2024 •

edited

ldkingvivi commented Mar 26, 2024 •

edited