How bloom filters made SQLite 10x faster (avi.im)

submitted 5 days ago by rimu@piefed.social to c/programming@programming.dev

12 comments fedilink hide all child comments

top 12 comments

sorted by: hot top controversial new old

[-] 0x0@programming.dev 12 points 4 days ago* (last edited 4 days ago)

The results of this research have been applied to SQLite already and were released in v3.38.0.

tl;dr: Bloom filters were great because: minimal memory overhead, goes well with SQLite’s simple implementation, and worked within existing query engine.

[-] mox@lemmy.sdf.org 13 points 5 days ago

Bloom filters can be handy. They're worth learning even if you don't care about SQLite.

[-] livingcoder@programming.dev 2 points 3 days ago

Do you happen to know of a few situations where bloom filters are super useful? I need to identify when to use them.

[-] mox@lemmy.sdf.org 3 points 3 days ago

Here's a clever use of them: https://blog.mozilla.org/security/2020/01/09/crlite-part-2-end-to-end-design/

[-] slazer2au@lemmy.world 12 points 5 days ago

Neat. Most of that went over my head but always good to see more performance out of existing tech.

[-] magic_lobster_party@fedia.io 13 points 4 days ago

Basically it’s just an optimization of a double nested for loop. It’s a way to avoid running the inner for loop when it is known there will be no hit.

This is useful when we for example want to find all product orders of customers in a particular country. The way we can do this is to first filter all customers by their country, and then match orders by the remaining customers. The matching step is the double for loop.

Something like this:

for order in orders:
    for customer in customers_in_country:
        if order.customer_id == customer.id:
            …

Many orders won’t match a customer in the above query, so we want to single out these orders before we run the expensive inner for loop. The way they do it is to create a cache using a Bloom filter. I’d recommend looking it up, but it’s a probabilistic cache that’s fast and space efficient, at the cost of letting through some false positives. With this particular use case it’s ok to have some false positives. The worst thing that can happen is that the inner for loop is run more times than necessary.

The final code is something like this:

bloom_filter = create_bloom(customers_in_country)
for order in orders:
    if bloom_filter.contains(order.customer_id):
        for customer in customers_in_country:
            if order.customer_id == customer.id:
                …

Edit: this comment probably contain many inaccuracies, as I’ve never done this kind of stuff in practice, so don’t rely too much on it.

[-] ShawiniganHandshake@sh.itjust.works 5 points 4 days ago

I haven't read the article but I work with Bloom filters at work sometimes.

Bloom filters basically tell you "this thing might be present" or "this thing is definitely not present".

If you're looking for a piece of data in a set of large files, being able to say "this data is definitely not in this file" saves you a bunch of time because you can skip over the file instead of searching through the whole thing just to figure out what you're looking for isn't there.

[-] mac@lemm.ee 5 points 4 days ago

We use them for our password blacklisting. Crazy fast even with millions of entries

this post was submitted on 22 Dec 2024

79 points (98.8% liked)

Programming

17680 readers

123 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Follow the programming.dev instance rules
Keep content related to programming in some way
If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities !webdev@programming.dev

founded 2 years ago

MODERATORS

snowe@programming.dev

Ategon@programming.dev

MaungaHikoi@lemmy.nz