What is the Web Discovery Project?
The Web Discovery Project is a privacy-preserving way for you to contribute to the growth and independence of Brave Search. If you opt in, you’ll contribute some anonymous data about searches and web page visits made within the Brave Browser (including pages arrived at via some, but not all, other search engines). This data helps build the Brave Search independent index, and ensure we show results relevant to your search queries. By “data” we mean search queries, search result clicks, the URLs of pages visited in the browser, time spent on those pages, and some metadata about the pages themselves.
The Web Discovery Project runs in the background, so it requires no effort on the part of contributors. Data contributed cannot be linked back to whoever contributed it, or grouped together, which prevents deanonymization attempts. Opt out at any time.
Why we built the Web Discovery Project
Providing relevant search results is essential to building a search engine people want to use. It’s how we create a private search engine that still competes with big tech on quality and completeness. To ensure search results are as relevant as possible, Brave needs to understand some key things, including:
- How closely search results match the search keywords (matching to exact words, parts of words, or synonyms)
- How recent searches are for those keywords
- How often a search result is clicked for a given keyword
- How popular search keywords are
- What pages are popular or novel
- Which sites only allow crawling by the Google search bot
Ensuring relevance also means reducing the “noise” from web content that makes a search less relevant. For example, if you search for “Europe weather” and see results relating to European history or European business, you would say the results are less relevant to your query. Learning through the Web Discovery Project enables Brave Search to filter this noise out, but in a privacy-preserving way. Making search more relevant shouldn’t come at the expense of your online privacy.
Most search providers, like Google and Microsoft, collect data about your search behavior, both in the search engine and the browser (like Chrome or Edge). This data includes your queries, what search results you click, the URLs of the pages you visit, time spent on those pages, and metadata (such as page title, content-type, etc.) about the pages themselves. Other, non-independent search engines (like DuckDuckGo) don’t necessarily collect data themselves. But they still rely on this kind of collection via their dependence on other big tech indexes (like Bing). And this data can be, and often is, associated with you personally.
Search providers collect this kind of data to continuously grow their indexes (the lists of billions of web pages they draw from to deliver results) and to ensure results are relevant and never stale. This collection isn’t inherently bad. But its shortcomings become apparent when you look at Brave’s alternative approach:
- The Web Discovery Project allows you to contribute anonymous, generalized data.
- The Web Discovery Project is designed to prevent us from associating this data with you. This means there’s no data for Brave to sell to advertisers, or lose to theft or hacking, allowing us to back our privacy promise with technology rather than words.
- Brave’s Web Discovery Project is opt-in only, and totally transparent.
The protection of unlinkability
Brave doesn’t follow the sneaky practices of big tech search engines. The Web Discovery Project is opt-in, and the data collected under the Web Discovery Project has specific protections to ensure anonymity. In addition to these protections, the Web Discovery Project adheres to the principle of “unlinkability.” This means we do not link data to you, your browser, or your device. Brave Search has no concept of a user or session ID, which prevents record linkability. Further, the Web Discovery Project includes multiple protections to prevent websites or searches specific to you, or that include personal or sensitive information, from inclusion.
What keywords are being searched most often? What websites do those keywords lead to? How are those websites interacted with? These kinds of directional questions help Brave Search navigate the world of available web pages, and separate signal from noise. And this, in turn, helps us understand the parts of the web worth indexing for users.
If you opt in to the Web Discovery Project, your browser will process the following data on your device, and securely send it to Brave’s servers:
- A fraction of the addresses (URLs) of the web pages visited in the Brave Browser, along with engagement metrics (how much time is spent on the page)
- A fraction of the queries (e.g. “New York weather today”) conducted in some search engines (outside of Brave Search) within the Brave Browser, along with the associated click on a result (if any)
- Metadata of those visited pages (e.g. if the page contains a video, info about page author or owner, page title, etc.), never the content of the page itself.
- For a complete list, check out Brave’s GitHub repo
With this data, Brave can learn (in a private, unlinkable way) things like how many visits to a website (e.g. Wikipedia) lasted longer than 20 seconds, or how many times a given query (e.g. “What is Wikipedia?”) led a user to click through to that website. This calibrates Brave Search to know a website is legitimate, and that users find the content valuable. This, in turn, allows the search engine to understand result relevance, and to serve pages with higher relevance at the top of search results.
This data does not allow Brave to know things like associated queries (e.g. other queries conducted by people who searched “What is Wikipedia?”) or the other websites visited. And it of course tells us nothing that would allow us to link the data to an individual or their device.
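To make this distinction concrete, here is a minimal Python sketch of the kind of aggregate counting that unlinkable records allow. The record fields (`url`, `seconds`, `query`, `clicked_url`) and the function names are hypothetical, chosen only for illustration; the actual message format is documented in Brave’s GitHub repo. Each record stands alone, with no user or session ID, so per-site and per-query counts are possible, but nothing ties two records to the same person.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_long_visits(page_records, min_seconds=20):
    """Count visits per site that lasted longer than `min_seconds`.

    Each record is a standalone dict with hypothetical keys "url" and
    "seconds". There is no user or session field to group records by.
    """
    counts = Counter()
    for record in page_records:
        if record["seconds"] > min_seconds:
            counts[urlsplit(record["url"]).hostname] += 1
    return counts


def count_query_clicks(query_records):
    """Count how often a query led to a click on a given site.

    Records without a click (hypothetical key "clicked_url" missing or None)
    are skipped.
    """
    return Counter(
        (record["query"], urlsplit(record["clicked_url"]).hostname)
        for record in query_records
        if record.get("clicked_url")
    )
```

Note what the sketch cannot do: with no shared identifier across records, there is no way to ask which other queries or pages came from the person who searched “What is Wikipedia?”.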
By default, all users are opted out of the Web Discovery Project. If you’ve chosen to opt in, you can opt out again at any time. Whatever you choose, opt in or opt out, your experience in Brave or Brave Search will not change. To opt out, open a new tab in the Brave browser and click Settings. Scroll to “Web Discovery Project,” and toggle this setting off.
The Web Discovery Project is lightweight and runs only in the background. There should be no noticeable impact on browsing speed, page-rendering speed, or other similar metrics. However, there may be some small (but likely unnoticeable) overhead in the form of extra CPU and bandwidth consumed. Note that the Web Discovery Project only runs on desktop devices, so there is no impact on mobile data plans. If you notice performance issues, please notify us immediately.
All URLs sent must be publicly available—that is, they must have the same content regardless of who is contributing them. This can only be true if the pages are not behind a log-in, individual session, or other authentication. All URLs sent must have been visited by at least 20 different people, which establishes a distributed quorum similar to k-anonymity.
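The quorum rule can be illustrated with a short Python sketch. This is a deliberately simplified, centralized toy model, not Brave’s protocol: in the real system the check is distributed so that no one learns who contributed what. The `contributor_token` values and the function name here are hypothetical and exist only so the example can count distinct contributors.

```python
from collections import defaultdict

QUORUM = 20  # threshold stated in the text: at least 20 different people


def urls_meeting_quorum(reports, quorum=QUORUM):
    """Return the set of URLs reported by at least `quorum` distinct contributors.

    `reports` is an iterable of (contributor_token, url) pairs. The token is a
    hypothetical stand-in used only to count distinct contributors in this toy
    model; repeated reports from the same token count once.
    """
    contributors_by_url = defaultdict(set)
    for token, url in reports:
        contributors_by_url[url].add(token)
    return {
        url
        for url, tokens in contributors_by_url.items()
        if len(tokens) >= quorum
    }
```

A URL visited many times by one person never passes, because the count is over distinct contributors rather than total visits.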
Additionally, a variety of heuristics are applied to rule out URLs that encode access, i.e. capability URLs (such as shared docs, Dropbox links, invoice links, etc.). By design, none of these URLs are sent. And, even if they somehow were, the record-unlinkability protocol means no one with access to the data could recover other URLs from the same origin, or associate any data with anyone.
The above protections also apply to search queries. Any query containing what appears to be personal data, such as emails, phone numbers, or hashes, is automatically discarded rather than sent.
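As a rough illustration of what such a heuristic might look like, the Python sketch below flags URLs that carry embedded credentials, access-style query parameters, or long random-looking path segments. The parameter names, the 20-character threshold, and the function name are assumptions made for this example; Brave’s actual rules live in the source code and are not reproduced here.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of query parameter names that often carry access tokens.
SUSPECT_PARAMS = {"token", "key", "sig", "signature", "auth", "session", "sid"}

# A path segment of 20+ URL-safe characters (threshold is an assumption).
LONG_SEGMENT = re.compile(r"[A-Za-z0-9_-]{20,}")


def looks_like_capability_url(url):
    """Return True if the URL appears to grant access by itself."""
    parts = urlsplit(url)
    # Credentials embedded in the URL (https://user:pass@host/...).
    if parts.username or parts.password:
        return True
    # Query parameters whose names suggest an access token.
    for name, _value in parse_qsl(parts.query, keep_blank_values=True):
        if name.lower() in SUSPECT_PARAMS:
            return True
    # Long path segments mixing letters and digits, typical of random IDs.
    for segment in parts.path.split("/"):
        if (
            LONG_SEGMENT.fullmatch(segment)
            and any(c.isdigit() for c in segment)
            and any(c.isalpha() for c in segment)
        ):
            return True
    return False
```

A check like this errs on the side of false positives: an ordinary page with a long slug containing digits would also be dropped, which costs a little coverage but never privacy.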
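A minimal Python sketch of this kind of query filtering is shown below. The regular expressions are illustrative assumptions, intentionally broad, and are not Brave’s actual patterns; the function names are hypothetical as well.

```python
import re

# Illustrative patterns only; deliberately loose so that borderline
# queries are discarded rather than sent.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"(?:\d[\s().-]?){7,}")     # 7+ digits, optional separators
HASH_LIKE = re.compile(r"\b[0-9a-fA-F]{32,}\b")  # 32+ hex chars (MD5 length and up)


def looks_sensitive(query):
    """Return True if the query appears to contain personal data."""
    return any(
        pattern.search(query) for pattern in (EMAIL, PHONE, HASH_LIKE)
    )


def queries_safe_to_send(queries):
    """Keep only the queries that do not look sensitive."""
    return [q for q in queries if not looks_sensitive(q)]
```

The filter runs on the device, before anything is sent, so a discarded query never leaves the browser.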
- An overview of the Web Discovery Project can be found on Brave’s GitHub repo.
- Read the top-level README.
- View the source code.
If you spot a potential problem, please create an issue on the repo, or contact us.