submitted 1 month ago* (last edited 1 month ago) by Fijxu@programming.dev to c/privacy@lemmy.ml

This is not a long post, but I wanted to post this somewhere. It may be useful if someone is writing an article about Google or something like that.

While I was changing some things in my server configuration, a user accessed a public folder on my site. I was watching the access logs at the time, and everything was completely normal until, 10 SECONDS AFTER the user's request, a request from a Google IP address with the user-agent Googlebot/2.1; +http://www.google.com/bot.html hit the same public folder. Then I noticed that the user-agent of the user who had accessed that folder was Chrome/131.0.0.0.

I have a subdomain, and some folders on that subdomain are actually indexed by the Google search engine, but that specific public folder doesn't appear to be indexed at all and doesn't show up in searches.

Could it be that Google uses Google Chrome users to discover unindexed paths on the internet and add them to its index?

I know it doesn't sound very shocking, because most people here know that Google Chrome is a privacy nightmare and should be avoided at all times, but I have never seen this type of behavior mentioned in articles about "why you should avoid Google Chrome" or similar.

I'm not against anyone scraping the page either, since it's public anyway, but the fact that they discover new pages on the internet by making use of Google Chrome surprised me a little.

Edit: Fixed a typo

[-] solrize@lemmy.world 25 points 1 month ago* (last edited 1 month ago)

I had some private pages a while back that linked to unrelated pages on other sites. I had to go somewhat crazy to stop the private urls from leaking to the external sites through referer headers when my users clicked on the links.

If Chrome is sending people's browsing histories to Google, that is invasive.

[-] dysprosium@lemmy.dbzer0.com 5 points 1 month ago

So how did you stop the referer header from doing that? I'd imagine it to be a clear, simple command, since it ought to be. Or was it not that straightforward?

[-] solrize@lemmy.world 5 points 1 month ago

It's easier now that there are some control headers for it. At the time I tried a lot of things, like bouncing through JavaScript that opened a new window. Results varied by browser. The simplest way was to inconvenience users a bit by supplying text URLs for them to paste into the nav bar, instead of clickable links.
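
For readers wondering what such a control header looks like: the standard one today is Referrer-Policy, and a value of no-referrer asks the browser to omit the Referer header on navigations away from the page. Below is a minimal sketch, assuming the private pages are served by a Python WSGI app. The names add_referrer_policy and private_page are made up for illustration, and this is not necessarily what the commenter used.

```python
from wsgiref.simple_server import make_server


def add_referrer_policy(app, policy="no-referrer"):
    """Wrap a WSGI app so every response carries a Referrer-Policy header.

    Hypothetical helper for illustration, not part of any library.
    """
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing policy so the header is not duplicated.
            headers = [h for h in headers if h[0].lower() != "referrer-policy"]
            headers.append(("Referrer-Policy", policy))
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def private_page(environ, start_response):
    """Stand-in for a private page that links to an unrelated external site."""
    body = b'<a href="https://example.com/">external link</a>'
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]


if __name__ == "__main__":
    # Assumed local address and port, for a quick manual check only.
    with make_server("127.0.0.1", 8000, add_referrer_policy(private_page)) as srv:
        srv.serve_forever()
```

For a single link rather than a whole site, adding rel="noreferrer" to the anchor tag has a similar effect.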

[-] chevy9294@monero.town 14 points 1 month ago

100% if you have enabled "Safe browsing" (which is enabled by default). This also applies to Firefox, but I don't know whether it is enabled by default there.

[-] Fijxu@programming.dev 7 points 1 month ago

That makes perfect sense, since Google Chrome has Safe Browsing enabled by default and most people don't bother changing their settings.

[-] bamboo@lemmy.blahaj.zone 11 points 1 month ago

Do any of the pages in the directory link to other websites? If you link to a website that uses Google Analytics, it may see the referrer header when the person using Chrome opens the link. If it knew that your site didn't link to the third-party site before, maybe that triggered a refresh.

You could test this by making a page that links to CNN or another site that uses Google Analytics, then using Firefox (without anything that would block Google Analytics) to click the link from your site to the other site. If Googlebot checks your site within 10 seconds, then you could rule out Chrome as the culprit.
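
When running a test like this, the access log can be scanned for the pattern automatically. The following is a rough sketch, assuming the server writes the common "combined" log format. The function names, the log file argument, and the 10-second window are illustrative choices, and matching on the user-agent string alone does not prove a request really came from Google.

```python
import re
import sys
from datetime import datetime, timedelta

# Combined log format: ip - user [time] "METHOD path proto" status size "referer" "ua"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Return (timestamp, path, user agent) or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return ts, m["path"], m["ua"]


def googlebot_followups(lines, window_seconds=10):
    """Find Googlebot hits that follow another client's hit on the same path.

    Returns a list of (path, earlier user agent, delay in seconds).
    Assumes the log lines are in chronological order.
    """
    window = timedelta(seconds=window_seconds)
    last_visit = {}  # path -> (timestamp, user agent) of latest non-Googlebot hit
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, path, ua = parsed
        if "Googlebot" in ua:
            prev = last_visit.get(path)
            if prev is not None and timedelta(0) <= ts - prev[0] <= window:
                hits.append((path, prev[1], (ts - prev[0]).total_seconds()))
        else:
            last_visit[path] = (ts, ua)
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for path, ua, delay in googlebot_followups(fh):
            print(f"{path}: Googlebot {delay:.0f}s after {ua}")
```

If the Firefox test produces a hit here, the browser is less likely to be the trigger. If only Chrome visits do, that points back at the browser.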

[-] Fijxu@programming.dev 4 points 1 month ago

Nope, it's just a file indexer that I host publicly. I don't mind sharing the URL to provide more context.

The user accessed https://luna.nadeko.net/Movies/Ch3k0p3t3/ with Google Chrome

And 10 seconds later, Googlebot scraped the folder.

Simple as that. I don't have privacy-invasive trackers on any of my webpages/services.

[-] TheOctonaut@mander.xyz 4 points 1 month ago

Are you using Google's DNS?

[-] pupbiru@aussie.zone 6 points 1 month ago

DNS will only leak domains (and subdomains), not paths.
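
To illustrate with the URL from this thread: a DNS lookup only ever asks about the hostname, while the path is sent later inside the HTTPS request to the web server. A small sketch (split_for_dns is a made-up helper name):

```python
from urllib.parse import urlsplit


def split_for_dns(url):
    """Return (hostname a DNS resolver is asked about, part only the web server sees)."""
    parts = urlsplit(url)
    rest = parts.path + (f"?{parts.query}" if parts.query else "")
    return parts.hostname, rest


print(split_for_dns("https://luna.nadeko.net/Movies/Ch3k0p3t3/"))
# prints ('luna.nadeko.net', '/Movies/Ch3k0p3t3/')
```

So a resolver operator could learn that someone looked up luna.nadeko.net, but not which folder was requested.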

[-] Fijxu@programming.dev 2 points 1 month ago

DNS doesn't matter at all in this case.

this post was submitted on 28 Nov 2024
49 points (93.0% liked)
