Hackers, Scrapers & Fakers: What’s Really Inside the Latest LinkedIn Dataset

By admin
Edit (

1 day later
DATE

): After posting this, the party responsible for leaking the data turned around and said "that was only a small part of it, here’s the whole thing", and released records encompassing a further

14
CARDINAL

M records. I’ve added those into

HIBP
ORG

and will shortly be re-sending notifications to people monitoring domains as the count of impacted addresses will likely have changed. Everything else about the subsequent dataset is consistent with what you’ll read below in terms of structure, patterns and conclusions.

I like to think of investigating data breaches as a sort of scientific search for truth. You start out with a theory (a set of data coming from an alleged source), but you don’t have a vested interested in whether the claim is true or not, rather you follow the evidence and see where it leads. Verification that supports the alleged source is usually quite straightforward, but disproving a claim can be a rather time consuming exercise, especially when a dataset contains fragments of truth mixed in with data that is anything but. Which is what we have here

today
DATE

.

To lead with the conclusion and save you reading all the details if you’re not inclined, the dataset so many people flagged me

this week
DATE

titled "

Linkedin Database 2023
WORK_OF_ART

2.5 Millions" turned out to be a combination of publicly available

LinkedIn
ORG

profile data and

5.8
CARDINAL

M email addresses mostly fabricated from a combination of

first
ORDINAL

and last name. It all began with this tweet:

All good lies are believable at face value; is it feasible a massive corpus of

LinkedIn
ORG

data is floating around? Well, they were proper breached in

2012
DATE

to the tune of

164
CARDINAL

M records (by which I mean that incident was genuinely internal data such as email addresses and passwords extracted out by a vulnerability), then they were massively scraped in

2021
DATE

with another

126
CARDINAL

M records going into Have I Been Pwned (

HIBP
ORG

). So, when you see a claim like the one above, it seems highly feasible at face value which is what many people take it at. But I’m a bit more suspicious than most people 🙂


First
ORDINAL

, the claim:

This one is similar to my twitter data scrapped [sic] but for linkedin plus 2023

Now, there’s a whole debate about whether scraped data is breached data and indeed whether the definition of it even matters. With the rising prevalence of scraped data, this topic came up enough that I wrote a dedicated blog post about it

a couple of years ago
DATE

and concluded the following in terms of how we should define the term "breach":

A data breach occurs when information is obtained by an unauthorised party in a fashion in which it was not intended to be made available

Which makes scrapes like this alleged one a breach. If indeed it was accurate,

LinkedIn
ORG

data had been taken and redistributed in a way it was never intended to be by either the service itself or the individuals whose data was in this corpus. So, it’s something to take seriously, and that warranted further investigation.

I scrolled through the

10M+
CARDINAL

rows of data (many records spanned multiple rows due to line returns), and my eyes fell on a fellow

Aussie
ORG

who for the purposes of this exercise we’ll call "EM", being the initials of her

first
ORDINAL

and last name. Whilst the data I’m going to refer to is either public by design or fabricated, I don’t want to use a real person as an example without their consent so let’s just play it safe. Here’s a fragment of

EM
ORG

‘s record:

There are

5
CARDINAL

noteworthy parts of this I that immediately caught my attention:

There are

5
CARDINAL

different email addresses here with the alias for each one represented in "[

first
ORDINAL

name].[last name]@" form. These exist in a column titled "

PROFILE_USERNAMES
WORK_OF_ART

". (Incidentally, this is why the headline of

2.5
CARDINAL

M accounts expands out to

5.8
CARDINAL

M email addresses as there are often multiple addresses per account.) There’s a

LinkedIn
ORG

profile ID in the form of "[

first
ORDINAL

name]-[last name]-[random hexadecimal chars]" under a column titled "

PROFILE_LINKEDIN_ID
ORG

". That successfully loaded

EM
ORG

‘s legitimate profile at https://www.linkedin.com/in/[id]/ The numeric value in the "

PROFILE_LINKEDIN_MEMBER_ID
ORG

" column matched with the value on

EM
ORG

‘s profile from the previous point. The

2
CARDINAL

dates starting with "

2020-
DATE

" are in columns titled "PROFILE_FETCHED_AT" and "PROFILE_LINKEDIN_FETCHED_AT". I assume these are self-explanatory.

EM
ORG

‘s

first
ORDINAL

and last name, precisely as it appears in each of her

5
CARDINAL

email addresses.

On its own, this record would be unremarkable. It’d be entirely feasible – this could very well be legit – except when you keep looking through the remainder of the data. A pattern quickly emerged and I’m going to bold it here because it’s the smoking gun that ultimately indicates that a bunch of this data is fake:

Every single record with multiple email addresses had exactly the same alias on completely unrelated domains and it was almost always in the form of "[

first
ORDINAL

name].[last name]@".

Representing email addresses in this fashion is certainly common, but it’s far from ubiquitous, and that’s easy to demonstrate. For example, I have tons of emails from Pluralsight so I dig one out from my friend "CU":

There’s no dot, rather a dash. Every single real Pluralsight email address I looked at was a dash rather than a dot, yet when I delved into the alleged

LinkedIn
ORG

data and dig out another sample Pluralsight address, here’s what I found:

That’s not LM’s real address because it has a dot instead of a dash. Every. Single. One. Is. Fake.

Let’s try this the other way around and load up the existing breached accounts in

HIBP
ORG

for the domain of

one
CARDINAL

of

EM
ORG

‘s alleged email addresses and see how they’re formed:

That’s definitely not the same format as

EM
ORG

‘s address, not by a long shot. And time and time again, the same pattern of addresses in the corpus of data in the original tweet emerged, drawing me to what seems to be a pretty logical conclusion:

Each email address was fabricated by taking the actual domain of a company the individual legitimately worked at and then constructing the alias from their name.

And these are legitimate companies too because every single

LinkedIn
ORG

profile I checked had all the cues of accurate information and each domain I checked in the corpus of data was indeed the correct one for the company they worked at. I imagine someone has effectively worked through the following logic:

Get a list of

LinkedIn
ORG

profiles whether that be by ID or username or simply parsing them out of crawler results

Scrape
ORG

the profiles and pull down legitimate information about each individual, including their employment history Resolve the domain for each company they worked at and construct the email addresses Profit?

On that final point, what is the point? The data wasn’t being sold in that original tweet, rather it was freely downloadable. But per the date on

EM
ORG

‘s profile, the data could have been obtained much earlier and previously monetised. And on that, the date wasn’t constant across records, rather there was a broad range of them

as recent as July last year
DATE

and as old as… well, I stopped when the records got older than me. What is this?!

I suspect the answer may partly lie in the column headings which I’ve pasted here in their entirety:

"PROFILE_KEY", "PROFILE_USERNAMES", "

PROFILE_SPENDESK_IDS
PERSON

", "PROFILE_LINKEDIN_PUBLIC_IDENTIFIER", "

PROFILE_LINKEDIN_ID
ORG

", "PROFILE_SALES_NAVIGATOR_ID", "

PROFILE_LINKEDIN_MEMBER_ID
ORG

", "PROFILE_SALESFORCE_IDS", "PROFILE_AUTOPILOT_IDS", "PROFILE_PIPL_IDS", "PROFILE_HUBSPOT_IDS", "PROFILE_HAS_LINKEDIN_SOURCE", "

PROFILE_HAS_SALES_NAVIGATOR_SOURCE
WORK_OF_ART

", "PROFILE_HAS_SALESFORCE_SOURCE", "PROFILE_HAS_SPENDESK_SOURCE", "PROFILE_HAS_ASGARD_SOURCE", "PROFILE_HAS_AUTOPILOT_SOURCE", "

PROFILE_HAS_PIPL_SOURCE
ORG

", "PROFILE_HAS_HUBSPOT_SOURCE", "PROFILE_FETCHED_AT", "PROFILE_LINKEDIN_FETCHED_AT", "

PROFILE_SALES_NAVIGATOR_FETCHED_AT
ORG

", "PROFILE_SALESFORCE_FETCHED_AT", "PROFILE_SPENDESK_FETCHED_AT", "PROFILE_ASGARD_FETCHED_AT", "

PROFILE_AUTOPILOT_FETCHED_AT
WORK_OF_ART

", "

PROFILE_PIPL_FETCHED_AT
DATE

", "PROFILE_HUBSPOT_FETCHED_AT", "PROFILE_LINKEDIN_IS_NOT_FOUND", "PROFILE_SALES_NAVIGATOR_IS_NOT_FOUND", "PROFILE_EMAILS", "PROFILE_PERSONAL_EMAILS", "

PROFILE_PHONES
ORG

", "PROFILE_FIRST_NAME", "PROFILE_LAST_NAME", "PROFILE_TEAM", "

PROFILE_HIERARCHY
WORK_OF_ART

", "PROFILE_PERSONA", "

PROFILE_GENDER
WORK_OF_ART

", "PROFILE_COUNTRY_CODE", "

PROFILE_SUMMARY
GPE

", "PROFILE_INDUSTRY_NAME", "PROFILE_BIRTH_YEAR", "

PROFILE_MARVIN_SEARCHES
WORK_OF_ART

", "PROFILE_POSITION_STARTED_AT", "

PROFILE_POSITION_TITLE
WORK_OF_ART

", "PROFILE_POSITION_LOCATION", "

PROFILE_POSITION_DESCRIPTION
WORK_OF_ART

", "PROFILE_COMPANY_NAME", "PROFILE_COMPANY_LINKEDIN_ID", "PROFILE_COMPANY_LINKEDIN_UNIVERSAL_NAME", "PROFILE_COMPANY_SALESFORCE_ID", "

PROFILE_COMPANY_SPENDESK_ID
ORG

", "

PROFILE_COMPANY_HUBSPOT_ID
ORG

", "

PROFILE_SKILLS
ORG

", "

PROFILE_LANGUAGES
WORK_OF_ART

", "

PROFILE_SCHOOLS
WORK_OF_ART

", "

PROFILE_EXTERNAL_SEARCHES
PERSON

", "PROFILE_LINKEDIN_HEADLINE", "

PROFILE_LINKEDIN_LOCATION
PERSON

", "PROFILE_SALESFORCE_CREATED_AT", "PROFILE_SALESFORCE_STATUS", "PROFILE_SALESFORCE_LAST_ACTIVITY_AT", "

PROFILE_SALESFORCE_OWNER_CONTACT_ID
ORG

", "PROFILE_SALESFORCE_OWNER_CONTACT_NAME", "PROFILE_SPENDESK_SIGNUP_AT", "PROFILE_SPENDESK_DELETED_AT", "PROFILE_SPENDESK_ROLES", "PROFILE_SPENDESK_AVERAGE_NPS_SCORE", "PROFILE_SPENDESK_NPS_SCORES_COUNT", "PROFILE_SPENDESK_FIRST_NPS_SCORE", "

PROFILE_SPENDESK_LAST_NPS_SCORE
PERSON

", "

PROFILE_SPENDESK_LAST_NPS_SCORE_SENT_AT
NORP

", "PROFILE_SPENDESK_PAYMENTS_COUNT", "PROFILE_SPENDESK_TOTAL_EUR_SPENT", "PROFILE_SPENDESK_ACTIVE_SUBSCRIPTIONS_COUNT", "PROFILE_SPENDESK_LAST_ACTIVITY_AT", "PROFILE_AUTOPILOT_MAIL_CLICKED_COUNT", "

PROFILE_AUTOPILOT_LAST_MAIL_CLICKED_AT
NORP

", "PROFILE_AUTOPILOT_MAIL_OPENED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_OPENED_AT", "PROFILE_AUTOPILOT_MAIL_RECEIVED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_RECEIVED_AT", "

PROFILE_AUTOPILOT_MAIL_UNSUBSCRIBED_AT
ORG

", "PROFILE_AUTOPILOT_MAIL_REPLIED_AT", "PROFILE_AUTOPILOT_LISTS", "PROFILE_AUTOPILOT_SEGMENTS", "PROFILE_HUBSPOT_CFO_CONNECT_SLACK_MEMBER_STATUS", "PROFILE_HUBSPOT_IS_CFO_CONNECT_MEETUPS_MEMBER", "

PROFILE_HUBSPOT_CFO_CONNECT_AREAS_OF_EXPERTISE
WORK_OF_ART

", "

PROFILE_HUBSPOT_CORPORATE_FINANCE_EXPERIENCE_YEARS_RANGE
WORK_OF_ART

"

Check out some of those names:

LinkedIn
ORG

is obviously there, but so is

Salesforce
PRODUCT

and

Spendesk
PERSON

and

Hubspot
ORG

, among others. This reads more like an aggregation of multiple sources than it does data solely scraped from

LinkedIn
ORG

. My hope is that in posting this someone might pop up and say "I recognise those column headings, they’re from…" Who knows.

So, here’s where that leaves us: this data is a combination of information sourced from public

LinkedIn
ORG

profiles, fabricated emails address and in part (anecdotally based on simply eyeballing the data this is a small part), the other sources in the column headings above. But the people are real, the companies are real, the domains are real and in many cases, the email addresses themselves are real. There are over

1.8k
CARDINAL


HIBP
ORG

subscribers in the data set and this is folks that have double opted-in so they’ve successfully received an email to that address in the past. Further, when the data was loaded into

HIBP
ORG

there were

nearly a million
CARDINAL

email addresses that were already in the system so evidently, they were addresses that had previously been in use. Which stands to reason because even if every address was constructed by an algorithm, the pattern is common enough that there’ll be a bunch of hits.

Because the conclusion is that there’s a significant component of legitimate data in this corpus, I’ve loaded it into

HIBP
ORG

. But because there are also a significant number of fabricated email addresses in there, I’ve flagged it as a spam list which means the addresses won’t impact the scale of anyone’s paid subscription if they’re monitoring domains. And whilst I know some people will suggest it shouldn’t go in at all, time and time again when I’ve polled the public about similar incidents the overwhelming majority of people have said "we want to know about it then we’ll make up our own minds what action needs to be taken". And in this case, even if you find an email address on your domain that doesn’t actually exist, that person who either currently works at your company or previously did has still had their personal data dumped in this corpus. That’s something most people will still want to know.

Lastly,

one
CARDINAL

of the main reasons I decided to invest

hours
TIME

into this

today
DATE

is that I loathe disinformation and I hate people using that to then make statements that are completely off base. I’m looking at my Twitter feed now and see people angry at

LinkedIn
ORG

for this, blaming an insider due to recent layoffs there, accusing them of mishandling our data and so on and so forth. No, not this time, the evidence has led us somewhere completely different.