
OSINT without APIs

Feb 06, 2022

We recently published a bunch of posts about the top 5 APIs for Threat Intelligence, Attack Surface Monitoring, Security Assessments and People Investigations, but in this post we’ve asked hakluke to write about OSINT/reconnaissance techniques that don’t leverage any APIs – best of all, they are all free techniques you can use yourself with your own scripts, or in SpiderFoot.


Introduction


APIs are great – they make things almost too easy because data is validated and gathered for you, then served to you on a beautifully formatted JSON platter. They do have their downsides though – for example:


  • Data may not be up-to-date
  • Data may not be comprehensive
  • They usually cost money
  • APIs do not exist for some types of data

For these reasons, it is nice to be comfortable with collecting the data yourself as a supplement to the data you gather from APIs. In this article we’re going to cover some techniques for doing just that, along with some scripts and code snippets you can use to automate the process.


Techniques


DNS


DNS data has always been a treasure trove for OSINT practitioners, and in this section we will cover a few different techniques that you can use to automate the extraction of useful data from DNS records, starting with zone transfers!


Zone Transfers


DNS zone transfers are a mechanism for replicating DNS databases across a set of DNS servers. They have no real operational use other than this, so they should only be allowed between DNS servers that explicitly trust each other (e.g. servers belonging to the same organization). Sometimes, though, DNS servers are configured to respond to zone transfer requests (the AXFR DNS query type) from anyone who asks. This is typically considered bad security practice because it allows anyone to instantly dump DNS zones and gain insight into all the hosts and subdomains within an organization.
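
If you want to try this manually against a single domain, you can request a zone transfer with dig by pointing it at one of the domain's authoritative nameservers. The domain and nameserver below are placeholders; a correctly configured server will refuse with something like "Transfer failed."

# Find the domain's authoritative nameservers, then request a full zone transfer from one of them
dig NS example.com +short
dig AXFR example.com @ns1.example.com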


Many people think that zone transfers are a relic of the 90s, but based on my tests against the Majestic Million, there are still a lot of domains configured this way. I have also heard reports of people who have tested 300 million domains and seen a response rate of about 1%.


If you’re interested in exploring this further, I wrote a simple Golang tool to perform AXFR requests against domains en masse, which you can find here. Here’s a screenshot of the tool in action.



Brute-Forcing


If you’re investigating a domain that falls into the unlucky 99%, another method for discovering an organization's subdomains is brute-forcing. The basic premise: pick a root domain, say example.com, take a large list of words, prepend each word to the root domain to form candidate subdomains, attempt to resolve each one, and keep track of the ones that resolve. For example:


  • blog.example.com
  • dev.example.com
  • jira.example.com
  • internal.example.com

There are many tools for this, but one good free option is subbrute because it utilizes open resolvers as a proxy to circumvent DNS rate-limiting.


The other piece to the puzzle is having a good wordlist. An excellent choice is the best-dns-wordlist.txt file from Assetnote’s wordlists page. It’s quite large though. If you’re looking for something smaller, try this one from Daniel Miessler’s “SecLists” repository.
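
If you just want a quick-and-dirty version of this without installing anything, a minimal bash sketch looks something like the following. The root domain and wordlist filename are placeholders, and unlike purpose-built tools it doesn't handle wildcard DNS, retries or rate limiting.

# For each word in the wordlist, check whether <word>.example.com resolves
# (wildcard DNS will make everything appear to resolve, so check for that first in practice)
while read -r word; do
  if [ -n "$(dig +short "$word.example.com")" ]; then
    echo "$word.example.com resolves"
  fi
done < wordlist.txt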


TXT Records


DNS TXT records are simply a type of DNS resource that is used to store arbitrary text. Over the years, many different use cases for TXT records have been introduced including SPF, DKIM, DMARC and domain ownership verification. Like most other DNS records, anyone can look up TXT records. For example, these are the TXT records associated with spiderfoot.net.


spiderfoot.net. 3600 IN TXT "protonmail-verification=833bf01c36ac7f3bca7f1da3f5c5437c8ed47f13"
spiderfoot.net. 3600 IN TXT "v=spf1 include:_spf.protonmail.ch include:mailgun.org ~all"
spiderfoot.net. 1800 IN TXT "google-site-verification=0-sRWlSBGkThCz0IrS3Zp63yg5Suo4q7cryTn3MKAbE"


This information can be gathered by using the dig command, in this case the full command would be:


dig TXT spiderfoot.net


As you can see, the TXT records can divulge some useful information about an organization. In this case we can deduce that SpiderFoot most likely utilizes ProtonMail for their everyday email, Mailgun to send automated emails, and has verified domain ownership with Google at some point, probably to enable Google Analytics or something similar.


TXT records could be dumped en masse using a tool such as zdns. The command would look something like this:


cat domains.txt | zdns TXT -threads 20


The default output is JSON, which can be prettified or morphed using a tool such as jq, as in the screenshot below.
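
For example, assuming zdns's usual JSON structure (exact field names can vary between versions), a couple of jq one-liners might look like this:

# Pretty-print everything
cat domains.txt | zdns TXT -threads 20 | jq .

# Or print just "domain: TXT value" pairs for successful lookups (field names assumed)
cat domains.txt | zdns TXT -threads 20 | jq -r 'select(.status == "NOERROR") | .name as $d | .data.answers[]?.answer | "\($d): \(.)"'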



Port Scanning and Banner Grabbing


One of the most important reconnaissance techniques is port scanning. The most well-known port scanner is Nmap, not least because it has been actively maintained since September 1997. Essentially, there are 65535 TCP ports and 65535 UDP ports. A port scanner connects to a selection of these ports (or potentially all of them) to determine which ones are open. Whenever an open port is discovered, we can deduce that a service of some sort is running on that port.


In order to figure out what that service is, we need to utilize a technique called banner grabbing. Banner grabbing is just analyzing the responses from the connections to determine what service is running, i.e. we connect, grab the banner and analyze it.


We can grab these banners manually by utilizing a tool such as netcat. In the example below we connect to Gmail’s SMTP server using netcat. The server responds with a 220 code, indicating that the SMTP service is ready to receive connections.


~$ nc smtp.gmail.com 587


220 smtp.gmail.com ESMTP f7sm4144531pfc.21 - gsmtp
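
To script this over many hosts without Nmap, a rough sketch might look like the following. It uses bash's built-in /dev/tcp rather than netcat (which avoids behavioural differences between netcat versions), the hosts.txt input file is hypothetical, and port 22 is used here because SSH servers send their banner immediately on connect.

# Grab the first line (the banner) from port 22 of each host, with a 3 second timeout
while read -r host; do
  echo "== $host =="
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/22 && head -1 <&3" 2>/dev/null
done < hosts.txt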


Nmap maintains a list of banners mapped to service names, which is very helpful for automating the process of determining the services running on multiple hosts.


The screenshot below shows the output of an Nmap scan on port 587 of smtp.gmail.com with service detection enabled.



As you can see, Nmap correctly determines that the service running on that port is Google gsmtp. The full command utilized to perform this task was:


nmap -Pn -p 587 -sV smtp.gmail.com


  • -Pn tells Nmap not to send an ICMP ping request to check whether the host is alive before scanning (because many hosts do not respond to ping requests).
  • -p 587 tells Nmap to only scan port 587, just so that I could run the scan quickly for this demo.
  • -sV tells Nmap to perform service detection by utilizing banner grabbing.

In order to automate this task across multiple hosts, we can pass Nmap a file containing hostnames and/or IP addresses using the -iL option. For example, the following command runs an aggressive scan (-A, which enables service detection, OS detection, script scanning and traceroute) across all TCP ports (-p-) for every host in domains.txt:


nmap -A -p- -iL domains.txt


Web Spidering


When performing OSINT on a website, it is useful to know which endpoints and assets are present in, or utilized by, that application. One fairly efficient method of determining this from a black-box perspective is web spidering.


Spidering is essentially automating the process of visiting a website, following every link and exercising every function within the site in order to map out its functionality. I wrote a CLI tool called hakrawler that achieves this, although it doesn’t handle SPAs (Single Page Applications) very well. Another excellent option is the spider feature included with Burp Suite.
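
Invocation depends on which hakrawler version you have installed; newer releases read target URLs from stdin, so a run looks roughly like this (check hakrawler -h for the flags available in your version):

echo "https://www.spiderfoot.net" | hakrawler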


Here’s hakrawler in action:



Spidering can also be a useful mechanism for determining the different technologies that might be in use. For example, the following screenshot shows that WordPress is used by www.spiderfoot.net, along with a bunch of client-side JavaScript dependencies that were discovered.



Web Scraping


Web scraping is the process of downloading and processing HTTP responses in order to extract useful information in an automated fashion. For example, if you wanted to get a list of people who work at a company from LinkedIn, you might use an automated web scraping tool to collate a list of those profiles and then extract information from each of them – although you shouldn’t do this, as it is against LinkedIn’s terms of service.


A simplified example of web scraping is extracting emails from HTTP responses using grep. For example, take this command:


~$ curl -s https://www.spiderfoot.net | grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b"


[email protected]


The curl command simply prints the HTTP response from https://www.spiderfoot.net, and the grep command contains a regular expression that finds and prints all email addresses within the source code. You can see that in this case it found one (shown above). A similar process could be followed to extract other information such as phone numbers, names, hostnames, etc.
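
As a rough illustration of that, the same curl-and-grep pattern can pull out anything that looks like a hostname instead of an email address (the regular expression here is deliberately loose and will catch some noise):

curl -s https://www.spiderfoot.net | grep -E -o "[A-Za-z0-9][A-Za-z0-9.-]*\.[A-Za-z]{2,}" | sort -u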


If you feel comfortable doing some light coding, there are a few great Python frameworks for creating your own custom web scraper. I’d recommend checking out Beautiful Soup.


WHOIS


Essentially, WHOIS is a protocol used for querying databases that store the registered users or assignees of internet resources such as domain names and IP address ranges. WHOIS records are an absolute treasure trove of information when researching domains, IP addresses and ASNs, although they are becoming less useful as time goes on because of GDPR regulations, as well as people utilizing WHOIS privacy services to mask their true identity.


An excellent example of a WHOIS record with full details can be seen below for facebook.com. In order to query these details I simply used the following command:


whois facebook.com



Running the same command for a Facebook-owned IP address or the Facebook ASN returns similar information. These two commands both return the same result:


whois 157.240.8.35


whois 32934


The result:



Enumerating Social Media Accounts


When you are investigating a username, one technique that can yield excellent results is checking for the existence of that username (and variants of it) across multiple social media networks. For example, if we were investigating the username “hakluke”, we could check for its presence on multiple social platforms by requesting the URLs where that username would live, such as:


https://twitter.com/hakluke


https://github.com/hakluke


https://pinterest.com/hakluke


https://instagram.com/hakluke


https://www.reddit.com/user/hakluke/


There are some services that do this, such as https://namechk.com/, but this blog post is all about OSINT without APIs, so let’s build our own automated solution. Essentially, we need to write a script that performs basic signature checking. For a simple example of how this can be achieved, see the following bash script, which checks a list of usernames to see whether they exist on GitHub.


First I created a file with a list of users in it, called usernames.txt.



Then I created a bash script which accesses https://github.com/<username> and checks whether the response code is 200. If the response code is 200 we know that the user exists, otherwise we know that it does not.


The code looks like this:


for username in $(cat usernames.txt); do
  if curl -s -o /dev/null -w "%{http_code}" "https://github.com/$username" | grep -q 200; then
    echo "User $username exists at: https://github.com/$username"
  else
    echo "User $username does not exist on GitHub"
  fi
done


Running the code will return the following output:



Of course, this technique could easily be expanded to check for accounts on many different social media platforms, and the script could be made more efficient with multithreading. A great resource for identifying usernames across more than 300 social media platforms is WebBreacher’s WhatsMyName JSON file and its associated script.


Enumerating Third Party Services


The same technique can also be utilized to enumerate the use of third-party services. For example, we can check for the existence of AWS S3 buckets by analysing the response from http://s3.amazonaws.com/<bucketname>, or check whether an organization utilizes Okta for SSO by navigating to https://<orgname>.okta.com. With a bit of imagination, you could utilize this technique to check for many different third-party services used by a specific organization.
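
As a sketch of how the S3 check might work: a request for a bucket that doesn't exist returns HTTP 404, while an existing bucket returns 200 (public) or 403 (it exists but isn't publicly accessible). The bucket name below is purely a guess for illustration.

# Check whether a guessed bucket name exists (bucket name is hypothetical)
bucket="example-org-backups"
code=$(curl -s -o /dev/null -w "%{http_code}" "http://s3.amazonaws.com/$bucket/")
echo "$bucket -> HTTP $code"   # 404 = no such bucket, 200/403 = bucket exists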


You can also utilize results from web spidering to discover third party services. For example, if you find references to an S3 bucket in the source code of their websites, there’s a good chance that the S3 bucket will be owned by that same organization.


Maintenance Indicators


One of my favourite techniques for attacking large organizations is searching for what I call “maintenance indicators” in responses. This helps me to prioritize the targets that I will attack first by finding hosts that are most likely not maintained frequently. Some examples are below:


  • The presence of broken links on a home page
  • Copyright messages at the bottom of pages that are 10+ years old (a quick check for this is sketched below)
  • Old version numbers extracted from HTTP headers or banner grabbing
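
For the copyright check in particular, a crude version is just grepping the homepage for anything that looks like a copyright year (the URL below is a placeholder):

curl -s https://www.example.com | grep -E -o "(©|[Cc]opyright)[^0-9]{0,10}(19|20)[0-9]{2}" | head -5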

Putting it All Together


The point of this article is to demonstrate that you can gather a lot of useful information manually, without utilizing any APIs or spending any money on expensive data sources. All of the data sources mentioned above are completely free, and all of the techniques discussed above can be automated with a bit of work.


And yes, you guessed it – SpiderFoot already automates all of the techniques outlined in this post. It doesn’t just collate data from API data sources; it also gathers information directly using these techniques and many more.


If you’re interested in seeing what SpiderFoot is capable of, you can check out the open source version here. Installation is easy and only takes a few minutes. Alternatively, you can trial the premium HX version here for free, without entering any credit card information.


Happy OSINTing!