Web wordlists in 2021

Benchmark of web wordlists available in 2021.

March 3, 2021

Web content wordlists

Summary

Note: The original article was posted on my company blog https://blog.sec-it.fr/.

Perimeter discovery is an important step of a web pentest and can, in some cases, lead to the compromise of a website. Several resources are available to carry out this reconnaissance, including web content wordlists for web fuzzing:

Name | First release | Last update* | Max size (lines)
SecLists | 2012/02/20 | 2021/02/12 | 1,273,833 (directory-list-2.3-big.txt)
Assetnote wordlists | 2020/11/16 | 2021/01/28 | 4,319,406 (httparchive_js_2020_11_18.txt)
Dirb wordlists | 2015/06/16 | 2015/06/16 | 20,469 (big.txt)
DirBuster wordlists | 2013/05/01 | 2013/05/01 | 220,560 (directory-list-2.3-medium.txt)
Dirsearch dicc.txt | 2013/05/22 | 2021/02/10 | 9,021 (dicc.txt)
Wfuzz wordlists | 2014/10/23 | 2019/03/14 | 45,459 (megabeast.txt)
Wordlistctl (Bonus) | 2018/10/28 | 2018/11/02 | N/A

* This post was written in February 2021.

Note that this post only covers wordlists of routes, files and folders. Password wordlists such as rockyou.txt are therefore out of scope.

SecLists

SecLists is a collection of multiple types of wordlists, including usernames, passwords, URLs, sensitive data patterns, fuzzing payloads, web shells, and many more.

SecLists is the security tester’s companion. […] The goal is to enable a security tester to pull this repository onto a new testing box and have access to every type of list that may be needed.

The repository is actively maintained: at the time of writing, its latest commit was less than two weeks old. It is packaged by most pentesting Linux distributions, such as BlackArch and Kali Linux.

The wordlists covered here are located in Discovery/Web-Content/. There are a lot of them (121 in the main folder). Some are specific to a given technology (CGIs.txt, coldfusion.txt, oracle.txt …), others to a given language (common-and-french.txt, common-and-dutch.txt …). The main wordlist family in SecLists is the “RAFT Word Lists”.

The RAFT wordlists were generated from the robots.txt files of 1.7 million websites and were originally provided with the RAFT tool in 2011. In this family, wordlists are organized as follows:

  • 4 families (directories, extensions, files and words)
  • 3 sizes per family (large, medium and small)
  • 2 case options (normal and lowercase)

Name | Large (lines) | Medium (lines) | Small (lines)
raft-*-directories.txt | 62,283 | 30,000 | 20,116
raft-*-directories-lowercase.txt | 56,163 | 26,584 | 17,770
raft-*-files.txt | 37,042 | 17,128 | 11,424
raft-*-files-lowercase.txt | 35,324 | 16,243 | 10,848
raft-*-extensions.txt | 2,449 | 1,289 | 963
raft-*-extensions-lowercase.txt | 2,366 | 1,233 | 914
raft-*-words.txt | 119,600 | 63,087 | 43,003
raft-*-words-lowercase.txt | 107,982 | 56,293 | 38,267
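
The naming scheme above (family, size, case) can be enumerated programmatically, which is handy when looping over the whole RAFT family. The following is a minimal sketch, assuming the file names follow the raft-<size>-<family>[-lowercase].txt pattern shown in the table; the function name raft_filenames is hypothetical.

```python
from itertools import product

# The three axes of the RAFT family, as listed in the post.
FAMILIES = ("directories", "files", "extensions", "words")
SIZES = ("large", "medium", "small")
CASES = ("", "-lowercase")  # normal case has no suffix


def raft_filenames():
    """Return the 24 RAFT file names (4 families x 3 sizes x 2 cases)."""
    return [
        f"raft-{size}-{family}{case}.txt"
        for family, size, case in product(FAMILIES, SIZES, CASES)
    ]


if __name__ == "__main__":
    for name in raft_filenames():
        print(name)
```

Each name can then be joined to the Discovery/Web-Content/ folder of a local SecLists checkout.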

Looking at raft-*-files.txt, we get the following distribution of extensions:

[Figures: histogram and pie chart of the extension distribution for the raft large, medium and small files wordlists]
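
An extension distribution of this kind can be approximated with a few lines of Python. This is a sketch under assumptions, not the script used for the original charts: the function name extension_distribution and the "(none)" label are hypothetical, extensions are compared case-insensitively, and dotfiles such as .htaccess are counted as having no extension.

```python
import posixpath
from collections import Counter


def extension_distribution(lines):
    """Count entries per file extension (case-insensitive)."""
    counts = Counter()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        # splitext only looks at the last path component.
        _, ext = posixpath.splitext(entry)
        counts[ext.lower() or "(none)"] += 1
    return counts


if __name__ == "__main__":
    sample = ["index.php", "admin.PHP", "robots.txt", ".htaccess", "README"]
    for ext, count in extension_distribution(sample).most_common():
        print(ext, count)
```

To run it on a real wordlist, pass an open file object instead of the sample list, for instance one opened with errors="replace" since some wordlists contain non-UTF-8 bytes.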

SecLists also includes the wordlists shipped with DirBuster and dirb, covered later in this post.

Assetnote wordlists

Assetnote is a company that provides security tools and services to measure exposure to external attack. The company also provides a repository named Assetnote Wordlist.

These wordlists are generated monthly from Google BigQuery datasets using commonspeak2, their Go client, and cover both content discovery and subdomains.

As these datasets are updated on a regular basis, the wordlists generated via Commonspeak2 reflect the current technologies used on the web.

[Figure: Assetnote Wordlist]

Wordlists are generated per technology. This post focuses on directories, API routes and the PHP, ASP.NET and JSP/JSPA technologies.

Note: As the January 2021 wordlists seem less complete than the previous ones, and the February 2021 wordlists were not available at the time of writing, we will focus on the November 2020 wordlists.

Name | Technology | Size (lines)
httparchive_directories_1m_2020_11_18.txt | Directories | 1,000,000
httparchive_apiroutes_2020_11_20.txt | API routes | 953,011
httparchive_php_2020_11_18.txt | PHP | 74,887
httparchive_aspx_asp_cfm_svc_ashx_asmx_2020_11_18.txt | ASP.NET | 63,200
httparchive_jsp_jspa_do_action_2020_11_18.txt | JSP | 10,506

[Figures: Assetnote Directories, Assetnote API routes]

Note: /, - and _ are considered as a wildcard in the previous graph.

Dirb wordlists

Dirb is a web discovery tool already covered in a previous post. The tool ships with multiple wordlists, the most common being:

Name | Size (lines)
common.txt (default wordlist for dirb) | 4,614
big.txt | 20,469
small.txt | 959

[Figures: charsets in the dirb family (big.txt, common.txt, small.txt)]

These wordlists don’t contain any extensions, and only 2% of the words contain capital letters. Note also that common.txt contains more “other” charsets than big.txt.
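
Charset statistics like these can be sketched as follows. The category names and their regular expressions are hypothetical, since the post does not define the exact buckets used in the charts. The ignore parameter is one possible reading of the wildcard note used for other graphs in this post: characters listed there are dropped before classification.

```python
import re

# Hypothetical buckets; the first matching pattern wins.
CHARSETS = [
    ("lowercase", re.compile(r"[a-z]+")),
    ("uppercase", re.compile(r"[A-Z]+")),
    ("digits", re.compile(r"[0-9]+")),
    ("lowercase+digits", re.compile(r"[a-z0-9]+")),
    ("mixed case", re.compile(r"[A-Za-z]+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
]


def classify(word, ignore=""):
    """Return the charset bucket of a word, or "other" if none matches."""
    stripped = word.translate(str.maketrans("", "", ignore))
    for name, pattern in CHARSETS:
        if pattern.fullmatch(stripped):
            return name
    return "other"


def capital_share(words):
    """Fraction of words containing at least one capital letter."""
    words = list(words)
    if not words:
        return 0.0
    return sum(any(c.isupper() for c in w) for w in words) / len(words)


if __name__ == "__main__":
    for w in ["admin", "Admin", "admin2", "admin-panel", "~backup"]:
        print(w, classify(w), classify(w, ignore="/-_"))
```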

DirBuster wordlists

DirBuster is a web discovery tool that has also been covered in a previous post. The tool ships with multiple wordlists, including the directory-list-2.3 family.

Name | Size (lines)
directory-list-2.3-big.txt | 1,273,833
directory-list-2.3-medium.txt | 220,560
directory-list-2.3-small.txt | 87,664

Some packaged versions may not include directory-list-2.3-big.txt.

Like the dirb wordlists, the directory-list-2.3 family doesn’t include any extensions.

[Figures: charsets in the directory-list-2.3 family (directory-list-2.3-big.txt, directory-list-2.3-medium.txt, directory-list-2.3-small.txt)]

Note: /, - and _ are considered as a wildcard in the previous graph.

Dirsearch dicc.txt

dicc.txt is the wordlist provided with the dirsearch tool. Its particularity is the %EXT% extension placeholder, so it must be used with tools that support the %EXT% format (see the post about web discovery tools). The wordlist has a total of 9,021 lines, distributed as follows:

[Figures: distribution of dicc.txt entries]

Note that “only” 500 words contain the %EXT% placeholder.
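
For tools that do not understand %EXT%, the placeholder can be expanded beforehand. This is a minimal sketch of the idea, not dirsearch's actual implementation, which may handle more cases; the function name expand_ext is hypothetical.

```python
def expand_ext(lines, extensions):
    """Yield wordlist entries with %EXT% replaced by each extension.

    Entries without the placeholder are yielded unchanged.
    """
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if "%EXT%" in entry:
            for ext in extensions:
                yield entry.replace("%EXT%", ext)
        else:
            yield entry


if __name__ == "__main__":
    sample = ["index.%EXT%", "admin/", "config.%EXT%.bak"]
    for candidate in expand_ext(sample, ["php", "asp"]):
        print(candidate)
```

With two extensions, each placeholder line produces two candidates, so the expanded list grows with the number of extensions tested.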

Wfuzz wordlists

The Wfuzz tool ships with a lot of wordlists. Some of them, in the “general” directory, are dedicated to directory and file enumeration: megabeast.txt, big.txt, medium.txt and common.txt. None of these wordlists contain words with extensions. They are distributed as follows:

[Figures: charsets in the wfuzz family (megabeast.txt, big.txt, medium.txt, common.txt)]

Wordlistctl (Bonus)

In some cases, an auditor may look for a specific wordlist. Wordlistctl is a tool designed to fetch, install, update and search for wordlists. This Python script offers more than 6400 wordlists and is maintained by the BlackArch Linux distribution.

$ wordlistctl search wordpress

--==[ wordlistctl by blackarch.org ]==--

    > wordpress (29.20 Kb)
    > urls-wordpress-3 (36.62 Kb)
    > wordpress-attacks-july2014 (88.00 B)
    > wordpress_usernames (541.57 Mb)
    > wordpress_attacks_july2014 (88 B)

$ wordlistctl fetch -l urls-wordpress-3
--==[ wordlistctl by blackarch.org ]==--

[*] downloading urls-wordpress-3.3.1.txt to /usr/share/wordlists/discovery/urls-wordpress-3.3.1.txt.part
[+] downloading urls-wordpress-3.3.1.txt completed

Comparative table

Without further ado, here is a comparative table of the wordlists discussed in this post. Colored cells represent a high correlation between wordlists. The matrix reads as follows: "N% of the wordlist in row Y is contained in the wordlist in column X".

For example, 87% of wordlist n°17 (dirb - small) is contained in wordlist n°0 (seclists - raft-large-files).
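
The reading rule of the matrix can be illustrated with a short sketch. It assumes that each cell is the share of distinct entries of the row wordlist that also appear in the column wordlist; the actual sec-it/WL-Comparison sources may compute it differently, and the names containment and containment_matrix are hypothetical.

```python
def containment(row_words, col_words):
    """Percentage of distinct entries of row_words found in col_words."""
    row_set, col_set = set(row_words), set(col_words)
    if not row_set:
        return 0.0
    return 100 * len(row_set & col_set) / len(row_set)


def containment_matrix(wordlists):
    """Build {row: {column: percentage}} from a {name: entries} mapping."""
    names = list(wordlists)
    return {
        row: {col: containment(wordlists[row], wordlists[col]) for col in names}
        for row in names
    }


if __name__ == "__main__":
    # Toy data, not taken from the real wordlists.
    lists = {
        "small": ["admin", "login"],
        "big": ["admin", "login", "backup", "test"],
    }
    for row, cells in containment_matrix(lists).items():
        print(row, cells)
```

The measure is not symmetric: in the toy data, all of "small" is contained in "big", but only half of "big" is contained in "small", which is why the matrix has to be read row by row.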

The sources used to generate this chart are available on this repository: sec-it/WL-Comparison.

An interactive version of the chart is available online.

About

The original article was published on my company’s blog https://blog.sec-it.fr/.

You can find SEC-IT at the address https://www.sec-it.fr.

Part of this content is under MIT License / © 2021 SEC-IT.