TorScouter: why (and how) I'm crawling the entire Tor DeepWeb

If you have landed on this page, you were most probably visited by **TorScouter** and found this line in your logs:

> TorScouter/0.1b (+http://mgpf.it/TorScouter)

Well, yes, that’s me, and on this page you’re going to find out what I’m doing, what I’m keeping, why I’m doing it and what I intend to use this data for.

> <**tl;dr**>: I’m **crawling the entire space of Tor Hidden Services** from within the Tor Network itself to map all of its reachable web content. I’m doing it just because it’s possible, and because I’m going to run a Hidden Service search engine, most probably not a public one. But I may release some or all of the data in the future, so keep an eye on this page or [drop me a line][1].

## What is the TorScouter architecture?
These are the different components of the TorScouter architecture:

* **Crawler**: Every time the system finds a new Hidden Service, a Crawler instance is started on one of several servers running in parallel. Each page allowed by [robots.txt][r] is fetched, read and indexed. Every link on the page is parsed, and if it points to a new Hidden Service, that service is passed to the **Discovery** process.
  The system parses and stores the following information:
    * title of the page
    * .onion address and path
    * text rendered from the HTML
    * keywords for the full-text index
    * …no attachments, images or other files are downloaded or indexed
* **Discovery**: Every time a new, unknown Hidden Service is found, the Discovery process records the address, tries to contact it and saves the address, title, textual content and last_seen date. If the Hidden Service responds, a **Crawler** instance is run on it. Every day another process tries to contact every service on the list, and if a previously offline service is found online, the system runs a **Crawler** on it. Each service is crawled again about a month after its last complete crawl;
* **Indexer**: A side process adds the textual content of every page to a full-text index and prepares it for searching;
* **Search Engine**: A (very crude) web Search Engine uses the **Indexer** to search across the whole database.
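
The hand-off from the Crawler to Discovery can be illustrated with a short sketch. This is not the TorScouter code (which is written in Ruby); it is a minimal Python illustration using only the standard library. The names `LinkCollector`, `onion_host` and `discover` are hypothetical, and the pattern assumes that service names are either 16 or 56 base32 characters followed by `.onion`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumption: service names are 16 or 56 base32 characters plus ".onion".
ONION_HOST = re.compile(r"^(?:[a-z2-7]{16}|[a-z2-7]{56})\.onion$")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def onion_host(url):
    """Return the bare .onion service name for a URL, or None."""
    host = (urlsplit(url).hostname or "").lower()
    # Keep only the last two labels so that www.xyz.onion maps to xyz.onion.
    service = ".".join(host.split(".")[-2:])
    return service if ONION_HOST.match(service) else None


def discover(page_url, html, known_services):
    """Return the Hidden Services linked from the page that are not yet known."""
    parser = LinkCollector()
    parser.feed(html)
    found = set()
    for href in parser.hrefs:
        # Resolve relative links against the URL of the page being crawled.
        service = onion_host(urljoin(page_url, href))
        if service and service not in known_services:
            found.add(service)
    return found
```

Calling `discover(page_url, html, known_services)` on each crawled page yields the set of addresses that would be handed over to the Discovery process; links to the clearnet, relative links and already known services are filtered out.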

## What have you indexed?

As of the last update of this document *(June 22nd, 2014)* the system is:

* indexing **~10M pages** and
* **~3500 unique** Hidden Services…
* of which ~2000 are online.

The crawler is nowhere near its full crawling and indexing capacity yet: for now it only runs on several low-end machines.

## How may I block the crawling of my Hidden Service?

By using the correct [robots.txt][r] directives: the crawler obeys them.
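
As an illustration, the sketch below embeds a hypothetical `robots.txt` and checks it with Python’s standard `urllib.robotparser`. It assumes that the crawler matches the `TorScouter` token taken from the User-Agent line quoted at the top of this page; that token, the `/private/` path and the helper `is_allowed` are assumptions made for the example, not documented behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the "TorScouter" token is an assumption derived
# from the User-Agent string found in the logs, not a documented value.
ROBOTS_TXT = """\
User-agent: TorScouter
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With this file, `is_allowed(ROBOTS_TXT, "TorScouter/0.1b (+http://mgpf.it/TorScouter)", "/index.html")` is false, while other crawlers are only kept out of `/private/`.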

## Why are you doing it?

Because I like coding and because it is a very challenging task. It involves concurrency, multithreading, indexing, searching and knowing the Tor architecture. I’m doing it with Ruby, Sidekiq, Redis, Mongo and ElasticSearch.

## What are you doing with the data?

Right now I’m running a private Search Engine for academic purposes, sharing the data with some trusted third parties and friends interested in analysing the Hidden Service phenomenon. Once the first complete indexing is finished, this work may (or may not) lead to one or several of the following projects:

* offering commercial access to the search engine;
* publishing a sociological paper on the content and themes of Hidden Services;
* releasing a complete list of Hidden Services in a live online directory;
* publishing technical statistics online (number of servers, technology, banners, etc.);
* …releasing all the data in a huge dump ;)

## May I contact you?

By all means! Contact me (at the end of [this page][1]) if you’re interested in the data from a commercial, academic or just-for-fun point of view ;)

[1]: http://mgpf.it/who-is-matteo-flora
[r]: http://www.robotstxt.org/

Matteo Flora