Right now, most of my work time goes into collecting data from Dark Market. This process isn’t easy (especially given that Dark Market is under DDoS most of the time), and it’s also not the only part of my research. To create one of the posts you see on the channel, I need to go through many steps, some of them extremely time-consuming. To help you understand why my work takes so much time and why I don’t post every day, I want to describe my regular research process, using Dark Market as an example.
Data is the key to everything. To determine the price of certain drugs, measure vendors’ activity, or find out whether Dark Market is overstating its listing count, I need to collect a lot of information. In the last case, for example, to prove or refute the theory I need data from all catalog pages and listings: I have to determine whether every product in the catalogs is unique and search for deleted listings. To do that, I have to process more than 40,000 pages on the website!
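To make that check concrete, here is a minimal sketch, assuming the product links have already been scraped into a list of URLs. The function name and the variables are hypothetical, and a real onion site would also need a Tor SOCKS proxy, which I leave out here:

```python
# A minimal sketch of the uniqueness / dead-listing check.
# Assumes product_urls were already scraped from the catalog pages.
import requests

def summarize_listings(product_urls, claimed_count):
    """Compare unique catalog links against the marketplace's claimed total
    and count listings that respond with 404 (i.e. deleted products)."""
    unique_urls = set(product_urls)      # duplicates across catalog pages collapse here
    dead = 0
    for url in unique_urls:
        response = requests.get(url, timeout=30)
        if response.status_code == 404:  # listing was deleted
            dead += 1
    print(f"Claimed listings:   {claimed_count}")
    print(f"Unique links found: {len(unique_urls)}")
    print(f"Dead (404) links:   {dead}")
```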
Since the main step is collecting the data, I first need to define what to collect. I spend a lot of time looking through different parts of the website and deciding which data to grab. The rule is pretty simple, though: it’s better to have information than not to have it, so almost every time I collect as much as I can. In the case of the Dark Market listings, I only need the links to products from the catalogs and whether those products load or return a 404 error. But I will also collect the products’ names, categories, prices, shipping options, and much more, just to have this data. That way, after finishing the listings research, I will be able to discover more insights without having to recollect anything.
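In code, that “collect everything” habit looks roughly like the parser below. This is only a sketch: the CSS selectors are placeholders, not the real Dark Market markup, but the idea is the same, pull every field that might be useful later, not only the ones needed right now:

```python
# Hypothetical listing parser; selectors are placeholders for illustration.
from bs4 import BeautifulSoup

def parse_listing(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {
        # The link check only needs the URL/status, but extra fields are cheap to keep.
        "name": soup.select_one(".product-title").get_text(strip=True),
        "category": soup.select_one(".breadcrumb .category").get_text(strip=True),
        "price": soup.select_one(".price").get_text(strip=True),
        "shipping": [o.get_text(strip=True) for o in soup.select(".shipping-option")],
    }
```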
As soon as I understand what I want to see in the final table, I start writing the scraper. In almost all small and medium projects, writing the scraper takes most of the time. To collect the information, I need to tell my script which data to grab from the HTML of the website. Besides defining the tags and selectors, I also need to build the infrastructure of the code. If I collect information synchronously (page after page), the script will be very slow and it will take ages to collect all the data, so depending on the situation I prefer to use multiprocessing or multithreading (opening multiple pages at a time), which speeds the script up significantly. Writing the full scraper can take anywhere from an hour in easy cases to a couple of days in complicated ones.
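A minimal sketch of that multithreaded approach is below, using Python’s standard thread pool to fetch many pages in parallel instead of one after another. The `parse_listing()` helper is the hypothetical one from the sketch above, and again a real onion site would need a Tor proxy configured for `requests`:

```python
# Sketch of multithreaded page fetching with a thread pool.
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def scrape_all(product_urls, workers: int = 16):
    # Threads work well here because the job is I/O-bound: most of the time
    # is spent waiting for a slow hidden service to respond.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(fetch, product_urls))
    return [parse_listing(html) for html in pages]
```

Multithreading fits this case better than multiprocessing because the bottleneck is network waiting, not CPU work; for heavy parsing, a process pool would be the alternative.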
After the scraper is written (which doesn’t actually mean it’s finished; new errors will show up under rare conditions, forcing me to rewrite parts of the code), it’s time for the data collection itself. This process may take 10 minutes, 4 hours, a day, or a few months. For example, for my article about the influence of COVID-19 on the Russian online drug trade, I was parsing data from Hydra for 3–4 hours straight every day for the whole of April. So it depends on the goals of the research and the external environment (an ongoing DDoS or maintenance work on the website). With Dark Market, it’s been a full day since I started collecting the info: I’ve already processed more than 35k products, about 7k are still left, and I just can’t reach them because the marketplace is currently down.
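Because the marketplace keeps going down, the collection loop has to be able to resume. A rough, resume-friendly sketch (assuming results are appended to a file as they arrive, and reusing the hypothetical `fetch()` and `parse_listing()` helpers from above) might look like this:

```python
# Sketch of an interruption-tolerant collection loop.
import json
import time

def collect_with_retries(product_urls, out_path="listings.jsonl", retries=5):
    remaining = list(product_urls)
    for attempt in range(retries):
        still_failing = []
        with open(out_path, "a", encoding="utf-8") as out:
            for url in remaining:
                try:
                    record = parse_listing(fetch(url))   # helpers from the sketches above
                    out.write(json.dumps(record) + "\n")  # save immediately, nothing is lost on a crash
                except Exception:
                    still_failing.append(url)             # site down or page unreachable
        if not still_failing:
            break
        remaining = still_failing
        if attempt < retries - 1:
            time.sleep(600)  # wait before retrying while the marketplace recovers
```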
Let’s imagine for now that all of the data is collected. The next step is analyzing it. Using the same programming language, Python, I look for insights hidden in the rows of data. What is the average price of buds in England? How did the closure of borders in Europe affect drug sales? What suffered more, wholesale or retail trade? Answering all of those questions is research in itself. The last, but not least, step of my work is building the visualizations and describing them. That is the only side of the research process you actually see, and therefore it is crucial: an unclear graph with a poor explanation will destroy all the work behind it, no matter how big it was. So I have to do my best at that stage to turn rows of raw numbers into something like what you can see below.
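For a sense of what that analysis step looks like, here is a toy pass over the collected listings with pandas. The column names ("product", "country", "price_usd") and the file name are assumptions made for this sketch, not the real dataset; the question mirrors the one above about the average bud price in England, and the same DataFrame then feeds a quick chart:

```python
# Toy analysis and visualization pass over the collected listings.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json("listings.jsonl", lines=True)

# Example question: average bud price in England.
buds_uk = df[(df["product"] == "buds") & (df["country"] == "England")]
print("Average bud price in England:", buds_uk["price_usd"].mean())

# The same data feeds the visualizations, e.g. average listing price per country.
(df.groupby("country")["price_usd"].mean()
   .sort_values()
   .plot(kind="barh", title="Average listing price by country"))
plt.tight_layout()
plt.show()
```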
I think you now have a better understanding of my work and know why I need so much time to produce new research. The full process, from data collection to the final visualizations, takes days in most cases, but it really depends on the goal. My research on the influence of COVID-19 on online drug sales took me almost half a year, and I’m sure I will face even more complicated projects in the future.
My Telegram channel: @DrugStat_Int