NASA is indexing the 'Deep Web' to show mankind what Google won't

There is a part of the Internet—most of it, in fact—that is hidden from Google. It is private, or illicit, or simply unknown. And NASA wants to help you reach it.

The space agency announced last month that it will join forces with the Defense Advanced Research Projects Agency (DARPA) to help make sense of that part of the Internet commonly referred to as the Deep or Dark Web. Most Internet users first heard about it, if they’ve heard about it at all, in the context of Silk Road, the now-defunct online drug marketplace that was hosted as a hidden service. Silk Road was only accessible through the anonymity network The Onion Router, or TOR.

Now, NASA’s mission to explore the universe includes the furthest reaches of cyberspace. “It’s uncharted territory,” Chris Mattmann, NASA’s lead on the project, told Fusion in a phone interview. In a press release, NASA explained that it will help DARPA with its Memex program, which is working to “access and catalog this mysterious online world.”

Perhaps government agencies hope flooding the Dark Web with sunshine will help clean the place up. In addition to being the go-to corner of the Web for scoring illicit drugs, the Internet’s hidden channels have historically harbored some pretty nasty illegal actors, including contract killers and pedophiles.

But much of the Deep Web, which by some estimates accounts for about 96 percent of the Internet, has nothing to do with TOR and is inaccessible for more mundane reasons. Some sites aren’t indexed by Google because they’re private (behind paywalls, for example) or simply not worth Google’s effort to index, like scientific data. That’s the kind of information that interests NASA’s Jet Propulsion Laboratory, because that’s where the data its spacecraft send back to Earth winds up.

The idea is to organize access to the Deep Web’s content and build a search engine alternative to Google that will give NASA a better way to access the data its machines upload. An intended byproduct, eventually, will be giving everyone more access to the hidden parts of the Internet.

The goal is not just to build “a really great search engine for bad things on the hidden Internet,” Mattmann, JPL’s principal investigator on Memex, told Fusion. When NASA spacecraft send information to Earth, it’s in a file format that Google isn’t very good at understanding. “[Those files] get in the second class of the web, that normally we call the Deep Dark Web…if you go to Google, you’re probably 10-30 clicks away from the science data, the actual information.”

NASA data ends up in this murky, hard-to-reach (but not inaccessible) part of the web because the data sets make sense to humans but not to the web crawlers that index the Internet. With Memex, Mattmann said, Web surfers will be “just one to four clicks away from the science data.”

It won’t be easy. Mattmann explained that “most people are good at [building search engines for] their specific domain but aren’t able to pivot.” Sites like Fandango and Yelp are only really good at developing search engines that cater to specific searches, like movies in your neighborhood or reviews of local businesses. Searching the Deep Web across several domains is much more complicated.

And, Mattmann said, Memex is going through “the same kinds of search engine growing pains” as all search engines. “Being able to understand which sites are relevant, where to start crawling from. A lot of these crawl operations can take days or weeks…Google didn’t develop rank initially.”

There are other, simpler versions of Memex already available. If you’ve ever used the Internet Archive’s Wayback Machine, which gives you past versions of a website not accessible through Google, then you’ve technically searched the Deep Web, Mattmann said.

But once Memex, which launched in September of last year, is fully realized, it could be a viable alternative to Google. Maybe. “I don’t know if any government program could be a competitor to a commercial entity,” Mattmann said, but Memex is doing something Google has no interest in. “It’s not in their bottom line,” Mattmann explained, to do the type of web crawling DARPA and NASA are prepared to do.

For DARPA, this is coming full circle: Previously known as ARPA, the agency developed something called ARPANET in the late 1960s. ARPANET was an early version of the Internet as we know it. Now DARPA will help make it easier to mine.

“What ARPANET was to the Internet,” Mattmann said, “this is to search and search engines.”

If that’s the case, we might be on the verge of a search revolution.

Danielle Wiener-Bronner is a news reporter.

 