Bypassing Chinese Internet Censorship: How I Built a Censored Microblog Aggregator


BY XIAOLEI LIU – Software Engineer @ Toptal

As is known worldwide, the Chinese government enforces strict censorship on the internet. The Chinese censorship system, commonly known as the Great Firewall of China, is operated by the Ministry of Public Security and is officially named the Golden Shield Project. The system has been in operation since 2003.

International news sites that often carry politically sensitive content, such as the New York Times, and social media sites that do not comply with censorship rules, such as Facebook and Twitter, are usually blocked and unavailable to Chinese users. This is accomplished using a variety of sophisticated methods.

For Chinese news and social media sites, virtually everything is under the government’s surveillance. In order to be allowed to operate, ISPs and internet content providers in China usually run their own content filtering mechanisms, which block or remove content published by their users, or even delete users’ accounts outright, if the content is deemed illegal under government policy. These companies have their own censorship software on their servers, as well as special teams or departments that manually handle the censorship tasks that automated software can’t manage. These teams cooperate with the local divisions of the Ministry of Public Security, receiving new orders and policies from them and often working alongside them.

For domestic web developers like me, Chinese internet censorship filters out not only our freedom of speech, but also valuable professional resources from around the world. In my daily work, I have to bypass internet censorship by connecting through a VPN to use Gmail, Dropbox, and many other crucial sites. I still remember how awkward it was in 2010, when Google’s services became unstable or inaccessible in China after Google refused to continue complying with censorship rules. This would be unbelievable for developers in other countries.

Censorship on Sina Weibo

Sina Weibo is the biggest microblogging social network site in China. Since Twitter does not comply with China’s rules, Weibo does not have to compete with it for users. News spreads more quickly and directly on Weibo than on any other media outlet in China. Members of the younger generations, such as myself, like to use it to share news and discuss public events. But of course, under Chinese internet censorship, many hot or interesting posts are deleted soon after they are posted. Posts about politics and public events are the most likely to be deleted, while entertainment news is the least likely. A 2013 study by computer scientists Jed Crandall and Dan Wallach found that about 12% of Chinese microblogs are being deleted every day.

On politically sensitive days like June 4th, a higher number of microblog posts can be expected to be deleted. On these days, users usually cannot even input certain sensitive words when they attempt to write a microblog.

What does it look like when a post gets censored? When you refresh your microblog feed on the site, you will often see something like this:


This is a censored Chinese microblog where content was removed by the government regulatory offices or the ISP.

This is the equivalent of a retweet, where the original message typically appears in the gray box. The box now reads “Sorry. The microblog has been deleted. Please see…” The original post was a mother’s plea for justice over the kidnapping, rape, and forced prostitution of her 11-year-old daughter in 2013.

2013 was a year in which many political scandals were revealed through the microblog platform. The popularity of Sina Weibo soared during this time. In response, the government got nervous and started to strengthen its censorship of the social media platform.

Before the microblog, young people like me who were interested in politics usually had to use proxy servers or tunneling services to hunt down sensitive news from international websites. Suddenly, we had a relatively open Chinese social network platform. But the government stepped in quickly, and it turned out to be just a flash in the pan. This really infuriated me. I talked with friends, and we were all angry about the strengthening of censorship on the platform. My friends would ask, “Why can’t we do anything about this?” I decided I would try. So I built a website to begin bypassing internet censorship to see what exactly was being blocked or deleted from Sina Weibo.

Technical Discussion

Basically, I needed to set up a server that constantly scanned for blocked or deleted Chinese microblogs and showed them on a new website. I had planned to use a domestic cloud service like Aliyun, but it turned out that the platform imposes many constraints, such as on domain redirecting, and its prices are no cheaper than those of other cloud services. Of course, my additional concern was that the server itself would be under surveillance if I deployed it domestically. So I ended up buying a server from Linode, located in Japan. I also bought the domain freeweibo.me to begin bypassing the censorship of Sina Weibo.

The following graph shows the overall architecture of the system: MongoDB, a web server, and a crawler. I chose Node.js for the development environment, as it is more efficient and scalable for network applications and, personally, I have more experience with it. The web server was developed using the Express.js framework, and used the Weibo API to capture data. Initially, the crawler was designed to be a separate process, but later I found that bundling it as a module in the web server process was sufficient for the early stage.

The content of a microblog has two major parts of interest. One is the text data and its relevant attributes. The other is the images affiliated with the post. To save a post, we also want to download the images and save them as files on the disk. For blocked or deleted blogs, these images are very important. In China it’s very common and popular to use images for posting text content, as this content is much more difficult to catch with automated text-based filtering and censoring on the servers of internet companies.

The basic idea of detecting blocked or deleted posts is to constantly scan for new posts from a known list of users, and then recheck the posts’ availability at a later time. A microblog could be deleted or blocked within several minutes or several days. Thus, the crawler consists of two main tasks: the fetch task, to fetch newly posted content, and the check task, to check whether previously posted content has been censored.
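To make the two tasks concrete, here is a minimal in-memory sketch of how they might fit together. It is not the site's actual code: the function names, the `status` field, and the `maxAgeMs` recheck window are all assumptions for illustration, and the Weibo API calls are replaced by plain callbacks (`listRecentPosts`, `isAvailable`) so the logic can be read without network details. In a real crawler those callbacks would be asynchronous requests, and the store would be MongoDB rather than a `Map`.

```javascript
// Hypothetical sketch of the crawler's two tasks. All names are illustrative.
// `store` is a Map from post id to a record; callbacks stand in for API calls.

// Fetch task: save every post we have not seen before and mark it as live.
function fetchTask(store, users, listRecentPosts, now) {
  let added = 0;
  for (const user of users) {
    for (const post of listRecentPosts(user)) {
      if (!store.has(post.id)) {
        store.set(post.id, { ...post, user, fetchedAt: now, status: 'live' });
        added += 1;
      }
    }
  }
  return added; // number of newly stored posts
}

// Check task: re-query posts still believed to be live and flag the ones
// that are no longer available. Posts older than `maxAgeMs` (an assumed
// cutoff) are retired so the recheck list does not grow without bound.
function checkTask(store, isAvailable, now, maxAgeMs) {
  const censored = [];
  for (const post of store.values()) {
    if (post.status !== 'live') continue;
    if (now - post.fetchedAt > maxAgeMs) {
      post.status = 'expired';
      continue;
    }
    if (!isAvailable(post.id)) {
      post.status = 'censored';
      post.censoredAt = now;
      censored.push(post.id);
    }
  }
  return censored; // ids newly detected as deleted or blocked
}
```

Because a post may disappear anywhere from minutes to days after it is published, the check task would need to run repeatedly over the same posts; the cutoff above is simply one way to decide when to stop looking.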

At first, I configured the crawler to crawl microblogs from the top 100 well-known users on Weibo. But it turned out that there were almost no deleted blogs being detected each day. The reason is that most of the top users have no interest in political or publicly sensitive topics – they never post or forward these kinds of microblogs. For example, this blogger, who is an actress with more than 10 million followers, is one of the most popular users, but she never posts sensitive blogs.

After some experimentation and thinking, I came up with a technique to adaptively find users who consistently get censored. The social media network is interconnected by topic, and users tend to gather in groups by interest. If a user has an interest in public or political topics, then they are more likely to post or forward blogs from other users with similar interests. These forwarded posts provide a good way to identify new users to scan.

For example, say user A is already in the database, and the crawler detects that one blog, which was reposted by user A, is deleted. If user B, the original author of the blog, is not in the database, then the crawler will save user B. Next time, when the crawler rescans new blogs, it will also scan new blogs from user B. Thus, the quantity of scannable users will automatically grow by harnessing this kind of social interest connection.

 

Chinese internet censorship can be bypassed by leveraging microblog behavior.
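The discovery rule described above (user A reposts a blog by user B, the blog is deleted, so B is added to the scan list) can be sketched in a few lines. This is an illustration under assumptions rather than the original implementation: the `repostOf` and `authorId` fields are hypothetical names for whatever the stored repost record holds, and the watch list is a plain `Set` instead of a database collection.

```javascript
// Hypothetical sketch of adaptive user discovery. `watched` is a Set of user
// ids the crawler already scans. `repost` is a stored post by a watched user;
// `repost.repostOf` (assumed field) describes the original post, if any.
function discoverAuthor(watched, repost, originalWasDeleted) {
  const original = repost.repostOf;
  if (!original || !originalWasDeleted) return null; // nothing to learn
  const author = original.authorId;
  if (watched.has(author)) return null; // already being scanned
  watched.add(author); // scan this author's new posts from now on
  return author;
}

// Example: user A is watched and reposted a blog by user B that was deleted.
const watched = new Set(['A']);
const repostByA = { id: 'r1', authorId: 'A', repostOf: { id: 'o1', authorId: 'B' } };
const discovered = discoverAuthor(watched, repostByA, true);
console.log(discovered, watched.size); // B 2
```

Only deleted originals trigger a discovery, which keeps the list biased toward users whose posts actually get censored rather than toward anyone who happens to be reposted.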

After tuning the crawler algorithm to take advantage of this methodology, I only needed to seed several key users who had strong interests in posting sensitive blogs, and the crawler automatically discovered new users to scan. The number of censored blogs detected each day rose steadily. The following is a snapshot of archived deleted blogs in my mailbox.

  • A historic dialogue by Mao Zedong rebuking a local official for not pulling down the ancient city wall of Chengdu.
  • A post about Xu Zhiyong, who is an active rights lawyer. He has helped many underprivileged people and started the New Citizen’s Movement in China. He was sentenced to jail in January, 2014.
  • Criticism of the government’s newspaper, People’s Daily.
  • A comment on the arrest and trial of Wang Gongquan, a billionaire in China and leader of the New Citizen’s Movement.
  • A reference to the arrest of activists who take part in social movements.

Results

After two weeks of coding and debugging my Chinese microblog bypassing system, I deployed the site to freeweibo.me. However, after it had been running for several weeks, the server stopped detecting new blogs. With some investigation I found two issues. One was that the Weibo platform had changed its original API. The other was that the crawler’s API requests were exceeding the rate limit (1000 per minute) because of the growing number of blogs and users in the database. So I updated my code to adopt the new interface and to lower the number of API requests per minute. The crawler was stable from then on.
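One common way to keep a crawler under a per-minute quota is a sliding-window limiter, sketched below. This is a generic technique offered as an illustration, not a description of how the original code reduced its request count; `createLimiter` and `tryAcquire` are hypothetical names, and the caller passes in the current time in milliseconds.

```javascript
// Hypothetical sliding-window rate limiter: grants at most `limit` requests
// in any `windowMs` interval. The caller supplies the current time so the
// logic stays deterministic and easy to reason about.
function createLimiter(limit, windowMs) {
  const stamps = []; // timestamps of granted requests, oldest first
  return function tryAcquire(now) {
    // Forget grants that have fallen out of the window.
    while (stamps.length > 0 && now - stamps[0] >= windowMs) {
      stamps.shift();
    }
    if (stamps.length >= limit) return false; // over quota: skip or retry later
    stamps.push(now);
    return true;
  };
}
```

A crawler might create the limiter with some headroom below the documented quota, for example `createLimiter(900, 60 * 1000)` against a limit of 1000 per minute, and call `tryAcquire(Date.now())` before each API request, postponing the request whenever it returns `false`.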

I faced a dilemma over whether or not to let many people know about the site. I knew that the more people visited the site, the sooner it would be sniffed out by the government and blocked. So I only shared the site with some of my friends. Initially, there were only about 10 to 20 visits per day. But a month later, the visits hit 80 or more on some days, and I had dozens of email subscribers.

And then, as I had expected, the morning came when I found my site was blocked in China. It had lasted about three months. To reach the site after that, users had to go through a VPN or tunneling service. This is impractical for most Chinese internet users.

However, that same day I was relieved and pleased to find that another site, freeweibo.com, provides exactly the same service and is more sophisticated than what I built. The freeweibo.com project is very resourceful. It is active on social media, and provides different means to access the content, like RSS feeds, email subscription, and mirror sites for domestic users. It even has a mobile app! I don’t know who built the site, but I’m glad we share the same vision.

Conclusion

Based on the circumstances, it was obvious that my site was not very useful anymore, and I closed it several months later.

Despite the outcome, I don’t feel like the project was in vain. On the contrary, it was a marvelous experience, even though the site only survived for a few months. It helped me to deeply appreciate the reality in my country.

In China, to run an internet business, you have to be very cautious about censorship, or you will get into trouble sooner or later. There is barely any way for social media sites to succeed unless they comply with the strict censorship rules and compromise on their users’ privacy.

(This article was originally posted on toptal.com and reposted with the permission of Toptal and the author.)