Only two days ago the contact messaging application Twitter suffered another bout of downtime, leaving some users frustrated and others asking why the platform continues to suffer problems.
TechCrunch recently spoke to an individual who is familiar with the technical problems at Twitter as well as the challenges that lie ahead for the startup. He reiterated his belief that the problems lie not with Blaine Cook (the former head of engineering who was shown the door), nor with NTT (their host), but with an early failure to understand how complex their problems would be.
The issue is that group messaging is very difficult to achieve at large scale. Other large sites such as WordPress and Digg are mostly dealing with known problems, such as how to serve a large number of pages or a large number of images. Twitter is unique in that it needs to parse a large number of messages and deliver them to multiple recipients, with each user having unique connections to other users.
Social networks have similar complexity issues, but they usually only need to route a message to a single user (or at most to a defined group). Even so, social networks like Friendster struggled for years with technical and scaling issues. Twitter deals specifically with text messages, and for active users those messages are very frequent and go out to hundreds of contacts (or followers, as Twitter calls them). Every new Twitter user and every new connection adds to the computational requirement, because each message must be delivered to every one of its author’s followers.
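To make the delivery cost concrete, here is a minimal Python sketch of fan-out on write, where each posted message is copied into every follower's inbox. The names (`followers`, `inboxes`, `follow`, `post`) and the in-memory model are hypothetical illustrations, not a description of Twitter's actual architecture.

```python
from collections import defaultdict

# Hypothetical in-memory model: who follows whom, and one inbox per user.
followers = defaultdict(set)   # author -> set of users following that author
inboxes = defaultdict(list)    # user -> messages delivered to that user


def follow(follower, author):
    """Record that `follower` wants to receive `author`'s messages."""
    followers[author].add(follower)


def post(author, text):
    """Deliver one message to every follower; return the number of deliveries."""
    for user in followers[author]:
        inboxes[user].append((author, text))
    return len(followers[author])


follow("bob", "alice")
follow("carol", "alice")
follow("alice", "bob")

# One write per (message, follower) pair.
deliveries = post("alice", "hello") + post("bob", "hi")
print(deliveries)
```

In this sketch the work done per message is proportional to the author's follower count, so frequent posters with many followers dominate the load.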
Some of the best web applications are able to efficiently solve very complex problems to produce simple results for users (e.g., Google). The success of these applications is due to the innovative efforts of developers to solve large technical challenges, where they have often had to break new ground for solutions. For Twitter to reach a similar point of reliability, they too will need a very comprehensive, ground-breaking solution.
The source that I spoke to also commented on how ill-prepared the Twitter team were and are for their current and future challenges. The small team contains a handful of engineers, with only a person or two committed to infrastructure and architecture. He goes on to point out that at Digg the team for network and systems alone is bigger than the total engineering team at Twitter, and that at Digg they are led by well-known “A-list rockstars”.
The problems at Twitter are often attributed to their use of Ruby on Rails, a web development framework. Twitter is almost certainly the largest site running on Rails, so fans of the framework and its developers have been quick to deflect the criticism and point it back at the engineers at Twitter. Utilizing a framework that has never conquered large-scale territory must certainly add to the risk and work required to find a solution. As an out-of-the-box framework, Rails certainly doesn’t lend itself to large-scale application development.
Rails enabled Twitter to be developed quickly, to get to launch quickly and then to add new features relatively rapidly as well. But the old adage of “Good, Fast, Cheap – pick two” certainly applies, and Rails would do itself no harm by conceding that it isn’t a platform that can compete with Java or C when it comes to intensive tasks. Twitter is at a crossroads as an application and Rails has served its purpose very well to date, but you are unlikely to see a computational cluster built with Ruby at Apache any time soon.
What we see at Twitter today is a very useful and popular service, but one with very complex underlying technical challenges to overcome. Twitter will require not only a new architectural approach and a big injection of the best minds they can find ($15 million can help), but will also need a little patience from users and those of us observing.
Posted in Internet, Programming, Web 2.0 | 1 Comment »
Google will stop at nothing in its quest to index the world’s information. Last year it ate through 100 exabytes of data, but there’s still a lot that it can’t get access to. Known as the deep web (or hidden web, or invisible web, etc.), the majority of online data is estimated to be hidden safely from Google’s prying eyes: private intranets, unlinked pages, some non-textual content and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.
“For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made,” explained Jayant Madhavan and Alon Halevy in a blog post. “If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
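The URL-generation step described in the quote can be sketched roughly as follows: pick candidate values for each input, then build one GET URL per combination. This is only an illustration under stated assumptions. The function `candidate_urls`, the `limit` parameter, the form fields and the example.com address are all hypothetical, and nothing here reflects how Googlebot is actually implemented.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=10):
    """Build up to `limit` GET URLs, one per combination of field values.

    `fields` maps each input name to the list of values to try for it.
    """
    names = list(fields)
    combos = islice(product(*(fields[name] for name in names)), limit)
    return [f"{action}?{urlencode(dict(zip(names, combo)))}" for combo in combos]


# Hypothetical form: a text box seeded with words found on the site,
# plus a select menu whose options come from the HTML.
form_fields = {
    "q": ["widgets", "gadgets"],
    "category": ["tools", "toys"],
}
urls = candidate_urls("http://example.com/search", form_fields)
for url in urls:
    print(url)
```

The cap on combinations matters because the number of possible queries multiplies with every additional input.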
Google, which says that the crawling of dynamic form results doesn’t affect the “crawling, ranking, or selection of other web pages in any significant way,” also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won’t be crawled.
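For webmasters who want to see how such a rule behaves, Python's standard `urllib.robotparser` module can evaluate a robots.txt file against a URL. The robots.txt content and the URLs below are made up for illustration. The sketch shows generic robots.txt matching, not Googlebot's own logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids crawling the site's search form results.
robots_txt = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A form whose action is outside the disallowed path may be crawled...
allowed_form = parser.can_fetch("Googlebot", "http://example.com/catalog?category=tools")
# ...while any URL generated from the forbidden form may not.
blocked_form = parser.can_fetch("Googlebot", "http://example.com/search?q=widgets")
print(allowed_form, blocked_form)
```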
It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never — and should never — get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.
It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That’s mildly disappointing as we were looking forward to befriending Googlebot on MySpace…
Posted in Google, Internet, Programming | 2 Comments »
Companies can now go ahead and fire their expensive database administrators, those engineers who keep the Oracle or IBM databases humming. Amazon has just added an enterprise-class database called SimpleDB to its suite of cloud-based IT infrastructure, which also includes storage (S3) and computation (EC2) available by the drink. Today, Amazon is taking sign-ups for the SimpleDB beta, which should start in a few weeks. As it points out on the new SimpleDB page:
Amazon SimpleDB is a web service for running queries on structured data in real time. This service works in close conjunction with Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Compute Cloud (Amazon EC2), collectively providing the ability to store, process and query data sets in the cloud. These services are designed to make web-scale computing easier and more cost-effective for developers.
Traditionally, this type of functionality has been accomplished with a clustered relational database that requires a sizable upfront investment, brings more complexity than is typically needed, and often requires a DBA to maintain and administer. In contrast, Amazon SimpleDB is easy to use and provides the core functionality of a database – real-time lookup and simple querying of structured data – without the operational complexity. Amazon SimpleDB requires no schema, automatically indexes your data and provides a simple API for storage and access. This eliminates the administrative burden of data modeling, index maintenance, and performance tuning. Developers gain access to this functionality within Amazon’s proven computing environment, are able to scale instantly, and pay only for what they use.
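As a rough illustration of what "no schema" and "automatically indexes your data" can mean in practice, the toy Python class below stores arbitrary attributes per item and indexes every attribute as it is written. `TinyStore` and its methods are hypothetical and are not the SimpleDB API. The sketch only demonstrates the general idea.

```python
from collections import defaultdict


class TinyStore:
    """Toy schema-free store that indexes every attribute automatically."""

    def __init__(self):
        self.items = {}                 # item name -> {attribute: value}
        self.index = defaultdict(set)   # (attribute, value) -> item names

    def put(self, name, **attributes):
        """Store attributes for an item; no schema is declared up front."""
        item = self.items.setdefault(name, {})
        for attr, value in attributes.items():
            if attr in item:
                # Drop the stale index entry before overwriting the value.
                self.index[(attr, item[attr])].discard(name)
            item[attr] = value
            self.index[(attr, value)].add(name)

    def query(self, attr, value):
        """Return the names of items whose attribute equals the value."""
        return sorted(self.index.get((attr, value), ()))


store = TinyStore()
store.put("item1", color="red", size="large")
store.put("item2", color="red", price="9.99")   # different attributes, same store
store.put("item1", color="blue")                # overwriting re-indexes the item

print(store.query("color", "red"))
```

Because every write updates the index, lookups by any attribute need no separate index definition, which is the administrative burden the quoted passage says is removed.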
This will be especially attractive for Web startups. Amazon has just taken another major infrastructure cost off the table for them. Relational databases are expensive to buy and maintain. Whatever features or performance SimpleDB lacks, it should make up for in price. Amazon wants to democratize the database by making it available to more businesses, and even individuals, thus leveling the playing field between big companies and startups even more.
And since SimpleDB operates at Web scale, larger companies will wake up to the cost-saving opportunities of such a service as well. IBM, for one, is already trying to preempt any customer defections with its copycat Blue Cloud initiative. If speed is of the essence, you might still want to keep your database on your own servers. But the Web is where most software will one day live, whether consumer or enterprise. And Amazon’s got nothing to lose by speeding that day along.
Pricing for SimpleDB is as follows:
Machine Utilization – $0.14 per Amazon SimpleDB Machine Hour consumed
Data Transfer:
  $0.10 per GB – all data transfer in
  $0.18 per GB – first 10 TB / month data transfer out
  $0.16 per GB – next 40 TB / month data transfer out
  $0.13 per GB – data transfer out / month over 50 TB
Data transfer “in” and “out” refers to transfer into and out of Amazon SimpleDB. Data transferred between Amazon SimpleDB and other Amazon Web Services is free of charge (i.e., $0.00 per GB).
Structured Data Storage – $1.50 per GB-month
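To see how these rates combine, here is a small Python estimator built only from the prices listed above. The function names, the sample workload and the choice of 1 TB = 1024 GB are assumptions made for illustration, since the price list does not say which definition of a terabyte applies.

```python
GB_PER_TB = 1024  # assumption: the price list does not say whether 1 TB is 1000 or 1024 GB

# (tier size in GB, price per GB); None marks the final open-ended tier.
TRANSFER_OUT_TIERS = [
    (10 * GB_PER_TB, 0.18),
    (40 * GB_PER_TB, 0.16),
    (None, 0.13),
]


def transfer_out_cost(gb):
    """Charge each slice of outbound traffic at its own tier's rate."""
    cost, remaining = 0.0, gb
    for size, rate in TRANSFER_OUT_TIERS:
        billed = remaining if size is None else min(remaining, size)
        cost += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return cost


def monthly_cost(machine_hours, gb_in, gb_out, gb_stored):
    """Estimate one month's bill from the four listed price components."""
    return round(
        machine_hours * 0.14
        + gb_in * 0.10
        + transfer_out_cost(gb_out)
        + gb_stored * 1.50,
        2,
    )


# Hypothetical workload: 100 machine hours, 50 GB in, 200 GB out, 10 GB stored.
estimate = monthly_cost(100, 50, 200, 10)
print(estimate)
```

For this sample workload the components are $14.00 for machine hours, $5.00 for inbound transfer, $36.00 for outbound transfer and $15.00 for storage, or $70.00 in total.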
Posted in Internet, Programming, Software | No Comments »