Have you ever used the short URL service? If we post a message with a URL in a microblog, the microblog will convert the URL inside into a shorter URL. If we visit this short URL, it is the same as visiting the original URL.
The short URL service is straightforward: it converts a long URL into a short one. When the user clicks on the short URL, the short URL service will redirect the browser to the original URL. How does this process work? The browser will first visit the short URL service, get the original URL through the short URL, and then visit the page through the original URL.
How to generate short URLs via hashing?
Hashing algorithm can transform a string, no matter how long, into a fixed hash value length. We can use the hash algorithm to generate short URLs. We have already mentioned some hashing algorithms, such as MD5, SHA, etc. However, we do not need these complex hashing algorithms. In the problem of generating short URLs, after all, we do not need to consider the difficulty of reverse decryption, so we only need to care about the computational speed and conflict probability of the hash algorithm. Many hashing algorithms can meet such requirements, and one of the more famous and widely used hashing algorithms is the MurmurHash algorithm. Although this hashing algorithm was only invented in 2008, it is now widely used in Redis, MemCache, Cassandra, HBase, Lucene and much other famous software. To generate the short URLs as short as possible, we can choose the 32bits hash. For url https://jiahuanglin.xyz/posts/how-to-design-a-key-value-storage-system/, the MurmurHash hash is 2509295521, and the short URL can just be https://tiny.com/2509295521
However, as you may have seen, the short URL obtained by MurmurHash is still long and seems to be in a different format than the one we started with. But don’t worry, we can easily make the short URLs shorter by changing the representation of the hash value just a little. We can convert the hash value of 10 into a higher binary hash value, so the hash value becomes shorter. As we know, in hexadecimal, we use A to F
to represent 10 to 15
. In the URL, 62 characters like 0 to 9, a to z, A to Z
, commonly used as legal characters. To make the hash value as short as possible, we can convert the hash value from 10 to 62
. I’ve written the details of the calculation process here. The final short URL expressed in 62 notation is http://t.cn/cgSqq.
Hash conflicts
How to solve the hash conflict problem? However, as we mentioned earlier, one of the problems that hash algorithms cannot avoid is hash conflicts
. Once the conflict occurs, it will result in two original URLs being transformed into the same short URL. When the user visits the short URL, we cannot determine which original URL the user wants to visit. So how to solve this problem? In general, we store the correspondence between the short URL and the original URL so that subsequent users can find the original URL according to the correspondence when they visit the short URL. There are many ways to store this correspondence, such as designing our storage system or using an off-the-shelf database.
Let’s take MySQL as an example. Suppose that the correspondence between the short URL and the original URL is stored in the MySQL database.
- When a new original URL needs to be generated as a short URL, we first use MurmurHash algorithm to generate the short URL.
- Then, we take the newly generated short URL and look it up in the MySQL database. If no identical short URL is found, the newly generated short URL has no conflict. Then we return the short URL to the user (the user who requested the short URL) and store the correspondence between the short URL and the original URL in the MySQL database.
- If we find the same short URL in the database, it does not necessarily mean that there is a conflict.
- If the original URL in the database is the same as the original URL we are working on now, it means that someone has already requested the short URL of the original URL. We can then take the short URL and use it directly.
- If the original URL recorded in the database is not the same as the original URL we are processing, there is a conflict in the hash algorithm. Different original URLs, after calculation, get duplicate short URLs. We
So what should we do if there’s a conflict? We can append a string of special characters to the original URL, such as “[DUPLICATED]”, and then recalculate the hash value, the probability that both hash calculations conflict is obviously very low. Suppose there is a very extreme case, and there is a conflict, we can append another string, such as “[OHMYGOD]”, and then calculate the hash value. Then the calculated hash value is stored in the MySQL database together with the text of the original URL after appending the special string. When the user accesses the short URL, the short URL service first looks up the corresponding original URL in the database through the short URL. Suppose the original URL has appended special characters (which can be easily found by string matching algorithm). In that case, we will remove the special characters first, and then return the original URL without special characters to the browser.
Performance optimization for database queries
To determine whether the generated short URLs are conflicting, we need to take the generated short URLs and look them up in the database. If there is a lot of data stored in the database, the search will be very slow, which will affect the performance of the short URL service. So is there any means of optimization?
Yes, we can add a B+ tree index to the short URL field. This way, the speed of querying the original URL through the short URL increases significantly. In real software development, there is a trick we can use to increase the speed even further. In the process of short URL generation, we will deal with the database twice, that is, we will execute two SQL statements.
- The first SQL statement queries the correspondence between the short URL and the original URL through the short URL
- The second SQL statement stores the correspondence between the newly generated short URL and the original URL into the database.
Generally, the database and the application service are deployed on two separate servers or virtual servers. Then the execution of two SQL statements requires two network communications. This IO communication time consuming and SQL statement execution is the performance bottleneck of the short URL service. Therefore, we need to minimize the number of SQL statements to improve performance.
How can we reduce the SQL statements? We can add a unique index to the short URL field in the database (not only the index, but also the requirement that there should be no duplicate data in the table). When there is a new original URL to generate a short URL, we do not firstly take the generated short URL and check for the duplication in the database, instead we directly store the generated short URL and the corresponding original URL into the database. If the database can write the data normally, it means that there is no violation of the unique index, that is, this new generated short URL does not conflict. Of course, if the database gives us an exception about the violation of the unique index, then we have to execute the query and write
process again, and the number of SQL statements will increase instead of decrease. However, in most cases, when we insert the newly generated short URL and the corresponding original URL into the database, there is no conflict. So, in most cases, we only need to execute one written SQL statement. So, overall, the total number of SQL statements executed will be greatly reduced.
There is another way to optimize the number of SQL statements with the help of Bloom filters. We take the short URLs we have generated and build them into Bloom filters. We take the short URLs that have been generated and build them into Bloom filters. We know that Bloom filters are a relatively memory-efficient storage structure, and a Bloom filter of 1 billion in length requires only about 125MB of memory space. When a new short URL is generated, we first take the newly generated short URL and look it up in the Bloom filter. If the search result is no, then the newly generated short URL is not conflicting. At this time, we need to execute the SQL statement that writes the short URL and the corresponding original web page again. Therefore, the total number of SQL statements executed is reduced by querying the Bloom filter first.
How to generate short URLs via ID generator?
We can maintain an ID self-incrementing generator. When the short URL service receives a request to convert a raw URL into a short URL, it first takes a number from the ID generator, converts it into a 62-entry representation, and splices it into the domain name of the short URL service (e.g., http://t.cn/) to form the final short URL. Finally, we still store the generated short and original URLs in the database. The theory is very simple to understand.
However, there are 2 problems to deal with here:
- The same original URL may correspond to a different short URL.
- Every time a new original URL comes, we generate a new short URL. This practice will lead to two identical original URL generating a different short URL. How should this be handled? We have two processing ideas.
- Not to deal with it. It sounds nonsensical, but I’ll explain. The same original URL corresponds to a different short URL, which is acceptable to users. In most short URL application scenarios, the user only cares whether the short URL can correctly jump to the original URL. As for what the short URL looks like, he does not care. Therefore, even if the original URL is the same, the short URL generated twice is not the same, which does not affect the user’s use.
- Processing idea of generating short URLs with the help of hashing algorithm, when we want to generate a short URL to the original URL, we have to take the original URL in the database to see if the same original URL already exists in the database. If the database exists, we will take out the corresponding short URL and return it to the user directly. However, there is a problem with this processing idea, and we need to add indexes to both the short URL and the original URL fields in the database. The index on the short URL is to improve the user’s speed to query the short URL corresponding to the original web page, and the index on the original URL is to speed up the short URL query through the original URL just mentioned. Although this solution can meet the demand of “the same original URL corresponds to the same short URL”, there is a cost: on the one hand, the two indexes will occupy more storage space, and on the other hand, the indexes will lead to the degradation of the performance of insertion and deletion operations.
- Every time a new original URL comes, we generate a new short URL. This practice will lead to two identical original URL generating a different short URL. How should this be handled? We have two processing ideas.
- Performance issues.
- There are many ways to implement ID generators, such as database self-incrementing fields. Of course, we can also maintain a counter and keep adding one and one. However, a counter to cope with frequent short URL generation requests is obviously a bit overwhelming since the counter must ensure that the generated IDs are not duplicated, which generally means that locks are required. So how can the performance of the ID generator be improved?
- We can load the ID generator with multiple predecessors. The ID generator will send ID numbers to each of the predecessors in bulk. When we receive a request for short URL generation, we select a predecessors to fetch the number. This significantly increases the concurrent numbering capability by having multiple front senders.
- Instead of using an architecture with one ID generator and multiple predecessors, we directly implement multiple ID generators serving simultaneously. To ensure that the IDs generated by each ID generator are not duplicated. We require each ID generator to follow certain rules to generate ID numbers. For example, the first ID generator can only generate IDs with the last number 0, the second ID generator can only generate IDs with the last number 1, and so on. This also improves the efficiency of ID generation by having multiple ID generators working simultaneously.
- There are many ways to implement ID generators, such as database self-incrementing fields. Of course, we can also maintain a counter and keep adding one and one. However, a counter to cope with frequent short URL generation requests is obviously a bit overwhelming since the counter must ensure that the generated IDs are not duplicated, which generally means that locks are required. So how can the performance of the ID generator be improved?