In my years of work, one place where I've repeatedly seen inefficiencies
creep in is the use of LIKE queries. Most people don't know how
LIKE queries and indexes interact with each other.
I'm planning to put up a blog post that discusses this aspect in brief.
But for those of you who do,
you would know that most databases will use an index only for
LIKE queries of this form: column_name LIKE 'string%',
and not for something like column_name LIKE '%string',
so you can completely forget about an index being used for something
like column_name LIKE '%string%'.
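You can see this for yourself with SQLite's EXPLAIN QUERY PLAN. The sketch below (table and index names are made up for illustration) shows a prefix pattern being answered with an index search, while a leading wildcard forces a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE conn_log (endpoint TEXT, opened_at INTEGER)")
conn.execute("CREATE INDEX idx_endpoint ON conn_log (endpoint)")
# By default SQLite's LIKE is case-insensitive, which defeats a plain
# (BINARY-collated) index; make it case-sensitive so the prefix
# optimization can kick in.
conn.execute("PRAGMA case_sensitive_like = ON")

def plan(query):
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))

prefix = plan("SELECT * FROM conn_log WHERE endpoint LIKE 'dhruv%'")
suffix = plan("SELECT * FROM conn_log WHERE endpoint LIKE '%dhruv'")
print(prefix)  # a SEARCH using idx_endpoint
print(suffix)  # a full SCAN of conn_log
```

The prefix form works because 'dhruv%' can be rewritten as a range condition (endpoint >= 'dhruv' AND endpoint < 'dhruw'), which a B-tree index answers directly; no such rewrite exists once the wildcard leads.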
My goal is to be able to hint to the DB engine what type of LIKE query will be fired on a particular column, because I've observed that, in most cases, only one specific form of LIKE query is ever used on a given column. For example, suppose a column endpoint stores the TCP/IP endpoint of a connection as host:port, but as a string, and I wish to query all endpoints whose host contains dhruvbird. The LIKE query would be something like endpoint LIKE '%dhruvbird.%', since many entries such as www.dhruvbird.com:80 or mail.dhruvbird.com:20 may match.
If we know that the LIKE clause is always going to be of the form '%[a-zA-Z0-9]+\.%' then we can potentially hint the indexer to optimize LIKE queries of that form and try to build an index to speed up such queries. In this case, we could call the generated index a substring index.
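As a sketch of what such a substring index might look like, here is a toy trigram index in Python. The class and method names are mine, not any real engine's API, though PostgreSQL's pg_trgm extension is built on the same idea: every three-character gram of a stored value points at the rows containing it, and a '%needle%' lookup intersects the candidate sets before verifying actual containment.

```python
from collections import defaultdict

class SubstringIndex:
    """Toy substring index: maps every trigram of a value to the row ids
    containing it. (A sketch, not a production structure.)"""
    def __init__(self, n=3):
        self.n = n
        self.grams = defaultdict(set)
        self.rows = []

    def add(self, value):
        rid = len(self.rows)
        self.rows.append(value)
        for i in range(len(value) - self.n + 1):
            self.grams[value[i:i + self.n]].add(rid)

    def search(self, needle):
        # Intersect candidate sets for each trigram of the needle, then
        # verify: sharing all trigrams is necessary but not sufficient.
        keys = [needle[i:i + self.n] for i in range(len(needle) - self.n + 1)]
        candidates = set.intersection(*(self.grams.get(k, set()) for k in keys))
        return sorted(r for r in candidates if needle in self.rows[r])

idx = SubstringIndex()
for e in ["www.dhruvbird.com:80", "mail.dhruvbird.com:20", "example.org:443"]:
    idx.add(e)
print(idx.search("dhruvbird."))  # -> [0, 1]
```

The verification step matters: the trigram intersection only narrows the search to rows that contain all the needle's trigrams somewhere, so the final substring check filters out false positives.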
Data structures like the Fenwick Tree and the Segment Tree, which are routinely used to answer cumulative, aggregation, or min/max queries over a range of data, don't seem to have found their way into databases yet. This type of reporting is still done by cubes and OLAP-based systems. Maybe it's time databases got native support for such indexes.
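For reference, the Fenwick Tree that could back such a range-aggregate index is tiny. A minimal sketch, assuming point updates and range-sum queries over a hypothetical per-day sales column:

```python
class FenwickTree:
    """Binary indexed tree: point updates and prefix sums in O(log n)."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-based internally

    def update(self, i, delta):
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)  # climb to the next node covering index i

    def prefix_sum(self, i):
        i += 1
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)  # drop the lowest set bit
        return s

    def range_sum(self, lo, hi):
        return self.prefix_sum(hi) - (self.prefix_sum(lo - 1) if lo > 0 else 0)

ft = FenwickTree(8)
for day, sales in enumerate([5, 3, 7, 9, 12, 5, 2, 3]):
    ft.update(day, sales)
print(ft.range_sum(2, 5))  # 7 + 9 + 12 + 5 = 33
```

Both updates and range queries are logarithmic, which is exactly the trade-off an OLTP store would want for keeping such aggregates fresh as rows change.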
Column-oriented databases have opened up new paradigms for data storage and querying. Concepts like multi-valued attributes and polymorphic attributes have become commonplace. Do we need to re-think how querying and indexing should work for such structures?
Many a time, when reading an article or passage, I come across words that I don't entirely understand. On trying to find their meanings, it turns out that they mean different things depending on the context in which they are used.
For example, words like Java, Apple and Rancid may mean different things depending on the context in which they appear. This means that just entering the word into a search engine isn't enough. A search engine that considered the complete (or partial) context in which you encountered a word could answer your query more accurately. Such query-answering systems don't seem to exist currently, and it would be exciting to work on one.
When the wave of distributed computing hit us, we readily picked up existing techniques and applied them to distributed systems in the hope that they would just work. Now that some time has elapsed since that initial rush of experimentation, it may be time to re-think some of the older approaches in light of newer ones.
These days, browsers are becoming even more important, since everything is online and on the web: almost anything can be done there, and some applications run entirely in the browser. Yet the presentation technology in use dates from a few years ago. Even though the semantics are being updated continuously, the underlying implementation techniques are not progressing at the same rate. It might be time to look at data structures that let us apply CSS style-sheet rules more efficiently, so that our web pages load faster.
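One concrete direction is the rule-bucketing idea that engines like WebKit already use: hash style rules by their rightmost simple selector, so that matching an element consults only plausible candidate rules instead of every rule on the page. A highly simplified sketch (class and method names are made up, and selectors are reduced to single keys):

```python
from collections import defaultdict

class RuleSet:
    """Buckets CSS-like rules by selector kind so lookups only touch
    rules that could possibly match a given element."""
    def __init__(self):
        self.by_id = defaultdict(list)
        self.by_class = defaultdict(list)
        self.by_tag = defaultdict(list)

    def add(self, selector, declarations):
        if selector.startswith("#"):
            self.by_id[selector[1:]].append(declarations)
        elif selector.startswith("."):
            self.by_class[selector[1:]].append(declarations)
        else:
            self.by_tag[selector].append(declarations)

    def matching(self, tag, el_id=None, classes=()):
        # Only the buckets keyed by this element's tag, id and classes
        # are examined -- the rest of the stylesheet is never touched.
        rules = list(self.by_tag.get(tag, []))
        if el_id:
            rules += self.by_id.get(el_id, [])
        for c in classes:
            rules += self.by_class.get(c, [])
        return rules

rs = RuleSet()
rs.add("p", {"margin": "0"})
rs.add(".warning", {"color": "red"})
rs.add("#header", {"font-size": "2em"})
print(rs.matching("p", classes=["warning"]))
```

A real engine still has to run full selector matching (descendant combinators, pseudo-classes) on the candidates, but the buckets keep that expensive step off the common path.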
Hadoop is a great tool. I love it! In fact, I have used it to process so much data (so quickly) that I can't imagine not having it, especially given that it re-starts only failed tasks. Google, however, has switched from MapReduce to Caffeine for indexing web documents, since Caffeine processes pages incrementally, as they are crawled. This reduces the time between a document being crawled and it showing up in search results.
We have come a long way: from having nothing, to MapReduce, which had some form of job checkpointing, to Caffeine, which is incremental to the point of transacting at the document level. I am sure these ideas can be taken further, but it remains to be seen where.