Hadoop, Big Data and Small Businesses

Henry Newman

Updated · Aug 02, 2011

Hadoop, HDFS and the MapReduce programming model are becoming as popular as searching for celebrity gossip, and this surge in interest says a lot about the changing nature of enterprise infrastructure and of data and application requirements.

We all know that search engines and databases have completely different requirements. With most databases, you have a single persistent copy of the data that is backed up and can be restored. With search engines (Google's MapReduce technology is the basis of Hadoop), much of the data is transient and can simply be re-collected rather than restored.

There is explosive growth for search engines, be they open source or commercial, that index unstructured data as well as structured data (which is far easier to index). Searching for, say, Henry Newman on Google brings up Cardinal John Henry Newman, whom I had never heard of until I started searching for some of my articles. If you search for Henry Newman Storage, you will get me and nothing about the Cardinal, as he wasn't much of a storage geek. Combing through 4 million distributed files in two-fifths of a second to come up with answers like that is what makes MapReduce a technology that enterprises are keen to harness.
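To make the MapReduce idea concrete, here is a toy, purely in-memory sketch of the map, shuffle and reduce phases building a tiny inverted index. It is plain Python rather than Hadoop itself, and the file names and contents are invented for illustration:

    from collections import defaultdict

    # Toy corpus standing in for files spread across many servers;
    # the file names and contents are invented for illustration.
    documents = {
        "doc1.txt": "henry newman storage hadoop",
        "doc2.txt": "cardinal john henry newman",
        "doc3.txt": "hadoop hdfs mapreduce storage",
    }

    def map_phase(doc_id, text):
        # Map: emit a (word, doc_id) pair for every word in one document.
        for word in text.split():
            yield word, doc_id

    def shuffle(pairs):
        # Shuffle: group every value emitted under the same key.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(word, doc_ids):
        # Reduce: collapse the grouped values into one posting list.
        return word, sorted(set(doc_ids))

    # Run the phases sequentially; a real framework runs the map and
    # reduce calls in parallel across many nodes and disks.
    mapped = (pair for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text))
    index = dict(reduce_phase(w, ids) for w, ids in shuffle(mapped).items())

    print(index["newman"])   # ['doc1.txt', 'doc2.txt']
    print(index["storage"])  # ['doc1.txt', 'doc3.txt']

The point of the model is that each of those phases can run on whichever machine already holds the data, which is exactly why the architecture described below pairs compute with local disks.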

Big Search Infrastructures

But developing an infrastructure for indexing and searching large amounts of unstructured data requires significant computational power and bandwidth to storage. This is why, for example, a 1U server with a single disk drive has been the most common architecture for this type of problem. Take 128 1U servers, each with its own disk drive, and you get about 10 GB/sec of aggregate I/O using current disk drives. If you attempted to get that kind of storage performance from an external RAID controller framework, it would cost tens of thousands of dollars, and given how RAID controllers work, you would likely need more than 128 disk drives because of contention.
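The arithmetic behind that 10 GB/sec figure is straightforward; the per-drive streaming rate below (about 80 MB/sec) is an assumed figure for a current drive, used only to show how the aggregate adds up:

    # Rough aggregate-bandwidth arithmetic for the one-drive-per-1U-server design.
    # The per-drive rate is an assumed figure, not a measurement.
    servers = 128                # one 1U server per disk drive
    per_drive_mb_s = 80          # assumed sustained streaming rate per drive

    aggregate_gb_s = servers * per_drive_mb_s / 1000.0
    print(f"{servers} drives x {per_drive_mb_s} MB/s ~= {aggregate_gb_s:.1f} GB/s")
    # -> 128 drives x 80 MB/s ~= 10.2 GB/s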

With RAID you get data protection, but that comes at a cost. I would estimate that matching the performance of single direct-attached drives with RAID would take about 2.5 times as many disk drives, plus the cost of the controllers and the storage network. This is why I believe most search engine architectures achieve reliability by replicating the whole system rather than by using a few servers with high-reliability storage architectures.
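Taking that 2.5x estimate at face value, the back-of-the-envelope drive-count comparison looks like the sketch below; the factor is the estimate above, not a benchmark result:

    # Back-of-the-envelope drive-count comparison using the 2.5x estimate above.
    direct_attached_drives = 128
    raid_overhead_factor = 2.5   # estimated factor to match single-drive performance

    raid_drives_needed = int(direct_attached_drives * raid_overhead_factor)
    print(f"Direct-attached: {direct_attached_drives} drives")
    print(f"RAID at the same performance: ~{raid_drives_needed} drives, "
          f"plus controllers and a storage network")
    # -> RAID at the same performance: ~320 drives, plus controllers and a storage network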

Of course, there are other costs: power and cooling, additional rack and floor space, cabling, and people. I have not seen a reasonable study that looks at all of these issues for a specific level of performance. It would be a very complex study, but it would do the industry a great service to prove one way or the other which method is cheaper for the same level of reliability and performance.

For now, big search engine requirements are going to be dominated by architectures that replicate the data, and the important thing to realize is that big search and indexing is the future for most enterprise and small and mid-sized business (SMB) environments, whether the tools are open source or commercial. Enterprise search is at this point pretty well understood: the architectures and methods are mature regardless of which product you choose. The same cannot be said for the SMB world.

The Future of SMB Search

Currently, SMB environments have no easy way to search their data across platforms without moving applications and data into the cloud, where vendors charge them for the right to search their own data. I think the world really needs an SMB search appliance that combines storage, backup/restore, and search. For now, SMBs must choose between moving their data to a cloud that provides indexing and search capability, rolling their own search engine, or doing nothing.
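To show what "rolling your own" means at its absolute smallest, here is a single-machine sketch that walks a folder, indexes plain-text files, and skips anything matching an exclude list. The paths and exclude patterns are invented, and a real SMB deployment would also need PDF and zip extraction, access control, and distribution across every machine:

    import os
    import re
    from collections import defaultdict

    # Patterns for files that should never be indexed (e.g. HR material);
    # these are invented examples, not a real policy.
    EXCLUDE = [re.compile(p, re.IGNORECASE) for p in (r"review", r"salary")]

    def build_index(root):
        # Walk a directory tree and build a word -> set-of-paths inverted
        # index over plain-text files, skipping anything in EXCLUDE.
        index = defaultdict(set)
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if any(p.search(path) for p in EXCLUDE):
                    continue  # policy decision: leave sensitive files unindexed
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        for word in re.findall(r"[a-z0-9]+", f.read().lower()):
                            index[word].add(path)
                except OSError:
                    continue  # unreadable file; skip it
        return index

    if __name__ == "__main__":
        idx = build_index(os.path.expanduser("~/reports"))  # hypothetical folder
        for hit in sorted(idx.get("hadoop", [])):
            print(hit)

Even this toy makes the problem clear: the indexing itself is the easy part. Deciding what to exclude, extracting text from formats like PDF and zip, and doing all of it across every employee's machine is where the real work, and the case for an appliance, lies.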

None of these, in my opinion, is the right answer. Search engines for the masses let me run Google Desktop and search through PDFs, zip files, and most other non-encrypted file formats on my laptop. But if I want to search across all of the laptops in my company for a report we did a few years ago that I have misplaced, that isn't possible without centralizing all of our laptop data. Let's say a backup application did that; how do I prevent, say, the employee reviews in my laptop backup from being indexed for all to see?

One of the reasons people in small businesses do not want all of their data indexed is that some of it is not meant for company-wide distribution. And if you are a public company or in the medical field, the possibility of violating Sarbanes-Oxley or HIPAA regulations, and the resulting liability, is scary.

On the other hand, I want everyone in my company to be able to search all of the reports that I have on my laptop. Right now there are no good answers for SMBs to this type of problem, as such companies do not have a dedicated IT staff to move over the right data and leave out the wrong data. Rumor has it that there are appliances in our future that will address this problem, but they still require every employee to decide what others can see. What is really needed is a multilevel security (MLS) file system such as exists under SELinux, but that is another story for another day. Suffice it to say, we're a long way from an ideal small business search and indexing solution.

Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 30 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.
