- Jack Norris,
vice president, marketing, MapR Technologies (www.mapr.com), says:
With the Internet now
touching more than two billion people daily, every call, tweet, e-mail,
download, or purchase generates valuable data. There is also a wealth of
machine-generated data, such as log files, sensor data, video images, and genomic
data, that is growing at an even faster rate. Companies are increasingly
relying on Hadoop to unlock the hidden value of this rapidly expanding data and
to drive increased growth and profitability. A recent IDC study confirmed that
data is growing faster than Moore's Law. The implication of this growth rate is
that however you process data today, you will need a larger cluster to do it
tomorrow.
Put another way, the
speed of data growth has moved the bottleneck to the network: it now takes
longer to ship data across the network than it does to perform the analysis.
Hadoop represents a new paradigm for analyzing large amounts of data
effectively. This computing paradigm co-locates data and compute, so that
only the results are shared over the network.
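To make the compute-to-data model concrete, here is a minimal sketch of the classic MapReduce word-count job in Java; the input and output paths are placeholders, and the point is simply that each mapper runs on the node holding its input split while only the small per-word counts travel across the network.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs on the node that stores each input split; emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Receives the shuffled counts for each word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input already in the cluster
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    job.waitForCompletion(true);
  }
}
```

Packaged into a jar, a job like this is typically launched with hadoop jar against data that already lives in the cluster, so the bulk of the bytes never leave the nodes that store them.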
When beginning the
evaluation and selection of the various Hadoop distributions, organizations
need to understand the criteria that mean the most to their business or
activity. Key questions to ask include:
- How easy is it to use?
  - How easily does data move into and out of the cluster?
  - Can the cluster be easily shared across users, workloads, and geographies?
  - Can the cluster easily accommodate access, protection, and security while supporting large numbers of files?
- How dependable is the Hadoop cluster?
  - Can it be trusted for production and business-critical data?
  - How does the distribution help ensure business continuity?
  - Can the cluster recover data from user and application errors?
  - Can data be mirrored between different clusters?
- How does it perform?
  - Is processing limited to batch applications?
  - Does the NameNode create a performance bottleneck?
  - Does the system use hardware efficiently?
In order for Hadoop to
be effective for a broad group of users and workloads, it must be easy to use,
provision, operate and manage at scale. It should be easy to move data into and
out of the cluster, provision cluster resources, and manage even very large
Hadoop clusters with a small staff. It is advisable to look for real-time
read/write data access via industry-standard file protocols such as NFS.
This will make it dramatically easier to get data into and out of Hadoop
without requiring special connectors. Most Hadoop distributions are also
limited by the write-once Hadoop Distributed File System (HDFS). Like a
conventional CD-ROM, HDFS prevents files from being modified once they have
been written; data can only be appended to an existing file, and the file must
be closed before new updates can be read.
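As a rough illustration of those write-once semantics, the following sketch uses the Hadoop FileSystem API; the cluster URI and file path are placeholders, and whether append is even available depends on the distribution and version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), conf);

    Path file = new Path("/data/events.log");

    // Create and write the file; once closed it cannot be modified in place.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("first batch of records\n");
    }

    // The only way to add data is to append to the end of the closed file
    // (and only where the distribution/version enables append at all).
    try (FSDataOutputStream out = fs.append(file)) {
      out.writeBytes("later records\n");
    }

    // There is no call to overwrite a byte range inside an existing file;
    // random writes require rewriting the whole file.
    fs.close();
  }
}
```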
As data analysis needs
grow, so does the need to manage and utilize expensive cluster resources
effectively. Organizations often want multiple data sources and applications to
share the same Hadoop cluster, along with ways to segment that cluster by user
group, project, or division. The ability to separate a physical cluster into
multiple logical Hadoop clusters is therefore valuable, and a distribution
should also be designed to work with multiple clusters and multi-cluster
management. Finally, it is critical to look for simple installation,
provisioning, and manageability.
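One common mechanism for that kind of segmentation is scheduler queues. The sketch below assumes an administrator has already defined an "analytics" queue in the cluster's scheduler configuration, and simply routes a job to it via the mapreduce.job.queuename property (older releases use mapred.job.queue.name); the rest of the job setup is elided.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmissionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Route this job to the "analytics" queue instead of the default queue,
    // so one division's workload is capped by that queue's share of the cluster.
    conf.set("mapreduce.job.queuename", "analytics");

    Job job = Job.getInstance(conf, "analytics report");
    // ... set mapper, reducer, and input/output paths as usual ...
    job.waitForCompletion(true);
  }
}
```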
Data processing
demands are becoming increasingly critical, and these demands require the
selection of a distribution that provides enterprise-class reliability and data
protection. One area of concern is the single points of failure that exist,
particularly in the NameNode and JobTracker functions. There is no high
availability (HA) available today in Apache Hadoop. While some HA capabilities
are expected in the next major release of Hadoop, they cover only a single
failover of the NameNode; there is no failback capability and no protection
against multiple NameNode failures.
Hadoop provides
replication to protect against data loss, but for many applications and data
sources, snapshots are required to provide point-in-time recovery and protect
against end-user and application errors. Full business continuity features,
including remote mirroring, are also required in many environments to meet
recovery time objectives across data centers.
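Where a distribution does expose snapshots through the Hadoop FileSystem API, point-in-time recovery looks roughly like the sketch below; the /data directory, snapshot name, and file names are purely illustrative, and an administrator must first allow snapshots on the directory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SnapshotRecoveryDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path dataDir = new Path("/data");

    // Take a named point-in-time snapshot before a risky job runs.
    fs.createSnapshot(dataDir, "before-nightly-load");

    // ... a buggy job overwrites or deletes files under /data ...

    // Recover a damaged file by copying it back out of the snapshot.
    Path damaged = new Path("/data/customers.csv");
    Path saved   = new Path("/data/.snapshot/before-nightly-load/customers.csv");
    FileUtil.copy(fs, saved, fs, damaged,
        false /* deleteSource */, true /* overwrite */, fs.getConf());
  }
}
```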
Data center computing
is going through one of the largest paradigm shifts in decades. Are you ready
for the change? Are you ready for Hadoop?

