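The arrival of big data and its continued growth appear to be an undisputed fact, but how to manage data that is large in volume, rapidly generated, and diverse remains a major frontier problem, still at the stage where "a hundred schools of thought contend". It is widely believed that cloud computing should be the basic computing platform for managing big data. We call the use of cloud computing to manage big data a "data cloud". The great success of traditional relational databases can be attributed in large part to the relational model proposed by Turing Award winner E. F. Codd, and we believe that the success of the data cloud likewise requires one (or more) suitable abstraction layers. At the machine level, the data cloud must fully exploit the horizontal scalability of cloud computing while overcoming the obstacles of its "shared-nothing" nature and high failure rate. At the user level, the data cloud must provide a good user experience, including support for interactive analysis. We will analyze several concepts that have already been proposed and used, explore their potential as abstraction layers for the data cloud, and attempt to recommend several possible research directions.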
There seems to be universal consensus that big data is here to stay and is growing fast. What is still being debated is how to manage big data with its three V properties, namely Volume, Velocity, and Variety. One point, however, is commonly agreed upon: cloud computing is the right computing platform for managing big data. We call a system that manages big data on the cloud a "data cloud". I will argue that the data cloud is still in its infancy, mainly because it lacks the right abstraction, one that serves it as the traditional relational model did for managing business data on a centralized server. At the low level, the data cloud needs to leverage cloud elasticity while overcoming the "shared-nothing" restriction and high failure rate in order to deliver high efficiency. At the high level, the data cloud should deliver user friendliness, including interactive analysis capability. I will look into a number of different abstractions used by the community and propose some tentative ones that I believe are worth further research.