Notes on "Big Data Glossary"
While going through old notes during the Qingming holiday I found some scattered NoSQL-related material from reading "Big Data Glossary". I've tidied it up a little and am recording it here.
Horizontal or Vertical Scaling
Databases can scale in two directions:
Vertical scaling: move to a more powerful machine
Horizontal scaling: add more machines of the same kind
If you choose horizontal scaling, an unavoidable question is how to decide which machine a given piece of data lives on. That is the sharding strategy.
Sharding
To spread data fairly evenly across the nodes you can shard by the trailing digits of the key or by a modulo operation, but as soon as you add a machine you have to do a large-scale reshuffle of the data.
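The reshuffle cost is easy to see in a small sketch (a hypothetical illustration, not code from the book): with hash-modulo placement, growing a cluster from 4 to 5 nodes changes the target shard for most keys.

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Pick a shard by taking the key's hash modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]

# Compare placements before and after adding one machine (4 -> 5 nodes).
# With modulo placement, typically around 80% of keys end up moving.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys change shards")
```

Only keys whose hash gives the same remainder for both 4 and 5 stay put, which is why almost the whole dataset has to migrate.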
To take away the pain of redistributing data, more sophisticated schemes are used to split it up.
Some schemes rely on a central directory that records which shard holds each key. When a shard grows too large, this level of indirection lets data be moved between machines. The cost is that every operation first has to look up the key's location in the directory; in practice the directory information is usually small and mostly static, so it is kept in memory and only changes occasionally.
The other common scheme is consistent hashing. This technique uses a small table that splits the range of possible hash values into ranges, with one shard assigned to each range.
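A minimal sketch of the idea (an illustration, not any particular database's implementation): each node owns many points on a ring of hash values, and a key belongs to the first node point at or after the key's own hash, so adding a node claims only slices of the ring instead of remapping everything.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each node owns `replicas` points on a hash ring; a key is stored on
    the node owning the first ring point at or after the key's hash."""
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []                     # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        points = [p for p, _ in self._ring]
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Adding a fourth node to a three-node ring moves only roughly a quarter of the keys, instead of the near-total reshuffle that modulo sharding causes.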
How the sharding model affects us
Big data processing is built on the horizontal scaling model. The consequence is that the distributed processing of massive datasets involves compromises in some areas:
Writing distributed data handling code is tricky and involves tradeoffs between speed, scalability, fault tolerance, and traditional database goals like atomicity and consistency.
Beyond that, the way data is used changes too: the data is not necessarily on the same physical machine, so fetching it and computing on it become new problems.
NoSQL
Is NoSQL really schema-free?
In theory, each record could contain a completely different set of named values, though in practice, the application layer often relies on an informal schema, with the client code expecting certain named values to be present.
A traditional K/V cache lacks support for queries over anything complex. NoSQL builds on the pure K/V model, moving the responsibility for implementing these common operations from the developer to the database.
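That shift of responsibility can be sketched like this (plain dicts stand in for the stores, and `find()` is a made-up stand-in for a document database's query interface, not any real client API):

```python
# Pure K/V store: the client has to pull values back and filter them itself.
kv_store = {
    "user:1": {"name": "ann", "age": 35},
    "user:2": {"name": "bob", "age": 19},
}
adults = [v for v in kv_store.values() if v["age"] >= 21]  # client-side work

# A document store moves that filtering behind a query interface,
# so the database, not the application, walks the records.
class DocumentStore:
    def __init__(self, docs):
        self._docs = list(docs)

    def find(self, **criteria):
        """Return documents whose fields match all the given values."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore(kv_store.values())
print(store.find(name="ann"))  # [{'name': 'ann', 'age': 35}]
```

With a real document database the `find()` work can also use indexes and run close to the data, which is the whole point of pushing it server-side.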
Hadoop is the best-known public system for running MapReduce algorithms, but many modern databases, such as MongoDB, also support it as an option. It's worthwhile even in a fairly traditional system, since if you can write your query in a MapReduce form, you'll be able to run it efficiently on as many machines as you have available.
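What "writing a query in MapReduce form" means can be shown with the classic word-count example, reduced to a pure-Python sketch with no framework (each phase is a plain function a framework could farm out to different machines):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit (word, 1) pairs; each mapper can run on a different machine."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Each reducer independently collapses one key's values."""
    return key, sum(values)

docs = ["big data", "big deal"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

Because the map and reduce functions are side-effect free and keyed, the framework is free to run them on as many machines as are available.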
MongoDB
Features: JSON-like document structure; JavaScript. Advantages: backed by a commercial company; supports automatic sharding and MapReduce operations.
CouchDB
Features: queries are written in JavaScript; MapReduce.
Uses multi-version concurrency control (clients have to handle write conflicts, and periodic garbage collection is needed to remove old data). Drawback: no built-in solution for horizontal scaling, though external ones exist.
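The "clients handle write conflicts" part works roughly like this toy model (hypothetical names, only illustrating the revision idea, not CouchDB's actual API): every update must name the revision it was based on, and a write based on a stale revision is rejected rather than silently overwriting newer data.

```python
class ConflictError(Exception):
    pass

class MVCCStore:
    """Toy model of revision-based conflict detection: each document
    carries a revision number, and an update must be based on the
    current revision or it is rejected."""
    def __init__(self):
        self._docs = {}                          # key -> (rev, value)

    def get(self, key):
        return self._docs[key]                   # returns (rev, value)

    def put(self, key, value, based_on_rev=0):
        current_rev, _ = self._docs.get(key, (0, None))
        if based_on_rev != current_rev:
            raise ConflictError(f"stale write: based on rev {based_on_rev}, "
                                f"store is at rev {current_rev}")
        self._docs[key] = (current_rev + 1, value)
        return current_rev + 1

store = MVCCStore()
rev = store.put("doc", {"likes": 1})                  # creates rev 1
store.put("doc", {"likes": 2}, based_on_rev=rev)      # fine: based on latest
try:
    store.put("doc", {"likes": 99}, based_on_rev=rev) # stale: rejected
except ConflictError as e:
    print("rejected:", e)
```

The client that gets the conflict has to re-read the document and merge or retry; old revisions accumulate, which is why the periodic garbage collection mentioned above is needed.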
Cassandra
Started as an internal Facebook project and has become a standard distributed database option. It is worth the time it takes to learn such a complex system in exchange for its power and flexibility.
Traditionally, it was a long struggle just to set up a working cluster, but as the project matures, that has become a lot easier.
Solves the sharding problem with consistent hashing.
The data structures are optimized for consistent write performance, at the cost of occasionally slow reads.
Feature: you can specify how many nodes must agree before a read or write succeeds, i.e. control the consistency level and trade consistency against speed.
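The rule behind that knob is the quorum overlap condition used by Cassandra-style systems: with N replicas, a write acknowledged by W nodes and a read asking R nodes are guaranteed to overlap whenever R + W > N, so the read sees at least one up-to-date copy. A one-function sketch:

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """With N replicas, a write waits for W acks and a read queries R
    replicas. If R + W > N, every read set overlaps every write set,
    so each read is guaranteed to touch at least one fresh copy."""
    return r + w > n

# Typical settings for N = 3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  - quorum reads and writes
print(is_strongly_consistent(3, 1, 1))  # False - fast, but reads may be stale
```

Lowering W or R makes each operation faster and more fault tolerant, at the price of possibly reading stale data, which is exactly the consistency/speed trade-off described above.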
Redis
Two features make Redis stand out: it keeps the entire database in RAM, and its values can be complex data structures.
Advantage: the ability to handle complex data structures.
You can handle datasets larger than a single machine by clustering, but at present sharding is implemented on the client side.
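"Client-side sharding" means the application itself decides which server owns a key, something like this sketch (plain dicts stand in for server connections; with the real client library each entry would be a connection object such as redis-py's `redis.Redis(host=...)`, hosts hypothetical):

```python
import hashlib

# Stand-ins for connections to three separate Redis servers.
nodes = [{}, {}, {}]

def node_for(key: str) -> dict:
    """The client, not the server, routes each key to one node."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

def set_value(key, value):
    node_for(key)[key] = value

def get_value(key):
    return node_for(key).get(key)

set_value("session:42", "alice")
print(get_value("session:42"))  # 'alice'
```

Because the routing lives in the client, every client must agree on the hashing scheme and node list, and resizing the cluster brings back the reshuffling problem from the sharding section.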
BigTable
BigTable is only available to developers outside Google as the foundation of the App Engine datastore. Despite that, as one of the pioneering alternative databases, it's worth looking at.
HBase
HBase was designed as an open source clone of Google's BigTable, so unsurprisingly it has a very similar interface, and it relies on a clone of the Google File System called HDFS.
Hypertable
Hypertable is another open source clone of BigTable.
Voldemort
It uses consistent hashing to allow fast lookups of the storage locations for particular keys, and it has versioning control to handle inconsistent values.
Riak
It also uses consistent hashing and a gossip protocol to avoid the need for the kind of centralized index server that BigTable requires, along with versioning to handle update conflicts.
Querying is handled using MapReduce functions written in either Erlang or JavaScript. It's open source under an Apache license, but there's also a closed source commercial version with some special features designed for enterprise customers.
ZooKeeper
The ZooKeeper framework was originally built at Yahoo! to make it easy for the company's applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer a lot of features that help coordinate work across distributed clusters.
One way to think of it is as a very specialized key/value store, with an interface that looks a lot like a filesystem and supports operations like watching callbacks, write consensus, and transaction IDs that are often needed for coordinating distributed algorithms.
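A toy model of that "key/value store with watches" idea (purely illustrative; the real ZooKeeper API, e.g. via the kazoo Python client, differs in many details):

```python
from collections import defaultdict

class TinyCoordinator:
    """Toy model of ZooKeeper's coordination primitives: path-keyed
    values, one-shot watch callbacks, and a growing transaction id."""
    def __init__(self):
        self._data = {}
        self._watches = defaultdict(list)
        self._zxid = 0                       # grows with every write

    def set(self, path, value):
        self._zxid += 1
        self._data[path] = value
        # Watches are one-shot, as in ZooKeeper: fire once, then discard.
        callbacks, self._watches[path] = self._watches[path], []
        for cb in callbacks:
            cb(path, value)
        return self._zxid

    def get(self, path, watch=None):
        if watch is not None:
            self._watches[path].append(watch)
        return self._data.get(path)

coord = TinyCoordinator()
coord.get("/config/db", watch=lambda p, v: print(f"{p} changed to {v}"))
coord.set("/config/db", "db-host-1")   # triggers the watch once
```

A client that wants continuous notification re-registers its watch each time it fires, which is the usual ZooKeeper usage pattern.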
This has allowed it to act as a foundation layer for services like LinkedIn's Norbert, a flexible framework for managing clusters of machines. ZooKeeper itself is built to run in a distributed way across a number of machines, and it's designed to offer very fast reads, at the expense of writes that get slower the more servers are used to host the service.
Storage
S3
Amazon's S3 service lets you store large chunks of data on an online service, with an interface that makes it easy to retrieve the data over the standard web protocol, HTTP.
One way of looking at it is as a file system that's missing some features like appending, rewriting or renaming files, and true directory trees. You can also see it as a key/value database available as a web service and optimized for storing large amounts of data in each value.
http://www.ibm.com/developerworks/cn/java/j-s3/
HDFS
An introduction to HDFS:
http://baike.baidu.com/view/3061630.htm
Material about HDFS on NoSQLfan:
http://blog.nosqlfan.com/tags/hdfs
Computing on Big Data
Getting the concise, valuable information you want from a sea of data can be challenging, but there's been a lot of progress around systems that help you turn your datasets into something that makes sense. Because there are so many different barriers, the tools range from rapid statistical analysis systems to enlisting human helpers.
R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, BigSheets, Tinkerpop
NLP
Natural language processing (NLP) is a subset of data processing that's so crucial, it earned its own section. Its focus is taking messy, human-created text and extracting meaningful information. As you can imagine, this chaotic problem domain has spawned a large variety of approaches, with each tool most useful for particular kinds of text. There's no magic bullet that will understand written information as well as a human, but if you're prepared to adapt your use of the results to handle some errors and don't expect miracles, you can pull out some powerful insights.
MapReduce
The approach pioneered by Google, and adopted by many other web companies, is to instead create a pipeline that reads and writes to arbitrary file formats, with intermediate results being passed between stages as files, and with the computation spread across many machines.
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
Machine Learning
WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and window interface that makes it easy to apply them to your own data.
Mahout
Mahout is an open source framework that can run common machine learning algorithms on massive datasets.
scikits.learn
It's hard to find good off-the-shelf tools for practical machine learning. scikits.learn is a beautifully documented and easy-to-use Python package offering a high-level interface to many standard machine learning techniques. This makes it a very fruitful sandbox for experimentation and rapid prototyping, with a very easy path to using the same code in production once it's working well.
Amazon.cn link: http://www.amazon.cn/Big-Data-Glossary-Warden-Pete/dp/1449314597/qid=1333609610&sr=8-1#
Update (2012-8-18): the following is from a reply to a colleague's email, answering a few questions about NoSQL: