Understanding NoSQL
1. What is NoSQL

Agenda
  Common Traits
  Consistency
  Indexing
  Queries
  MapReduce
  Sharding

NoSQL Common Traits
  Non-relational
  Non-schematized/schema-free
  Eventual consistency
  Open source
  Distributed
  "Web scale"
  Developed at big internet companies

Consistency
  CAP Theorem
    Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
  NoSQL databases generally do not offer 'ACID' guarantees
    Atomicity, consistency, isolation and durability
  Instead they offer 'eventual consistency'
    Similar to DNS propagation

Indexing
  Most NoSQL databases are indexed by key
  Some allow so-called 'secondary' indexes
  Often the primary key indexes are clustered
  HBase uses the Hadoop Distributed File System, which is append-only
    Writes are logged
    Logged writes are batched
    File is re-created and sorted

Queries
  Typically no query language
  Instead, create a procedural program
  Sometimes SQL is supported
  Sometimes MapReduce code is used ...

MapReduce
  Map step: split the query up
  Reduce step: merge the results
  Most typical of Hadoop and used with Wide Column Stores, esp. HBase
  Amazon Web Services' Elastic MapReduce (EMR) can read/write DynamoDB, S3, Relational Database Service (RDS)
  "Hive" offers a HiveQL (SQL-like) abstraction over MR
    Use with Hive tables
    Use with HBase

Sharding
  A partitioning pattern where separate servers store partitions
  Fan-out queries supported
  Partitions may be duplicated, so replication is also provided
    Good for disaster recovery
  Since "shards" can be geographically distributed, sharding can act like a CDN
  Good for keeping data close to processing
    Reduces network traffic when MapReduce splitting takes place

2. NoSQL Technology Breakdown

Agenda
  Key-Value Stores
  Wide-Column Stores
  Document Stores
  Demo CouchDB
  Graph Databases
  Key-Value "mechanics" present throughout

Key-Value Stores
  The most common; not necessarily the most popular
  Has rows, each with something like a big dictionary/associative array
    Schema may differ from row to row
  Common on cloud platforms, e.g. Amazon SimpleDB, Azure Table Storage
  MemcacheDB, Voldemort
  DynamoDB (AWS), Dynomite, Redis and Riak

Document Stores
  Have 'databases', which are akin to tables
  Have 'documents', akin to rows
    Documents are typically JSON objects
    Each document has properties and values
    Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e., contained JSON objects, which allow for hierarchical storage)
    Can have attachments as well
  Old versions are retained
    So doc stores work well for content management
  Some view doc stores as specialized KV stores
  Most popular with developers, startups, VCs
  The biggies:
    CouchDB
    MongoDB

Document Store Application Orientation
  Documents can each be addressed by URIs
  CouchDB supports a full REST interface
  Very geared towards JavaScript and JSON
    Documents are JSON objects
    CouchDB and MongoDB use JavaScript as their native language
  In CouchDB, 'view functions' also have unique URIs, and 'show functions' can return HTML
    So you can build applications in the database

Demo CouchDB
  http://127.0.0.1:5984/pluralsight/_design/example/_view/dotNet
  http://127.0.0.1:5984/pluralsight/_design/example/_view/dataacess
  http://127.0.0.1:5984/pluralsight/_design/example/_show/showfunction/_id

Wide Column Stores
  Has tables with declared column families
    Each column family has "columns", which are KV pairs that can vary from row to row
  These are the most foundational for large sites
    BigTable (Google)
    HBase (originally part of the Yahoo-dominated Hadoop project)
    Cassandra (Facebook)
      Calls column families "super columns" and tables "super column families"
  They are the most "Big Data"-ready
    Especially HBase + Hadoop

Graph Databases
  Great for social network applications and others where relationships are important
  Nodes and edges
    Edges are like joins
    Nodes are like rows in a table
  Nodes can also have properties and values
  Neo4j is a popular graph db

3. Where is a NoSQL Killer App?

Agenda
  Content Management
  Product Catalogs
  Social
  Big Data
  Miscellaneous

Content Management
  Document databases work really well here
    Regular KV pairs can store metadata
    Can also store text-based content
    Attachments can store file-based or binary content
    Versioning and URI addressability help as well
  CouchDB gets called a 'Web database'
    Database for Web apps
    Database that can contain Web apps
  Think Web sites, not browser-based LOB applications
    Think Evernote

Product Catalogs
  Products in a catalog tend to have many attributes in common and then various others that are class-specific
    Common
      ProductID
      Name
      Description
      Price
    Class-specific
      Flavor, Color
      Resolution, Clockspeed
  Key-Value Stores and Wide Column Stores work well here
  KV Stores are better when the schema will change over time
    Since nothing is declared

Social
  Graph databases work best here
  Great for tracking:
    Networks
    Followers
    Group membership
    Threaded interactions (comments, likes/favorites)
  Great for membership, ownership
  Avoids the self-joins and many-to-many tables necessary in relational DBs

Big Data
  Wide Column and Key-Value stores work best here
  MapReduce is designed for these scenarios
  Hadoop and HBase come up a lot
  Sharding and append-only help here
  Premise of analytics is reading data, not maintaining it
    This is perfect for NoSQL
  Aggregation, correlation and regression do not require a formal schema or sophisticated query capabilities
    Just need to read and perform mathematical operations on data really, really quickly

Miscellaneous
  Event-driven data (i.e., logs)
  User profiles, preferences
  Mail, status message streams
  Other Web data
    Automobile directions info for sites on maps (category, name, description, lat/long, photo)
    User reviews
    Etc.
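
To make the Big Data and Product Catalogs points above a little more concrete, here is a minimal in-memory sketch of the map and reduce steps fanned out over shards of schema-free product rows. It is only an illustration under stated assumptions: the shard contents and the function names (map_shard, merge_partials, average_price_by_class) are hypothetical, everything runs in one Python process, and none of this is Hadoop, HBase or any real NoSQL API.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical shards: each one is a list of schema-free product rows (plain dicts).
# Every row carries the common attributes; class-specific ones vary from row to row.
SHARDS = [
    [
        {"ProductID": 1, "Name": "Cola", "Price": 1.50, "Class": "drink", "Flavor": "cherry"},
        {"ProductID": 2, "Name": "Monitor", "Price": 200.00, "Class": "display", "Resolution": "1920x1080"},
    ],
    [
        {"ProductID": 3, "Name": "Tea", "Price": 2.50, "Class": "drink", "Flavor": "mint"},
        {"ProductID": 4, "Name": "Laptop", "Price": 900.00, "Class": "computer", "Clockspeed": "2.4GHz"},
    ],
]


def map_shard(rows):
    """Map step: scan one shard and return a partial (count, total price) per class."""
    partial = defaultdict(lambda: [0, 0.0])
    for row in rows:
        stats = partial[row["Class"]]
        stats[0] += 1
        stats[1] += row["Price"]
    return {cls: tuple(stats) for cls, stats in partial.items()}


def merge_partials(left, right):
    """Reduce step: merge two partial results into one."""
    merged = dict(left)
    for cls, (count, total) in right.items():
        prev_count, prev_total = merged.get(cls, (0, 0.0))
        merged[cls] = (prev_count + count, prev_total + total)
    return merged


def average_price_by_class(shards):
    """Fan the map step out to every shard, then reduce the partial results."""
    partials = [map_shard(rows) for rows in shards]
    combined = reduce(merge_partials, partials, {})
    return {cls: total / count for cls, (count, total) in combined.items()}


if __name__ == "__main__":
    print(average_price_by_class(SHARDS))
```

Note that the map step only reads the attributes it needs (Class and Price) and ignores the class-specific ones, which is why no declared schema is required. In a real deployment each map call would run on the server holding that shard, so only the small partial results cross the network.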

4. What Good is Relational?

Agenda
  Transactional
  Formal Schema
  Line of Business Applications
  Declarative Query
  Banded Reporting

Transactional
  Business systems require atomic transactions
  You can't process an order without decrementing inventory
  You can't register a credit without its corresponding debit
  No exceptions, no excuses

Formal Schema
  Regular processes have regular data
    Stocks, trades
    PO line items
    Personnel records
    Insurance policies
  These need relational databases with declared schema
  These don't need MapReduce, document or graph representation

Line of Business Applications
  Screen layouts and data binding require consistent schema
  Data Transfer Objects have properties defined in code
    You can't have strong typing without a schema
  Object Relational Mapping
    Object models are mapped to database schema
    If the schema is not consistent then the mapping can't be either

Declarative Query
  It's silly to write imperative code for each routine query
    Makes ad hoc queries and reporting difficult
    Lose out on engine optimization
    Lose out on versatility
  Imperative query works best when the range of queries is very small
  Relational stored procedures do set a precedent for pre-written queries, but they still don't iterate through data sets imperatively

Banded Reporting
  Operational reporting is based on detail and group sections with predictable, consistent layout, based on known schema
  Very hard to design pixel-perfect reports against indeterminate schema
    You can dump all columns/all rows, but that's generic
  Forms are formal, by definition
  This highlights how operational business processes almost always require relational databases

5. NoSQL and Microsoft

Agenda
  Azure Table Storage
  SQL Server/Azure XML Columns
  SQL Azure Federations
  Demo
  OData
  MongoDB on Azure
  Hadoop on Azure/Windows
  Demo
  SQL Server "Beyond Relational"
  SQL Server Parallel Data Warehouse

Azure Table Storage
  Cloud-based Key-Value Store
  Supports OData interface (more on that later)
  Key-Value works nicely for general-purpose storage and retrieval
  SQL Server Data Services (precursor to SQL Azure) also implemented a Key-Value store

SQL Azure XML Columns
  XML columns hold structured data that can differ between rows
  Combining scalar and XML columns allows a combination of static and dynamic schemas
  XML schemas can still be declared
    But you can have more than one
    And it's not required
  If the motivation to use NoSQL is loose schema, then consider XML columns
  To prove the point: Azure Dev Fabric's Table Storage is implemented with SQL Server Express and XML columns

SQL Azure Federations
  Federations are the SQL Azure version of sharding
    Just for partitioning, not for replication
    Replication is automatic, implicit in SQL Azure
  Federation Root (physical & logical db, defines the Federation Key)
  Federation Member (physical db, contains a specific range of Federation Key values)
  Federation Atomic Unit (AU, container for all data with the same Federation Key value)
  Federated Table
  Federation Members can be addressed by absolute name or relative key value
  Allow online repartitioning
  Offer ACID guarantees within Federation Members and adopt eventual consistency between them
  Multi-tenancy applications
  Do not support fan-out query

OData
  RESTful API for data access, with rendering in XML or JSON
  Clients for JavaScript, mobile platforms, .NET, Java
  Works for feeds and updates
  The following feature OData interfaces:
    Azure Table Storage
    SQL Server/Azure (via WCF Data Services)
    Azure DataMarket
    SQL Server Reporting Services (in 2008 R2, 2012)
    SharePoint Lists (2010)
    Netflix, eBay catalogs; TwitPic
    IBM WebSphere eXtreme Scale REST data service
    Pluralsight catalog!
  Compare to JavaScript/JSON orientation of Document Stores

Run MongoDB, others on Azure
  Deploy to worker roles
  Put databases in Azure Blob Storage; mount as drives (Azure Drive)
  MongoDB Replica Set Azure wrapper supports this directly
  Use from on-premises or cloud application code
  Similar approach can be used for other NoSQL DBs

Hadoop on Azure/Windows
  MS + Hortonworks have developed a Windows version of Hadoop
    Currently in Community Technology Preview
  Can use installer to create cluster
    On-premises
    On Azure
  Can also use Hadoop on Azure
    Provision entire cluster from Portal
    Currently has 48-hour lifetime
    Browser-based Hive console
  Hive ODBC Driver
    Use from Excel (with add-in)
    Also use from PowerPivot, Analysis Services (2012 Tabular Mode), Reporting Services

SQL Server "Beyond Relational" Features
  XML Columns (already discussed)
  HierarchyId
  Sparse columns (SQL Server-only)
  Filestream (SQL Server-only)
  Allow schema flexibility while retaining ACID guarantees

SQL Server Parallel Data Warehouse Edition (SQL PDWE)
  Makes a cluster of SQL Server instances appear as one logical server
  Uses MPP: Massively Parallel Processing
    Compare to MapReduce
  Supports SQL, so no imperative coding needed
  Supports fan-out queries
  Supported by most SQL Server clients
  Available only as appliance
    Has finely tuned processor, storage, networking internals

6. NoSQL, Relational or Both?

Agenda
  Type of App
  Productivity
  Skill Sets and Investment
  Recommendations

Type of App
  Really a question of consistency versus massive scale
  Is this an internal system or a public one?
  Is it an application for the data or data for a system?
  Below a certain threshold of concurrent usage, NoSQL may be slower than relational

Productivity
  NoSQL db tooling is still immature
  Queries require significant work, and testing
  Programming platforms, frameworks and components may support RDBMSes much more robustly
    Especially enterprise platforms
  If the schema is subject to frequent change then NoSQL may be more productive

Skill Sets and Investment
  Does your staff have RDBMS skills already?
  Do you have significant investment in relational database hardware/software?
  Lots of apps that use an RDBMS?
  Do you want to retool? Do you want to support both?
  Are you a startup?
    Employ developers who possess NoSQL skills and prefer NoSQL?
  Does availability/scalability make RDBMS investment questions moot?

Recommendations
  Large, public, content-centric properties: NoSQL
  Internal LOB (line of business) supporting business operations: relational
  Investment in RDBMS licenses, infrastructure, skills: relational
  Use both (application-dependent)
  Use hybrid approaches
  Productivity
    Do cost-benefit analysis
    How much extra dev time/$$?
    What is the cost of a less scalable system?
  It will be tempting to use one for the other
    And it very well may work, but that doesn't make it right

(Editor: 李大同)