Designing Data-Intensive Applications - O'Reilly 2017 Reading Notes


Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Contents

Reliability, Scalability, Maintainability

Data Models and Query Languages

  1. Schema updates (see the schema-on-read sketch after this list):
    1. Relational SQL: ~ statically typed languages
    2. NoSQL: ~ dynamically typed languages
      1. Document stores: MongoDB's BSON -> rewrite/replace the whole document or partial updates? -> incremental snapshots?
  2. Grouping related data -> locality
    1. Spanner's interleaved (nested) tables
    2. Oracle's multi-table index cluster tables
  3. Convergence of document and relational models: RethinkDB (document-oriented) -> relational-like joins?
  4. MongoDB 2.2 aggregation pipeline: a declarative alternative to map-reduce
  5. Graph-like data models:
    1. Property graphs: Neo4j, Titan, InfiniteGraph
    2. Triple-stores: Datomic, AllegroGraph
  6. Three declarative graph query languages: Cypher, SPARQL (built on RDF), Datalog (a subset of Prolog, Lisp-flavored in Datomic; academic)
  7. vs. CODASYL (the network model)
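
A minimal sketch of the schema-on-read idea from item 1, assuming a hypothetical user document that may or may not predate splitting `name` into `first_name`/`last_name`:

```python
# Hypothetical schema-on-read: the application tolerates both the old and
# the new shape of a document instead of migrating every record up front.
def get_first_name(user_doc: dict) -> str:
    if "first_name" in user_doc:            # document written after the change
        return user_doc["first_name"]
    return user_doc["name"].split()[0]      # old document: derive on read

assert get_first_name({"name": "Ada Lovelace"}) == "Ada"
assert get_first_name({"first_name": "Ada", "last_name": "Lovelace"}) == "Ada"
```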

Storage and Retrieval

  1. Append-only log -> segments with a hash index -> compaction de-duplicates keys (see the sketch after this list)
    • Range queries are inefficient
  2. SSTable -> memtable (key-value) plus a log for crash recovery
    • The idea comes from BigTable; e.g. LevelDB, RocksDB
  3. LSM-tree
    • Also used for full-text indexes (storage of the inverted index; Lucene)
  4. Bloom filter: rules out non-existent keys early (though on a false positive the lookup cost is the same as before)
  5. When to compact & merge:
    1. size-tiered
    2. leveled
  6. B-tree: the standard index implementation
    1. Making it reliable: WAL (redo log)
    2. Are writes unfriendly to SSDs?
    3. Concurrency control: latches
    4. Optimizations*
  7. An LSM-tree's background compaction affects performance and is less predictable than a B-tree (essentially the cost is amortized)
  8. Other indexes
    1. Secondary (Map<K, List<V>>)
    2. Clustered (value stored within the index)
    3. Multi-column (concatenated)
    4. Full-text and fuzzy
  9. In-memory databases
  10. Data warehouses: analytic queries
    1. Vendor products: Teradata, Vertica, SAP HANA, ParAccel; AWS Redshift; SQL-on-Hadoop (Hive, Spark SQL, Impala, Presto, Tajo, and Drill, which follows Dremel)
    2. Star schema: a fact table (an event log?) plus dimension tables (further derived attributes?)
      1. Snowflake schema
  11. Column-oriented storage
    1. Column compression: bitmap encoding (enum -> bool), suited to IN (...) style queries?
    2. Multiple sort orders?
  12. Aggregation: data cubes & materialized views*
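
A minimal sketch of item 1, assuming an invented `LogKV` class and a single log segment: an append-only file, an in-memory hash index of byte offsets, and a compaction pass that keeps only the newest value per key.

```python
import os

class LogKV:
    """Append-only log with an in-memory hash index: key -> byte offset."""

    def __init__(self, path="kv.log"):
        self.path, self.index = path, {}
        open(path, "wb").close()                 # start with an empty segment

    def put(self, key: str, value: str):
        with open(self.path, "ab") as f:
            self.index[key] = f.tell()           # offset of the newest record
            f.write(f"{key},{value}\n".encode("utf-8"))

    def get(self, key: str) -> str:
        with open(self.path, "rb") as f:
            f.seek(self.index[key])              # one seek per lookup
            line = f.readline().decode("utf-8").rstrip("\n")
            return line.split(",", 1)[1]

    def compact(self):
        """De-duplicate: rewrite the segment, keeping only the latest values."""
        live = {k: self.get(k) for k in list(self.index)}
        os.remove(self.path)
        self.index.clear()
        for k, v in live.items():
            self.put(k, v)

db = LogKV()
db.put("x", "1"); db.put("x", "2"); db.put("y", "3")
db.compact()
assert db.get("x") == "2" and db.get("y") == "3"
```

Note that a range query here must scan the whole (unordered) index, which is exactly what motivates the sorted SSTables of item 2.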

Encoding and Evolution

  1. JSON; binary strings need Base64?
  2. BSON: extends the set of datatypes
  3. Thrift & Protocol Buffers (open-sourced in 2007-08); see the tag-encoding sketch after this list
    1. No field-name strings; fields are identified by tags (small integers)
  4. Avro: length prefix + UTF-8 bytes
    1. Reader's & writer's schemas
    2. Schema evolution rule: fields added or removed must have a default value
  5. ASN.1: e.g. DER
  6. Distributed actor frameworks: Akka, Orleans, Erlang OTP
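
A toy sketch of the field-tag idea from item 3 (not the real Thrift or Protocol Buffers wire format; the schema dict and record layout are invented):

```python
import struct

FIELD_TAGS = {"user_name": 1, "favorite_number": 2}     # the "schema"

def encode(record: dict) -> bytes:
    """Each field: 1-byte integer tag + 2-byte big-endian length + payload."""
    out = b""
    for name, value in record.items():
        payload = str(value).encode("utf-8")
        out += struct.pack(">BH", FIELD_TAGS[name], len(payload)) + payload
    return out

# Field names never appear on the wire, only their tags, so renaming a
# field in the schema is cheap, while reusing a tag number is not.
print(encode({"user_name": "Martin", "favorite_number": 1337}).hex())
```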

Replication

  1. Consistency models under replication lag
    1. Read-after-write consistency
    2. Monotonic reads
    3. Consistent prefix reads (causal order)
  2. Leaderless, Dynamo-style (no guaranteed write order? what about application-level hash routing? if all data is potentially interrelated, as in a social network, that doesn't apply...) - see the quorum-read sketch after this list
    1. The client issues parallel reads and takes the value with the highest version?
      1. The client then writes back to replicas holding stale versions
      2. Or a server-side anti-entropy process syncs in the background?
    2. Quorums: w + r > n
    3. Limitations of the quorum approach (omitted)
    4. Monitoring staleness
      1. Measuring replication lag
    5. Sloppy Quorums and Hinted Handoff
      1. Increases write availability, but read reliability drops
      2. "Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default" (oddly divergent design choices?)
    6. Detecting Concurrent Writes (multiple clients writing the same key simultaneously)
      1. Last write wins (discarding concurrent writes); application-level versioning?
      2. The "happens-before" relationship and concurrency (the same term appears in programming-language memory models)
        1. Two operations are concurrent iff neither happens before the other; no global timestamp is needed (the network is bounded by the speed of light)
      3. Capturing the happens-before relationship
        1. Version numbers: you must read before you write, obtaining the current latest version (a snapshot), and the client is responsible for merging before it writes
        2. Damn, that is exactly the everyday SVN/Git workflow
      4. Merging concurrently written values
        1. Such a `deletion marker` is known as a `tombstone`.
      5. Version vectors (automatic branching...)
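
A minimal sketch of item 2's read path, assuming invented `Replica` objects that store a `(version, value)` pair per key: the client reads `r` replicas, keeps the highest version, and repairs the stale ones.

```python
class Replica:
    """A single Dynamo-style replica: key -> (version, value)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key, (0, None))
    def put(self, key, versioned_value):
        self.store[key] = versioned_value

def quorum_read(replicas, key, r):
    responses = [(rep, rep.get(key)) for rep in replicas[:r]]   # r parallel reads
    version, value = max((vv for _, vv in responses), key=lambda vv: vv[0])
    for rep, (ver, _) in responses:
        if ver < version:
            rep.put(key, (version, value))                      # read repair
    return value

reps = [Replica() for _ in range(3)]
reps[0].put("x", (2, "new"))
reps[1].put("x", (1, "old"))
assert quorum_read(reps, "x", r=2) == "new"
assert reps[1].get("x") == (2, "new")                           # repaired
```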

Partitioning

  1. Partitioning strategies (see the fixed-partition sketch after this list)
    1. By key range: the hot-spot problem
    2. By hash of key: no key-range queries
      1. Consistent hashing: here "consistent" refers to the rebalancing approach
      2. Cassandra: for a compound primary key, only the first column is hashed (hash + range over the rest?)
    3. Partitioning cannot help with massive concurrent reads/writes of a single key: Skewed Workloads and Relieving Hot Spots
  2. & secondary indexes (document attributes in a document database)
    1. Document-based local index: scatter/gather reads
      1. In theory this could be avoided by keying on a hash of the combination of those secondary attributes instead of the document id
    2. Term-based: a global index, itself partitioned
      1. Writes become more complex: no database supports "distributed transactions" here?
  3. Rebalancing
    1. Don't use hash mod N; minimize the amount of data moved!
    2. Approach 1: assign many partitions to each physical node
      1. Fix the total number of partitions up front? Maintain the node-to-partition assignment by hand?
      2. How to choose the right total number of partitions??
    3. Dynamic partitioning
  4. Request Routing
    1. Routing tier: acts as a partition-aware load balancer
    2. ZooKeeper: keeps routing info up to date
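
A minimal sketch combining items 1.2 and 3.2, with invented names: a fixed number of partitions is chosen up front, so rebalancing only edits the partition-to-node map and never rehashes keys (contrast with hashing mod the node count).

```python
import hashlib

NUM_PARTITIONS = 16                          # fixed up front (item 3.2.1)

def partition_of(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS  # stable: independent of node count

# partition -> node; only this small map changes when nodes join or leave
assignment = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}

def node_for(key: str) -> str:
    return assignment[partition_of(key)]

before = node_for("user:42")
assignment[partition_of("user:42")] = "node3"   # a whole partition moves
assert node_for("user:42") == "node3"           # no key was rehashed
```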

Transactions

  1. ACID
    1. The word `consistency` is terribly overloaded (it must be guaranteed by the application?)
    2. Isolation
      1. Serializable: performance problems; Oracle doesn't even implement it!
      2. Snapshot (MVCC)
    3. The author's criticism: ... such as Rails's ActiveRecord and Django don't retry aborted transactions
  2. Weak Isolation Levels
    1. Read Committed: no dirty reads, no dirty writes
      1. Dirty write: one transaction's write overwrites another transaction's not-yet-committed write
      2. Implementation:
        1. Row-level locks
    2. Snapshot Isolation and Repeatable Read
      1. Cases that cannot tolerate temporary inconsistency:
        1. Backups
        2. Analytic queries and integrity checks
      2. Implementation:
        1. Key principle: readers never block writers, and writers never block readers
        2. txid: each row carries created_by and deleted_by (with a separate GC process); an update becomes delete + create?
        3. Object visibility rules
      3. Indexes: append-only (copy-on-write) B-trees
      4. ... As a result, nobody really knows what repeatable read means.
    3. Preventing Lost Updates (see the compare-and-set sketch after this list)
      1. Two concurrent read-modify-write cycles
      2. Atomic writes
      3. Application-side explicit locking: SELECT ... FOR UPDATE
      4. Automatic detection and abort
      5. Compare-and-set
      6. The replication problem: conflict resolution and replication (atomic operations need to be commutative)
    4. Write Skew and Phantoms
      1. Two transactions snapshot-read the same objects, then each performs its own write (a concurrency anomaly)
      2. => SELECT ... FOR UPDATE (but it can't help when the later write touches objects the earlier snapshot read did not return)
      3. Materializing conflicts
    5. Serializability
      1. Literally executing transactions in a serial order (in-memory databases? Redis)
        1. Systems with single-threaded serial transaction processing don't allow interactive multi-statement transactions. (stored procedures?)
          • The author plugs VoltDB quite a bit here~
      2. 2PL
        1. 2PL is not 2PC
        2. Shared locks -> exclusive locks
          1. Deadlock
        3. Predicate locks (lock a "search condition")
          1. Index-range locks
      3. Optimistic concurrency control techniques such as serializable snapshot isolation (SSI)
        1. Optimistic concurrency control performs poorly under high contention?
        2. Detecting an outdated premise
          1. Detecting stale MVCC reads (???)
            1. The question here: how does SSI know which data had uncommitted writes at snapshot creation that were committed by the transaction's own commit time?
            2. "the database needs to track when a transaction ignores another transaction's writes due to MVCC visibility rules" (???)
          2. Detecting writes that affect prior reads
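
A minimal sketch of compare-and-set from "Preventing Lost Updates" (item 2.3), with an invented `KVStore`; the retry loop makes a read-modify-write safe without holding an explicit lock across it:

```python
import threading

class KVStore:
    """Invented store whose only atomic primitive is compare-and-set."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()

    def get(self, key):
        return self._data.get(key, 0)

    def compare_and_set(self, key, expected, new) -> bool:
        with self._lock:                       # atomicity of the single operation
            if self._data.get(key, 0) != expected:
                return False                   # a concurrent writer got in between
            self._data[key] = new
            return True

def increment(store: KVStore, key: str):
    while True:                                # retry the read-modify-write
        old = store.get(key)
        if store.compare_and_set(key, expected=old, new=old + 1):
            return

store = KVStore()
for _ in range(3):
    increment(store, "counter")
assert store.get("counter") == 3
```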

The Trouble with Distributed Systems

  1. Monotonic Versus Time-of-Day Clocks
  2. Confidence intervals
    1. Google's TrueTime API in Spanner: [earliest, latest]
  3. Knowledge, Truth, and Lies
    1. The Truth Is Defined by the Majority: but a quorum over quorums... wouldn't that lead to a dictatorship?
    2. Fencing tokens (see the sketch after this list)
    3. Byzantine Faults
      1. Hardware errors caused by cosmic radiation? A single bit flip?
      2. Most Byzantine fault-tolerant algorithms require a supermajority of more than 2/3 of the nodes to be functioning correctly
    4. System models (models of faults), e.g. with respect to timing:
      1. Synchronous model: bounded delays? not very realistic
      2. Partially synchronous model
      3. Asynchronous model
    5. Safety and liveness
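
A minimal sketch of fencing tokens (item 3.2), with invented names: the storage service remembers the highest token it has seen and rejects anything older, so a client that resumes after a long pause with an expired lease cannot clobber newer writes.

```python
class FencedStorage:
    """Rejects writes whose fencing token is older than any seen so far."""
    def __init__(self):
        self.max_token, self.data = -1, {}

    def write(self, token: int, key: str, value: str):
        if token < self.max_token:
            raise PermissionError(f"stale fencing token {token}")
        self.max_token = max(self.max_token, token)
        self.data[key] = value

storage = FencedStorage()
storage.write(33, "file", "a")        # client 1 holds the lease, token 33
storage.write(34, "file", "b")        # lease expired; client 2 got token 34
try:
    storage.write(33, "file", "c")    # client 1 wakes from a long GC pause
except PermissionError as e:
    print(e)                          # its stale write is rejected
```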

Consistency and Consensus

  1. Stronger consistency: worse performance or less fault tolerance
  2. Linearizability
    1. Linearizability is a recency guarantee (once one read of x returns the new value, all subsequent reads must also return the new value)
    2. cas: update x from old to new only if x currently equals old, otherwise fail
    3. vs. serializability
      1. Serializability is an isolation property of transactions, while linearizability is a recency guarantee on reads and writes of a register; it does not prevent write skew
      2. Serializable snapshot isolation (SSI) is not linearizable
    4. Uniqueness constraints
      1. Quote: "Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale ..."
    5. Implementations:
      1. Multi-leader replication (not linearizable)
      2. Single-leader replication (potentially linearizable)
      3. Consensus algorithms (linearizable)
      4. Leaderless replication (probably not linearizable)
  3. CAP
    1. Consistency, Availability, Partition tolerance: pick 2 out of 3
    2. A network partition is a kind of fault, not something you get to choose (there is no choice)
  4. Ordering Guarantees
    1. If a system obeys the ordering imposed by causality, we say that it is causally consistent.
    2. Linearizability implies a total order, whereas causality is a partial order (could something be both linearizable and horizontally scalable?)
    3. Sequence Number Ordering
      1. Without a single leader: noncausal sequence number generators
        1. Globally unique ids are feasible (preallocated blocks of uids; or a high-resolution global clock). Drawback: not consistent with causality
    4. Lamport timestamps: (counter, node ID) - see the sketch after this chapter's notes
      1. "every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request."
      2. Lamport timestamps are sometimes confused with version vectors (but Lamport timestamps enforce a total ordering, while version vectors capture concurrency)
    5. Timestamp ordering is not sufficient (two users registering simultaneously to grab the same userid? a somewhat pathological example)
      1. "This idea of knowing when your total order is finalized is captured in the topic of total order broadcast." - new to me; how had I never heard of this?
  6. Total order broadcast (atomic broadcast)
      1. The algorithm must ensure the safety properties:
        1. Reliable delivery (messages must be reliably propagated; every node receives every message)
        2. Totally ordered delivery (every node receives the messages in the same order) - drawback: this order cannot be changed dynamically??
      2. Implementing linearizable storage using total order broadcast
        1. Implementing a linearizable CAS operation: ... Read the log, and wait for the message you appended to be delivered back to you (??? the description here is a bit vague)
          1. Footnote: If you don't wait, but acknowledge the write immediately after it has been enqueued, you get something similar to the memory consistency model of multi-core x86 processors [43]. That model is neither linearizable nor sequentially consistent.
  5. Distributed transactions and consensus
    1. FLP: The Impossibility of Consensus (in the asynchronous system model)
    2. 2PC and atomic commit
      1. Coordinator: prepare/commit
      2. A system of promises, in a bit more detail:
        1. A globally unique transaction ID
        2. Every participant's reads and writes are tagged with that transaction ID
        3. "By replying 'yes' to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it" (but how reliable is this promise? what about hardware destroyed by an unavoidable external disaster??)
        4. "Once the coordinator's decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds." (does this even count as an algorithm??? nonsense)
      3. 3PC: blocking -> nonblocking
        1. !3PC assumes a network with bounded delay and nodes with bounded response times
        2. Requires a perfect failure detector (but a reliable one is basically impossible, since in a distributed system a network timeout is indistinguishable from node failure)
    3. Distributed transactions in practice
      1. Internal (all nodes run the same protocol) vs. heterogeneous
      2. Exactly-once message processing
      3. "XA is not a network protocol - it is merely a C API for interfacing with a transaction coordinator."
        1. Receiving the prepare notification amounts to registering a callback on the XA driver's C API??? geez
      4. Holding locks while in doubt
      5. Recovering from coordinator failure (an unexpectedly lost commit log means commit/abort cannot be decided)
        1. "The only way out is for an administrator to manually decide whether to commit or roll back the transactions." (human intervention)
    4. Fault-Tolerant Consensus
      1. Properties of the algorithm:
        1. Uniform agreement: no two nodes decide differently.
        2. Integrity: no node decides twice.
        3. Validity: if a node decides value v, then v was proposed by some node.
        4. Termination: every node that does not crash eventually decides some value.
      2. The best-known fault-tolerant consensus algorithms: Viewstamped Replication (VSR), Paxos, Raft, and Zab
      3. Epoch numbering and quorums (within each epoch the leader is unique, but it need not be the same node across epochs)
        1. ... the leader can't decide by itself; instead, it must collect votes from a quorum of nodes ...
        2. Two rounds of voting: once to choose a leader, once on the leader's proposal.
      4. Limitations of consensus
        1. "... designing algorithms that are more robust to unreliable networks is still an open research problem"
    5. Membership and Coordination Services
      1. "HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background."
      2. ZooKeeper is modeled after Google's Chubby lock service, implementing not only total order broadcast but also:
        1. Linearizable atomic operations
        2. Total ordering of operations (zxid)
        3. Failure detection
        4. Change notifications
      3. ZooKeeper, etcd, and Consul are also often used for service discovery
      4. Read-only caching replicas: they only cache the results of the consensus algorithm and do not vote

Note: this chapter does not describe the internals of any consensus algorithm (such as Paxos or Raft).
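
A minimal sketch of Lamport timestamps (item 4.4), with invented names: each node bumps its counter past every counter it sees, and `(counter, node ID)` pairs compare as a total order that is consistent with causality.

```python
class LamportClock:
    """(counter, node_id) timestamps: totally ordered, consistent with causality."""
    def __init__(self, node_id: int):
        self.counter, self.node_id = 0, node_id

    def tick(self):
        """Timestamp a local event or an outgoing message."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, ts):
        """On receiving a timestamp, jump past it (the rule quoted in item 4.4.1)."""
        self.counter = max(self.counter, ts[0])

a, b = LamportClock(1), LamportClock(2)
t1 = a.tick()          # (1, 1) on node 1
b.observe(t1)          # node 2 learns node 1's counter
t2 = b.tick()          # (2, 2): causally after t1...
assert t1 < t2         # ...and ordered after it, as tuples
```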

Batch Processing

  1. MapReduce workflows (chained via hard-coded HDFS output paths?)
    1. A single MapReduce job cannot efficiently solve a Top-N problem?
  2. Reduce-Side Joins and Grouping
    1. Sort-merge joins
    2. Bringing related data together in the same place
    3. GROUP BY (aggregation queries)
    4. Handling skew
      1. The hot-keys problem:
        1. Skewed join in Pig (handled by pre-sampling? the other join input's records for a hot key must be replicated in full)
        2. Sharded join in Crunch: similar, but the hot keys must be specified explicitly rather than sampled (?)
        3. Hive: map-side join
  3. Map-Side Joins (see the broadcast-hash-join sketch after this list)
    1. Broadcast hash joins: a large dataset joined with a small one (the latter fits entirely into an in-memory hashtable)
    2. Partitioned hash joins (if both join inputs are partitioned by the same rule)
    3. Map-side merge joins
  4. The Output of Batch Workflows
    1. Key-value stores as batch process output
    2. Immutable input: buggy code can be corrected and re-run
    3. Using more structured file formats: Avro, Parquet
  5. Hadoop vs. MPP
    1. Advantage: data can be dumped into HDFS quickly ... (MPP systems require structured modeling up front)
    2. Allows overcommitting resources
      1. "At Google, a MapReduce task that runs for an hour has an approximately 5% risk of being terminated to make space for a higher-priority process"
  6. Beyond MapReduce
    1. "simple abstraction on top of a distributed filesystem": easy to understand, but implementing actual jobs is not simple (MapReduce is a very low-level API)
    2. Materialization of Intermediate State
      1. Drawbacks of MapReduce's full materialization of intermediate state (vs. Unix pipes): omitted
      2. Dataflow engines: Spark, Tez, Flink
        1. Avoid expensive sorting where it isn't needed?
        2. *JVM processes can be reused (Hadoop MapReduce starts a new JVM for every task)
      3. Fault tolerance in dataflow systems:
        1. How to describe the ancestry of a data input? (Spark RDDs)
        2. Make operators deterministic
    3. Graphs and Iterative Processing
      1. PageRank
      2. The Pregel processing model: bulk synchronous parallel (BSP); vertices send messages to one another
        1. The Pregel model guarantees that all messages sent in one iteration are delivered in the next iteration; the prior iteration must completely finish, and all of its messages must be copied over the network, before the next one can start.
        2. "This fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration"
        3. Problem: the communication overhead between machines is significant
    4. High-Level APIs and Languages
      1. Code generation: "Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops"
      2. Specialization for different domains: Mahout; spatial k-nearest-neighbors
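
A minimal sketch of a broadcast hash join (item 3.1), with invented record shapes: the small input is loaded into an in-memory hash table, and the large input simply streams past it.

```python
def broadcast_hash_join(big, small, key):
    """Load the small side into a hash table; stream the big side through."""
    table = {}
    for record in small:                        # fits in memory on every mapper
        table.setdefault(record[key], []).append(record)
    for record in big:                          # streamed, never materialized
        for match in table.get(record[key], []):
            yield {**record, **match}

events = [{"user_id": 1, "url": "/a"}, {"user_id": 2, "url": "/b"}]
users = [{"user_id": 1, "dob": 1991}]
print(list(broadcast_hash_join(events, users, "user_id")))
# [{'user_id': 1, 'url': '/a', 'dob': 1991}]
```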

Stream Processing

  1. Transmitting Event Streams
    1. Messaging Systems: pub/sub
      1. Message brokers (message queues): allow consumers to be temporarily offline
      2. Multiple consumers: fan-out
      3. Acknowledgments and redelivery
        1. "the combination of load balancing with redelivery inevitably leads to messages being reordered"
  2. Partitioned Logs
    1. Log-based message brokers: Apache Kafka
    2. With a typical large disk of 6TB and a sequential write throughput of 150MB/s, it takes about 11 hours to fill
  3. Databases and Streams
    1. Change Data Capture (CDC): capture the modifications made to a database and emit them as a stream (requires parsing the redo log/WAL)
      1. Initial snapshot
      2. Log compaction
      3. API support for change streams
    2. Event Sourcing (from the DDD community): modeling at a higher level?
      1. Deriving current state from the event log
      2. Commands and events
    3. State, Streams, and Immutability
      1. Separating the form in which data is read from the form in which it is written: command query responsibility segregation (CQRS)
      2. Append another event to the log and pretend the data was "deleted"? hahaha
  4. Processing Streams
    1. Complex event processing (CEP): the query (a declarative pattern search) becomes persistent
    2. Stream analytics: windows, probabilistic algorithms
    3. Maintaining materialized views
    4. Search on streams
    5. vs. other messaging/RPC (e.g. the actor model)
      1. A fusion of the two: Apache Storm's distributed RPC feature
    6. Event time versus processing time
    7. Types of windows (this discussion feels a bit dull; see the tumbling-window sketch after this chapter's notes)
      1. Tumbling window
      2. Hopping window: overlapping allowed
      3. Sliding window
      4. Session window
    8. Stream Joins (this discussion also feels a bit dull)
      1. Stream-stream join (window join): omitted
      2. Stream-table join (stream enrichment)
      3. Table-table join (materialized view maintenance)
    9. Time-dependence of joins
  5. Fault Tolerance
    1. Microbatching and checkpointing
    2. Atomic commit revisited ("exactly-once")
    3. Idempotence
    4. Rebuilding state after a failure

Much of this part feels like hot air (empty talk); maybe I should study Kafka's design carefully??
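
A minimal sketch of a tumbling window (item 4.7.1), with invented event shapes: each event falls into exactly one fixed-size, non-overlapping window based on its event time.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds=60):
    """Count events per fixed, non-overlapping window keyed by window start."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = event_time - (event_time % window_seconds)
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
assert tumbling_counts(events) == {0: 2, 60: 1, 120: 1}
```

A hopping or sliding window would instead let one event contribute to several overlapping windows.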

The Future of Data Systems

  1. Data integration
    1. Schema Migrations on Railways (gradual migration!)
    2. The lambda architecture: an immutable append-only event log processed by both stream and batch systems?
  2. Unbundling Databases
    1. "Transactions within a single storage or stream processing system are feasible, but when data crosses the boundary between different technologies, I believe that an asynchronous event log with idempotent writes is a much more robust and practical approach"
    2. Observing Derived State
      1. "Taken together, the write path and the read path encompass the whole journey of the data ..."
      2. Materialized views and caching
      3. Stateful, offline-capable clients
        1. offline-first
        2. "In particular, we can think of the on-device state as a cache of state on the server. The pixels on the screen are a materialized view onto model objects in the client app; the model objects are a local replica of state in a remote datacenter"
      4. End-to-end event streams
      5. Reads are events too (a somewhat decentralized flavor? one of the author's flights of fancy?)
      6. Multi-partition data processing
  3. Aiming for Correctness
    1. "exactly once", duplicate suppression, operation identifiers (see the request-ID sketch after this list)
    2. The end-to-end argument (do we need an end-to-end transaction ID?)
    3. Applying end-to-end thinking in data systems
      1. The author says nothing concrete here. Is it worth re-examining whether "transaction" is the right abstraction?
    4. Enforcing Constraints
      1. Multi-partition atomic commit is unnecessary: an event-based log where each event carries a "Request ID" suffices (traceable & auditable)
        1. Note: an event log is in fact what Bitcoin-style distributed P2P decentralized applications use (though the Request ID mechanism is not prominent in Bitcoin's transaction model, since the blockchain demands globally linearized synchronization...)
        2. One more question: who assigns the Request ID? It should be the app client side, not the server front end (which is just DDD)
          1. In fact an end-to-end Request ID is not strictly needed; it is enough that each domain translation (a DDD term) correctly maps domain object ids
    5. Consistency: Timeliness and Integrity
      1. "Timeliness means ensuring that users observe the system in an up-to-date state."
      2. "Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data"
      3. Event-based dataflow systems decouple the two? (since event processing is asynchronous, the sender does not immediately see the result)
        1. "Passing a client-generated request ID through all these levels of processing, enabling end-to-end duplicate suppression and idempotence"
      4. "Similarly, many airlines overbook airplanes in the expectation that some passengers will miss their flight, ..." (the data model's consistency is sometimes broken outright by the business model, introducing compensating transactions or out-of-band human intervention)
        1. apology
      5. Coordination-avoiding data systems
        1. Deployable across datacenters in a multi-leader, asynchronously replicated configuration; consistency can always be temporarily violated and repaired after the fact (the end-to-end traceable log makes this possible)
    6. Trust, but Verify
      1. The "system model": computers don't make mistakes, data fsync'd to disk is never lost, ...
        1. But: hardware bit-flip errors, "rowhammer" attacks
      2. "If you want to be sure that your data is still there, you have to actually read it and check."
      3. Designing for auditability
      4. Tools for auditable data systems
        1. Use cryptographic tools
        2. (The author's view) proof of work is wasteful?
        3. Merkle trees
          1. certificate transparency
  4. Doing the Right Thing
    1. "A technology is not good or bad in itself - what matters is how it is used and how it affects people."
    2. For example: Predictive Analytics
      1. Bias and discrimination
      2. Responsibility and accountability
      3. Feedback loops
    3. Privacy and Tracking (note: especially in medical systems)
      1. Surveillance
      2. Consent and freedom of choice (de facto mandatory)
      3. "Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret."
        1. Note: the essence of the privacy problem is that once data is stored, it can in principle be kept forever (barring hardware failure), whereas human memory can selectively forget
      4. Legislation and self-regulation
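
A minimal sketch of the request-ID idea from items 3.4 and 3.5, with invented names: a client-generated request ID makes event processing idempotent, so redelivery cannot apply the same operation twice.

```python
def apply_event(event, handle, processed_ids):
    """Suppress duplicates by client-generated request ID, making redelivery safe."""
    if event["request_id"] in processed_ids:
        return False                         # already applied: drop the duplicate
    handle(event)                            # hypothetical business logic
    processed_ids.add(event["request_id"])   # durably, in the same atomic step
    return True

seen, log = set(), []
ev = {"request_id": "req-1", "op": "debit", "amount": 11}
assert apply_event(ev, log.append, seen) is True
assert apply_event(ev, log.append, seen) is False   # redelivered: no double debit
assert len(log) == 1
```

In a real system the set of processed ids would live in the same store as the derived state and be updated in the same atomic write, which is what makes the suppression end-to-end rather than per-hop.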