--- Log opened Fri Feb 27 00:00:27 2009
01:57 < mae> jsn, stepcut: very interesting discussion :) I also see spread as possibly being training wheels, but I have no reason to doubt its efficacy for the time being
01:58 < jsn> i think my larger point is that transactional storage has strong performance implications
02:01 < jsn> mae: well, i would go further -- it is inappropriate to make a transaction commit a barrier for storing information
02:01 < jsn> it is possible to store right away and "let god sort it out" with revision numbers -- you simply put the numbers in order, after the fact
02:02 < mae> absolutely
02:02 < mae> I think right now multimaster is optimistic, like stepcut said, 'eventual consistency'
02:02 < jsn> well, LDAP has multi-master of a kind, as well
02:03 < jsn> unfortunately, their multi-master is just pushing changesets to everyone
02:03 < mae> the sharding stuff that Lemmih had planned was not going to use any sort of distributed transaction
02:03 < jsn> the net effect is "last changeset wins"
02:03 < jsn> oh, no distributed transactions?
02:04 < mae> jsn: well, if you needed transactional integrity, you could use a component that strictly runs with multimaster replication, and if you didn't you would use sharding
02:04 < mae> this was his original thought
02:04 < mae> I looked at the traditional 2-phase commit
02:04 < mae> for distributed..
02:04 < mae> seems a bit too heavy duty
02:04 < jsn> well, to actually do multi-master, you need consensus
02:04 < jsn> my point is that LDAP, for example, claims multi-master but fails
02:05 < jsn> because they just shovel the updates around, they don't have a voting phase
02:05 < mae> jsn: depends on how you're measuring the success
02:05 < mae> jsn: but my point is taken, and I honestly don't know if spread falls into the former or latter camp
02:05 < jsn> mae: it can be in an inconsistent state
02:06 < jsn> last update _to reach this node_ wins
02:06 < mae> from what i read in the previous discussion between you and stepcut, am I making the correct conjecture that spread may not suffer from consensus / voting problems but might end up with zombie nodes?
02:07 < mae> because of a lack of timeout?
02:07 < Saizan> spread has to solve consensus to ensure total ordering.
02:07 < jsn> right
02:07 < Saizan> we don't know how it handles node failures.
02:07 < mae> right
02:07 < jsn> i am still reading their stuff
02:08 < jsn> a zombie node would be one that doesn't respond or one that responds incorrectly?
02:08 < mae> jsn: We would love to have some mindshare from more people like you who have more experience / interest in distributed systems
02:08 < mae> that part of the code really needs some love
02:08 < mae> it's sort of hard to justify happstack as a scalability panacea if we can't solve these scaling problems well
02:09 < jsn> well, i say on the contrary, it's not your problem
02:09 < mae> jsn: that's true :)
02:09 < jsn> scalable storage is not a framework issue
02:09 < jsn> to solve this problem for haskell alone would be a crime
02:09 < mae> jsn: MACID is a productivity booster, but if we can make it scale, it will be a productivity booster that can scale
02:10 < jsn> to solve it for everyone would be a triumph, far more valuable than HAppS
02:11 < mae> jsn: My vision is that macid doesn't attempt to be another generic distributed database, but rather a flexible persistent datastore for haskell. Regarding outward interfaces, I am fond of the idea of standardizing on REST.
02:12 < mae> My itch is the fact that I feel generic database constraints fail miserably, and that check constraints / etc are too detached from the application code
02:12 < jsn> well, to get a distributed, ACID datastore, you need to at least solve the problem of transactional, distributed byte storage
02:12 < mae> now, on the other hand, if you build your data validation into your happs webapp, and provide a rest interface, then this will keep people from wrecking your data.
02:13 < jsn> hmm
02:13 < mae> You lose flexibility, but you gain massively in reduction of complexity of the entire platform.
02:13 < jsn> i think that is all fair
02:13 < jsn> i don't think you need to have a system that is generic beyond storing bytes
02:14 < sm> maybe happs should just use couchdb
02:14 < mae> now, regarding your tune, if someone were to build a high level databasy thing on top of happs in the future, this could solve it for "everybody" as you say, even if they didn't want to program in haskell.
02:14 < jsn> sm: no transactions, though
02:14 < mae> sm: couchdb is too specialized :)
02:14 < jsn> mae: if we solve it for haskell, we have to solve the bytes problem.
02:15 < jsn> mae: if we solve the bytes problem, that is valuable all by itself.
02:15 < mae> jsn: well, again, nothing is stopping anyone from generalizing the platform for flexibility during runtime
02:15 < mae> but I see this as an extension too, and not a requirement for use of the system
02:16 < jsn> mae: hmm?
02:16 < mae> jsn: Perhaps I misunderstood your tack -- what did you mean by "if we solve the bytes problem"?
02:17 < mae> I took that to mean, generalized routines for storage and retrieval of bytes, outside the haskell universe.
02:17 < jsn> mae: i mean, to do MACID correctly, you have to be able to store and retrieve bytes correctly
02:17 < mae> jsn: oh all right, then we are on the same page :)
02:18 < jsn> mae: so once you have done that, you have already thrown out all the haskell, really
02:18 < mae> by thrown out, you mean the algorithm is independent of any language, of course.
02:19 < jsn> well, what i mean is: the transaction manager will be quite ignorant of its messages' contents
02:19 < mae> hehe
02:19 < jsn> i mean we obviously want to use haskell to write it
02:19 < mae> jsn: ahh ok, then I guess I can get off the fence
02:19 < mae> of COURSE we have to write it in haskell :)
02:19 < jsn> yes
02:20 < jsn> because if we solve this problem in haskell, then everyone has to install haskell and learn haskell
02:20 < Saizan> you're agreeing, but you seem to focus on the opposite ends of the problem/solution :)
02:20 < jsn> and we will have glorified haskell forever
02:20 < jsn> Saizan: do clarify?
02:21 < jsn> yeah, i am saying, they will install haskell not to love it but to run it
02:22 < mae> yes, and thats where we will convert them
02:22 < jsn> so, basically, there is a transaction manager
02:22 < jsn> right
02:22 < mae> to be part of the master programmer race
02:22 < mae> umm ok
02:22 < jsn> well, and look at how we were able to write something that confounded everyone -- even those erlang guys -- mwawahahah
02:22 < mae> I totally meant that in a non-offensive tongue-in-cheek sorta way, for any euros.
02:22 < jsn> so, there is a transaction manager
02:23 < jsn> and it gets some bytes with transaction tags
02:23 < jsn> that's what it has to work with
02:23 < jsn> serialization/validation is a different layer
02:23 < jsn> that's why i think any reasonably orthogonal solution will work well with application components in other languages
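A minimal Haskell sketch of the layering described here (illustrative names only, not actual happstack code): the transaction manager handles tagged, opaque bytes, and serialization/validation sit in a separate layer above it.

    import Data.ByteString (ByteString)

    type TxTag = Integer

    -- The transaction manager only ever sees a tag plus opaque bytes; turning
    -- those bytes into Haskell values (or validating them) happens in a layer
    -- above, which is what keeps the manager usable from other languages.
    data TaggedChange = TaggedChange
      { changeTag   :: TxTag
      , changeBytes :: ByteString
      }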
02:24 < jsn> so, as for zombies -- a functioning consensus system will have zombies
02:24 < jsn> in the sense that, it will have nodes that know they have lost connection and will function in a diminished capacity
02:25 < jsn> i'm not sure what zombies you were talking about earlier
02:26 < mae> transaction tags, uuids? random sha2 hashes?
02:26 < jsn> well, you can get them in any way that guarantees their uniqueness
02:26 < jsn> so UUID version 1 is alright
02:27 < jsn> but you must also have group membership information anyways, as a pre-condition for even knowing what quorum is in your system
02:27 < jsn> so it would be quite alright for each node to have a counter and a number and to just append its number to its counter to get a unique identifier
02:28 < jsn> i have actually described the whole thing in < 300 lines here: http://github.com/jsnx/members-only/blob/master/notes/Consistent%20Logging%20Algorithm
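A small sketch of the counter-plus-node-number identifier scheme just described (hypothetical names, not from the linked writeup):

    data NodeId = NodeId Int          deriving (Eq, Ord, Show)
    data TxId   = TxId Integer NodeId deriving (Eq, Ord, Show)

    -- Each node holds a distinct NodeId and a local counter; pairing the
    -- counter with the node number gives cluster-unique identifiers with no
    -- coordination between nodes.
    nextTxId :: NodeId -> Integer -> (TxId, Integer)
    nextTxId node counter = (TxId counter node, counter + 1)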
02:28 < mae> right
02:29 < mae> that would be nice.
02:30 < jsn> mae: two phase commit's problem is not the two phases but the leader
02:30 < jsn> mae: you can't do without some kind of proposal and voting
02:31 < jsn> mae: but leader election and failover adds a whole new level of complexity and brittleness
02:31 < mae> right
02:31 < mae> the more stateless the better.
02:31 < jsn> aye
02:32 < jsn> so, we need to have the group members, but that is enough
02:32 < jsn> in the system i describe above, i just have every member broadcast votes to every other member
02:32 < mae> jsn: while we're shooting the shit, what do you think about the possibility of happstack using anycast for redundant / load balanced ip failover for your http server
02:34 < mae> jsn: re group member voting: so they aren't voting for leader, but for message dominance with each and every update?
02:34 < mae> jsn: anycast requires ipv6, of course, but root nameservers already make use of it with success
02:35 < jsn> mae: i am not sure how to address the anycast at present
02:35 < jsn> mae: i am unclear as to how that actually avoids talking to failed nodes, for example
02:37 < mae> jsn: it's all handled at the ip level
02:37 < jsn> as for the voting: Every changeset goes through two rounds. In the first round, it reaches a node; that node tags it with a transaction number and broadcasts it to every node in the membership list. This message is:    <accept?> <transaction> <stuff that's in it>
02:38 < mae> so there would have to be some basic logic in your http server; it would probably have to use a similar election model to the one you described (if you wanted to have session affinity for that server).
02:38 < mae> on the other hand, if its stateless, it doesn't really matter where it goes
02:38 < mae> but we all know that doesn't happen these days with webapps
02:38 < jsn> In the second round, nodes broadcast their votes to one another. This message is:    <accept!> <transaction>
02:39 < jsn> each node waits to get a quorum; it commits if it receives a quorum and not otherwise.
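A sketch of the two message shapes in the rounds jsn describes (hypothetical Haskell types; the payload stays opaque):

    type TxId = Integer   -- stands in for whatever unique transaction tag is used

    data Message payload
      = Proposal TxId payload   -- round one:  <accept?> <transaction> <contents>
      | Vote     [TxId]         -- round two:  <accept!> <transaction> ...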
02:40 < jsn> I think many end-users would prefer if HAppS delegated to their DNS server for routing policy
02:41 < jsn> in any given round, several messages can be handled at once
02:41 < mae> if they are in one transaction?
02:42 < jsn> well, no
02:42 < jsn> they can all be in different transactions
02:42 < mae> so what if a node dies
02:42 < jsn> then you just get several messages   <accept!> <transaction number 1> <transaction number 2> ....
02:42 < mae> no updates can happen?
02:42 < mae> or group membership is adjusted accordingly?
02:42 < jsn> well, you only need a quorum to commit
02:43 < jsn> you have to kill nearly half the nodes
02:43 < jsn> group membership would be adjusted via this very protocol, though
02:43 < mae> i am not familiar with the quorum terminology in the way you are using it
02:43 < jsn> a quorum is a majority
02:43 < mae> ah ok
02:44 < jsn> so, we need more than half the nodes to approve a transaction in a round for it to be committed
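The commit rule as a one-line predicate, a quorum being a strict majority of the membership list (names are illustrative):

    -- A node commits a transaction once it has seen <accept!> votes from
    -- more than half of the members, and not otherwise.
    hasQuorum :: Int   -- votes seen for this transaction
              -> Int   -- size of the membership list
              -> Bool
    hasQuorum votes members = 2 * votes > members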
02:44 < mae> why would they choose to reject it?
02:44 < jsn> well, any branch is rejected
02:45 < jsn> well, okay, i have not explained that at all
02:45 < jsn> so never mind why
02:45 < jsn> the nodes are, in practice, replicas
02:45 < jsn> whenever they reject, they are going to reject the same stuff
02:45 < mae> gotcha
02:46 < jsn> so the voting behaviour is basically going to be everyone all voting the same way, most of the time
02:46 < mae> a branch would be, anything which did not have a leaf as a parent
02:46 < mae> right?
02:46 < jsn> well, actually, the model i mention in my writeup is for ordering revisions
02:46 < jsn> so, we store all these changesets that form branches
02:46 < mae> yeah i am thinking in terms of revisions
02:46 < mae> in mercurial, it behaves the way you are describing
02:47 < jsn> however, we only let the first changeset be the "trunk"
02:47 < jsn> all the other ones are stored
02:47 < mae> to merge a branch, a given changeset will end up having two parents, so I guess my question is, you are ensuring a linear history, with no branches, right?
02:47 < jsn> because we broadcast the changeset to everyone, right?
02:47 < jsn> well, i have a history with branches _and_ a distinguished trunk
02:48 < jsn> we store all the branches and we store which one is first
02:48 < jsn> the storing part is eventual -- we just push everything out to be stored, we don't make any effort to ensure that it is
02:49 < jsn> the distinguished trunk is what requires consensus
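A rough sketch of the storage picture jsn describes: every changeset is (eventually) stored, branches included, and consensus only decides which changesets form the distinguished trunk. Types here are illustrative, not from the linked writeup.

    import qualified Data.Map as Map

    type TxId      = Integer
    type Changeset = String   -- stands in for the real, opaque payload

    data History = History
      { allChangesets :: Map.Map TxId Changeset  -- everything broadcast, branches included
      , trunk         :: [TxId]                  -- the agreed, consensus-ordered line
      }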
02:49 < mae> so you're just talking about dealing with asynchronous revisions in an elegant, distributed way; you haven't specified how you will solve merging the branches, you leave that up to the implementer, right?
02:49 < jsn> they may choose not to do it, even
02:50 < jsn> the branches would just be a record of failed operations in certain contexts
02:50 < mae> true
02:50 < jsn> of course this facilitates merging, though
02:50 < jsn> since you haven't lost the data that you failed to commit
02:50 < mae> so, checkpointing would also be solved at a higher layer, right? the checkpoint need only have a beginning tid and end tid
02:51 < jsn> hmm
02:52 < jsn> have you ever heard of checkpointing a version controlled repo?
02:52 < jsn> in a sense, it is self checkpointing, yes?
02:52 < mae> right
02:53 < mae> well, I guess I am referring to the reality of reading all the transactions to read your state
02:53 < mae> (in the database context)
02:53 < jsn> yes
02:53 < mae> so the checkpointed state would be identified by changesets, but not necessarily defined by them.
02:54 < jsn> well, you would want to repack each file pretty frequently
02:54 < mae> jsn: so then, if you were to create this distributed, elegant, asynchronous replication lib, what would it be called? And essentially, as far as MACID is concerned, we could replace spread with it, right?
02:54 < jsn> that is a kind of checkpointing that version control systems likely do -- caching versions
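If checkpointing is just caching, a checkpoint can be as simple as a materialized state paired with the last transaction folded into it (a hypothetical sketch, not any existing happstack type):

    type TxId = Integer

    data Checkpoint st = Checkpoint
      { upToTx :: TxId   -- last transaction folded into the cached state
      , cached :: st     -- replay resumes from the transaction after upToTx
      }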
02:55 < jsn> it would be called, uhm, i don't know
02:55 < mae> perfect
02:55 < mae> uhm
02:55 < mae> better than erlang, they got their db name from "amnesia"
02:55 < mae> what an oxymoron :)
02:55 < jsn> as for replacing MACID with it, i would say, you would replace it with this and a serialization system
02:56 < jsn> somehow, you need to make space in MACID for a "could not connect to server" error and a "server fed me garbage" error
02:56 < mae> jsn: how would sharding play into this replication method? Would it make use of it, or be characterized by different behavior? Can sharding be handled at a low level like replication? If so will it stay balanced?
02:57 < jsn> I need to write that up. Sharding is basically that some of the nodes just agree to certain transactions and don't store anything.
02:57 < jsn> However, the message cost is still high there. So we should use trees.
02:58 < jsn> Which is to say, you have some groups of replicas.
02:58 < jsn> A B C D are the groups
02:58 < jsn> and then you have some nodes sitting on top of them, in a group that routes, call it M
02:59 < jsn> oh, wait, we can forget the tree
02:59 < jsn> so we have A B C D right?
02:59 < jsn> and say we send a request to a router, and what it does is send it to a node in A, a node in B, a node in C and a node in D
03:00 < jsn> the request belongs to C, say
03:00 < jsn> well, the nodes in A, B and D do not care, so they send back acceptance right away and do not forward it to their peers
03:01 < jsn> the node in C does two rounds with its peers and gets approval, &c.
03:01 < jsn> you will really want to use trees as you shard more, though
03:02 < jsn> and you want the router to actually send the request to any node in C, A, B or D. you don't want there to be a special representative
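A sketch of the fan-out just described, under the assumption that groups own names by key prefix (that rule, and all names here, are illustrative): a group that does not own the request replies <accept!> right away, while the owning group runs the full two-round commit with its peers.

    data Group = A | B | C | D deriving (Eq, Show)

    -- Illustrative ownership rule: group A holds names starting with 'A', etc.
    owns :: Group -> String -> Bool
    owns g (c:_) = show g == [c]
    owns _ []    = False

    -- A group that does not own the name accepts right away; the owning group
    -- runs the full two-round quorum commit with its peers before answering.
    handleAtGroup :: Group -> String -> IO Bool
    handleAtGroup g name
      | owns g name = twoRoundCommit g name
      | otherwise   = return True

    twoRoundCommit :: Group -> String -> IO Bool
    twoRoundCommit _ _ = return True   -- placeholder for the protocol sketched earlier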
03:02 < mae> multicast?
03:02 < jsn> multicast has high message cost
03:02 < jsn> you can't shard with it, really
03:03 < mae> ok, well what if random A representative alpha is a no-ping
03:03 < jsn> oh, i see what you mean
03:03 < mae> a multicast for, hey, one of you bastards needs to help me out here
03:03 < jsn> yeah, send it to two or three of them
03:03 < mae> thats pretty lightweight
03:03 < jsn> right, or that
03:03 < mae> hehe
03:04 < jsn> so, one more thing
03:04 < jsn> we have the message contents and its transaction number -- but sharding means we have a space of names
03:05 < jsn> and that is why i called the stuff files, earlier -- though the names can just be strings
03:05 < jsn> no need for hierarchical paths
03:05 < mae> alright, so how do we differentiate between A B C D, an arbitrary algorithm that the programmer provides for a particular dataset?
03:05 < mae> obviously, it has to be a stateless algorithm
03:06 < mae> or it is defeating the sharding concept
03:06 < jsn> well, as i suggest above, you just send the request to each of them and they can approve it if they don't care
03:06 < mae> but how do they know whether to reject it or not
03:06 < mae> or whether they care
03:06 < mae> there has to be a stateless algorithm, no?
03:07 < jsn> well, they are preserving the state of certain resources
03:07 < mae> i.e. Map keys that start with 'A'
03:07 < mae> then 'B'
03:07 < mae> etc
03:07 < jsn> aye, or anything like that
03:07 < mae> any way to generalize this and balance it better
03:07 < jsn> but each node is basically replicating many files with its peers
03:07 < mae> than arbitrage?
03:07 < jsn> hmm?
03:08 < mae> i.e. we might have 2000 'A' people
03:08 < mae> and 10 'B' people
03:08 < jsn> yeah, you can break that up sensibly
03:08 < jsn> each node knows which files are on it
03:09 < jsn> note that, because we require an <accept!> even from nodes that don't care, there is no outward difference between peers and don't care nodes
03:09 < jsn> we can have cross shard transactions this way, it is just the same
03:09 < mae> um
03:10 < mae> but the point of sharding is to partition your update cost I thought
03:10 < mae> (or at least part of the point)
03:10 < jsn> you break the nodes up in to groups
03:11 < jsn> when you send a message to a group that does not care about it, the first node just sends back <accept!> to you
03:11 < jsn> which is to say, it <accept!>s for the group
03:11 < jsn> whereas a group that does care will have to go through the full commit before sending back <accept!>
03:11 < jsn> so you really do lower the cost
03:12 < jsn> however, you also can do cross cluster commit without any special cases
03:12 < jsn> it really is a tree, that is the best way to think about it
03:12 < mae> ok
03:12 < jsn> i need to write it up, though
03:13 < jsn> you know, i have not thought about this for some time
03:13 < mae> so multimaster vs sharding, they can and will be overlapping and/or optional right?
03:13 < mae> ie, sharding without multimaster, multimaster without sharding, or multimaster and sharding
03:13 < jsn> well,
03:14 < jsn> yes, yeah
03:14 < jsn> if you want to have replicas within a shard, for example, that is no problem
03:14 < jsn> likewise, if you want to have all replicas, that is fine
03:14 < mae> multimaster allows you to scale queries and provides failover, whereas partitioning allows you to scale your updates, and memory size
03:14 < jsn> right
03:15 < jsn> the consistent logging algorithm in the paper i linked to explains the replication
03:15 < mae> yeah it looks great, i'll read it tomorrow
03:15 < mae> so one last bit about sharding.
03:15 < jsn> aye
03:16 < mae> Will balancing it be difficult, and/or is balancing even really necessary? i.e. can I have an 8gb box and a 32gb box
03:16 < mae> and use the 32GB to its full potential
03:16 < jsn> there would be no problem with that
03:16 < jsn> put more files on the big one
03:17 < jsn> of course, they can not be replicas :)
03:17 < mae> which leads to the next question... can the updates be generalized for sharding without costing too much in terms of node communication?
03:17 < Saizan> yeah, but figuring out what should go where from that principle might not be straightforward
03:17 < mae> ie, i need to perform an update, so i need to figure out where I have space
03:17 < jsn> hmm
03:18 < jsn> right
03:18 < mae> this uses the space more efficiently; however, if i use an arbitrary / stateless algorithm, my update cost will be less
03:18 < jsn> these are all problems i do not have anything concrete to say about, yet
03:18 < mae> so really, I think finding the right balance between flexibility and simplicity will be difficult :)
03:19 < jsn> well, right now, i am just looking for correctness
03:19 < mae> jsn: bravo.
03:19 < jsn> my goal is that the algorithms i describe should be obviously correct
03:19 < mae> again, thanks for the talk, thanks for dealing with all my poking / prodding, I am gonna hit the hay.
03:19 < jsn> aye
03:19 < mae> jsn: yep! can't wait for the paper on sharding
03:20 < jsn> good night
03:20 < mae> jsn: I will read your consistent logging paper tomorrow, and see if I can play some more devil's advocate :)
03:20 < mae> to help you make it more correct.
03:20 < jsn> i appreciate the prodding -- i very much want to be challenged to show that this is right
03:21 < mae> (at least from a human intuition point of view, I am not very familiar with formal mathematical proofs and/or correctness-proving utilities just yet)
03:21 < jsn> papers on these programs are so opaque -- and often hide devious failure cases -- i want to be sure people can see how the system is correct
03:22 < mae> jsn: yeah definitely, unravel the "magic" of the distributed system
03:22 < mae> there is not very much good stuff on the internet about real production stuff that is used
03:22 < jsn> right
03:22 < mae> just the basic 2-phase algorithms on wikipedia
03:22 < mae> kind of sucks
03:22 < jsn> right
03:22 < jsn> and the Paxos paper is ludicrous
03:22 < mae> : )
03:22 < mae> ok really gotta run! night
03:22 < jsn> g'night
03:23 < jsn> please allow a week's time for the sharding paper
16:16 < Saizan> thanks to nominolo we now can build haddock documentation with ghc STABLE! yay!
16:16 < Saizan> though we have a "parse error in doc string" in happstack-ixset
21:15 < stepcut> mae: ping
22:52 < h_buildbot> Build for ghc-6.10.1 OK. Test suite ran. 15/77 test cases failed. Check http://buildbot.happstack.com/ for details.
22:52 < stepcut> mae: http://src.seereason.com/urlt/
22:52 < stepcut> mae: I will merge it into happstack eventually, but I am still sorting it out. The URLTH module is new. It uses TH to create prettier URLs
22:53 < stepcut> mae: one unsolved issue I am thinking about is how to encode the request method (POST/GET/HEAD/etc).
22:54 < stepcut> mae: an obvious choice is to make a type like, data Method a = POST a | GET a, and then do, POST UploadImage
22:55 < stepcut> but the problem is that the request method can only appear once in the URL, so this makes no sense:
22:55 < stepcut> POST (GET UploadImage)
22:55 < stepcut> making the request method be the first component does not work either, because when you embed a component in a larger site, it won't be at the beginning anymore
22:56 < stepcut> e.g., JeremysGallery (POST UploadImage)
23:05 < h_buildbot> Build for ghc-6.8.3 OK. Test suite ran. 15/77 test cases failed. Check http://buildbot.happstack.com/ for details.
23:14 < h_buildbot> Build for ghc-6.10.1 OK. Test suite ran. 15/77 test cases failed. Check http://buildbot.happstack.com/ for details.
--- Log closed Sat Feb 28 00:00:27 2009