Experimental IRC log happs-2008-01-16

These logs are provided as an experiment in indexing discussions using IRCHub.py, Irc2RDF.hs, and SIOC.

15:39:17<mightybyte>Anyone awake here?
15:40:52<pulczynsk1>yup
15:40:53<pulczynsk1>:)
15:42:33<mightybyte>Aha
15:42:52<mightybyte>I've been looking into happs for a web app.
15:43:12<mightybyte>Is it useable for an app that needs more state than can be stored in memory?
15:57:03<mightybyte>The reason I ask is because I've seen the in-memory state aspect emphasized.
16:07:46<pulczynsk1>mightybyte: whole point is about keeping data in memory (with acid guarantees), so short answer is: no. :)
16:09:38<mightybyte>Hmmm, so it seems that happs wouldn't be an appropriate choice for large-scale apps.
16:11:03<mightybyte>Since it's advertised as having the ability to scale, this is a little disappointing.
16:11:57<pulczynsk1>mightybyte: What do you call "large-scale" apps ?
16:12:27<mightybyte>I'm mainly referring to apps that have large storage requirements (and thus can't reside entirely in memory).
16:13:45<mightybyte>i.e. it sounds like Google couldn't use happs for web apps of their scale and storage requirements
16:14:04<pulczynsk1>mightybyte: you should check this thread http://groups.google.com/group/HAppS/browse_thread/thread/d2eea2f5170bbb0c (about data sharding etc)
16:14:18<mightybyte>Ok
16:14:22<pulczynsk1>mightybyte: And that's true, google couldnt use happs as its only persistence layer
16:14:48<pulczynsk1>on previous version of mainpage there were some numbers which showed that you could use happs as transaction-center of ebay
16:15:11<mightybyte>Hmmm, how would that be accomplished?
16:15:23<mightybyte>...the persistence aspect that is.
16:16:23<alexj>you store blobs on disk.
16:16:28<pulczynsk1>mightybyte: you don't need to have all data in memory, just those which are needed for this kind of transaction. If you keep them in memory you have non-blocking access to them, so you are limited only by cpus (and main bus)
16:16:36<alexj>the question is how much transactional state you have.
16:17:04<mightybyte>Ok, well let's choose a concrete example to talk about. Take something like Google reader.
16:17:27<mightybyte>I've heard that the size of their database of links is huge.
16:18:23<mightybyte>Could it be run with happs accessing a database backend when necessary?
16:18:31<alexj>mightybyte: yes.
16:18:39<alexj>mightybyte: you can do IO to whatever you want.
16:18:39<pulczynsk1>mightybyte: if I understand it properly you can just shard users over happs instance
16:18:55<pulczynsk1>s/instance/instances/
16:18:57<alexj>right now sharding would be a manual process.
16:19:03<alexj>we hope to automate it in the future.
16:19:11<mightybyte>sharding what?
16:19:23<pulczynsk1>alexj: i have heard of plan of using ec2 btw, is that true?
16:20:01<alexj>yes.
16:20:19<mightybyte>Does sharding refer to splitting users over happs instances?
16:20:21<pulczynsk1>mightybyte: users do not have common data in google reader, so you can just put a limited number of users on one server. to communicate with other users (e.g. recommend them a link) you can just use the api of a happs instance
16:20:29<pulczynsk1>mightybyte: yes.
16:21:39<mightybyte>So would a user be permanently restricted to a specific physical server, or would it be load balanced appropriately when the user logs in?
16:21:58<pulczynsk1>mightybyte: it would be restricted to one server, true
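
A minimal sketch of the "one user, one server" mapping described above, assuming a fixed list of instances and a hash of the user id. The server names and the functions `shard_for_user` and `server_for_user` are hypothetical and are not part of HAppS, where sharding is described in this log as a manual process.

```python
import hashlib

# Hypothetical instance names, for illustration only.
SERVERS = ["happs-0.example", "happs-1.example", "happs-2.example"]


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index, the same way on every login."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def server_for_user(user_id: str) -> str:
    """Return the single server that holds this user's state."""
    return SERVERS[shard_for_user(user_id, len(SERVERS))]
```

Hashing spreads the number of users roughly evenly over the instances, but it does nothing about how active each user is, and changing the number of instances moves most users to a different server.
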
16:22:08<mightybyte>That seems like it might not scale
16:22:23<pulczynsk1>mightybyte: scale-up is something different than scaling-out
16:22:32<mightybyte>It's certainly not a stretch to suppose that groups of users would not be evenly distributed.
16:23:08<pulczynsk1>mightybyte: you are comparing it to a common database solution, which you also have to scale at some point, and the key to scalability is keeping data in ram (eg mysql cluster)
16:23:59<mightybyte>Yeah, I guess my picture of how to properly use state and backend database storage is not quite there yet.
16:24:05<pulczynsk1>mightybyte: so using happs you just consolidate the web frontend with the database (gain: don't have to use external caches)
16:24:26<mightybyte>Yeah
16:24:44<pulczynsk1>mightybyte: I'm not really an expert in databases, but at some point you have to split/shard your data over multiple instances
16:25:02<pulczynsk1>if you want to do this using multi-master replication you are limited by INSERT or UPDATE statements (when using SQL)
16:25:14<pulczynsk1>because every INSERT must be propagated across nodes
16:25:19<mightybyte>So if users are split among servers, when a user logs out and then back in, does he get the same server as the first login?
16:25:33<pulczynsk1>that would be one of the solutions
16:25:36<alexj>mightybyte: you would typically multimaster each shard.
16:25:48<mightybyte>What does that mean?
16:26:32<alexj>there is multimaster code in the repos but we have not made it production yet, but the concept is that each shard would be fully replicated so a user could hit any server that holds his/her data.
16:26:43<alexj>and any update to any of those servers would get propagated to the others.
16:26:51<mightybyte>Ok
16:27:06<alexj>so you load balance across multimastered instances.
16:27:30<pulczynsk1>alexj: what about acid guarantees in this case?
16:27:38<alexj>you get acid per shard.
16:27:49<pulczynsk1>alexj: so what about conflicts?
16:27:53<alexj>you don't get acid across shards.
16:27:58<alexj>conflicts?
16:28:03<mightybyte>I guess I need to figure out in my mind what data goes in happs state and what would go in the database
16:28:56<pulczynsk1>user hits server A and says DELETE X (and A deletes X) and then hits B which has X in its state because state-change hasn't been propagated to it yet
16:28:59<alexj>mightybyte: think of it as hierarchical storage management.
16:29:27<mightybyte>I'm thinking that in an application like Google Reader, there would be little to no user state.
16:29:28<pulczynsk1>alexj: i am talking about some kind of distributed lock (or smth)
16:29:45<alexj>pulczynsk1: you get ACID per exposed state method.
16:30:08<mightybyte>If a user deleted a link from his list, then that would be reflected in the database, not the happs state.
16:30:40<alexj>pulczynsk1: yes you could in principle see a propagation delay. that is true of any form of replication. it is just unlikely that the user would hit the other server before replication happened.
16:30:41<pulczynsk1>mightybyte: that would mean you are using happs just as a cache layer to db?
16:30:54<mightybyte>I guess so.
16:31:29<pulczynsk1>alexj: yeah, but in this scenario you don't gain so-called "scaling-out" (transparently) (e.g. I cannot use it in a banking application)
16:31:44<pulczynsk1>alexj: on the other hand it's enough for most web2.0
16:31:45<alexj>in any case, you couldn't get a conflict because the operations have a total order so serverB may still have X but you can't do anything with it.
16:31:46<mightybyte>If the happs state only contains data specific to a given session instance, then users would be able to get different happs servers and it would be possible to load balance properly.
16:32:27<alexj>pulcynzsk1: you can'
16:32:32<pulczynsk1>alexj: right. i didnt get it at first.
16:32:34<alexj>you can't get ACID across shards.
16:32:50<alexj>but you can get ACID within shards because within a shard all operations have a total order.
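
The total-order argument can be sketched as follows, assuming every replica of a shard applies the same ordered log of operations. The classes `Replica` and `Shard` are hypothetical and use a single in-process log; they illustrate the idea only and are not the multimaster code in the HAppS repos.

```python
class Replica:
    """One server holding a full copy of a shard's state."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # how many log entries this replica has applied

    def apply(self, op, key, value=None):
        """Apply one operation and report whether it succeeded."""
        if op == "put":
            self.state[key] = value
            ok = True
        elif op == "delete":
            ok = key in self.state
            self.state.pop(key, None)
        elif op == "update":
            ok = key in self.state  # fails if the key is already gone
            if ok:
                self.state[key] = value
        else:
            raise ValueError(f"unknown operation: {op}")
        self.applied += 1
        return ok


class Shard:
    """A shard: one ordered log, replayed in order by every replica."""

    def __init__(self, replica_count):
        self.log = []
        self.replicas = [Replica() for _ in range(replica_count)]

    def submit(self, op, key, value=None):
        # Appending to the log fixes the operation's place in the order.
        self.log.append((op, key, value))

    def catch_up(self, index):
        """Apply all pending log entries, in order, on one replica."""
        replica = self.replicas[index]
        results = []
        while replica.applied < len(self.log):
            results.append(replica.apply(*self.log[replica.applied]))
        return results


# Scenario from the discussion: A deletes X while B lags behind.
shard = Shard(replica_count=2)
shard.submit("put", "X", 1)
shard.catch_up(0)
shard.catch_up(1)
shard.submit("delete", "X")
shard.catch_up(0)                          # server A has applied the delete
b_had_x = "X" in shard.replicas[1].state   # server B has not caught up yet
shard.submit("update", "X", 2)             # ordered after the delete
b_results = shard.catch_up(1)              # B must apply the delete first
a_results = shard.catch_up(0)
```

Server B still holds X for a moment, but the update is ordered after the delete in the log, so it fails on B exactly as it does on A and both replicas end in the same state.
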
16:33:02<mightybyte>It seems like this approach would be a good way to handle large amounts of data persistence because then you can let the database guys take care of their own scaling issues.
16:33:53<mightybyte>alexj: Yeah, and in the approach I've just described, I don't think you'd need ACID across shards.
16:33:59<alexj>pulczynsk1: you probably don't want to distribute the bank accounts across multiple shards unless you do something clever algorithmically like locking some of the balance until the transaction is confirmed at both ends (which is how the credit card system works)
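
A rough sketch of "locking some of the balance until the transaction is confirmed at both ends", assuming each account lives on its own shard and is only updated locally. The `Account` class, the `transfer` function, and the `credit_ok` flag (standing in for the remote shard's confirmation) are hypothetical names for illustration, not a real payment or HAppS API.

```python
class Account:
    """An account whose balance can be partly reserved by holds."""

    def __init__(self, balance):
        self.balance = balance
        self.holds = {}  # hold id -> reserved amount

    def available(self):
        """Balance that is not reserved by any pending hold."""
        return self.balance - sum(self.holds.values())

    def place_hold(self, hold_id, amount):
        """Reserve funds; refuse if they are not available."""
        if amount <= 0 or hold_id in self.holds or amount > self.available():
            return False
        self.holds[hold_id] = amount
        return True

    def confirm(self, hold_id):
        """The other end confirmed: turn the hold into a real debit."""
        self.balance -= self.holds.pop(hold_id)

    def release(self, hold_id):
        """The other end failed: give the reserved funds back."""
        self.holds.pop(hold_id, None)


def transfer(src, dst, hold_id, amount, credit_ok=True):
    """Move money between accounts on different shards using a hold."""
    if not src.place_hold(hold_id, amount):
        return False
    if not credit_ok:  # stands in for a failed confirmation from dst's shard
        src.release(hold_id)
        return False
    dst.balance += amount
    src.confirm(hold_id)
    return True
```

Each step touches only one account, so each can be an ordinary single-shard transaction; the hold keeps the reserved funds from being spent twice while the other end is still pending.
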
16:34:21<alexj>mightybyte: I really don't want to use an external DBMS ever again.
16:35:04<alexj>if you did a banking app you just need to think about whether you really need to shard the accounts table.
16:35:05<mightybyte>alexj, I have the same feelings, but they're pretty well-understood, and can be outsourced to people who don't mind.
16:35:31<alexj>mightybyte: my hope for happs is that few apps are sufficiently large that you need to outsource.
16:35:44<mightybyte>Hmmm
16:35:45<alexj>haskell coding efficiency is just soo much higher than for other languages.
16:35:54<mightybyte>Oh, I agree
16:36:00<mightybyte>That's why I'm here discussing this right now.
16:36:07<pulczynsk1>:)
16:36:14<mightybyte>...otherwise I'd be off coding in Python or Java
16:36:25<alexj>the coordination costs of adding programmers to the team dwarf the management savings of outsourcing.
16:36:47<alexj>you can't produce a baby in one month by impregnating nine women. etc.
16:36:54<mightybyte>But I still can't help thinking about situations like Google's.
16:36:54<pulczynsk1>lol
16:38:42<mightybyte>How would you do something like reader without outsourcing the persistence to a database?
16:38:51<mightybyte>...solve the scaling issues that is.
16:38:58<alexj>so let's spec out what reader is?
16:39:12<mightybyte>Ok, you have a huge "database" of links
16:39:22<mightybyte>...and feeds
16:40:12<mightybyte>Links are associated with feeds, and multiple users can be subscribed to a given feed.
16:40:52<alexj>ok so why not shard the link database?
16:40:57<mightybyte>I think you can ignore other details like tags for feeds and/or links.
16:41:23<mightybyte>Ok, I don't understand all the ramifications of that.
16:41:59<alexj>mightybyte: you need to read up on sharding issues in general. you have to do it for mysql too.
16:42:26<mightybyte>Yes, I don't claim to be an expert, but I'm trying to understand it in the context of happs.
16:42:42<alexj>the gist is that you need to divide the link database across multiple servers. when you need to find something, you query all of them and collect results together.
16:42:56<mightybyte>Yeah, I understand that.
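
The "query all of them and collect results together" step can be sketched like this, assuming each shard holds part of the link database and can answer a lookup on its own. `LinkShard` and `scatter_gather` are hypothetical names, and in-process objects stand in for what would be separate servers.

```python
from concurrent.futures import ThreadPoolExecutor


class LinkShard:
    """One slice of the link database, stored as (feed, url) pairs."""

    def __init__(self, links):
        self.links = list(links)

    def find(self, feed):
        """Return the urls this shard holds for the given feed."""
        return [url for f, url in self.links if f == feed]


def scatter_gather(shards, feed):
    """Ask every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: shard.find(feed), shards)
        return [url for part in partials for url in part]
```

With two shards holding `("news", "a")` and `("news", "c")`, a call such as `scatter_gather(shards, "news")` collects the urls from both. The cost is that every lookup touches every shard, so the slowest shard sets the response time.
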
16:43:27<mightybyte>So that would imply that the state sharding is somewhat separate from the web server interface.
16:43:54<mightybyte>If that's the case, I hadn't realized it until now.
16:44:04<pulczynsk1>i would recommend reading http://www.highscalability.com/ and real world examples : http://highscalability.com/links/weblink/24 . at some point all of them have to shard data of some kind. you have to do exactly this in happs, but are freed from using external tools and things like locking, IO-bounding etc (you don't wait for resources)
16:44:17<mightybyte>Or rather, the implications weren't clear until now.
16:44:37<pulczynsk1>so, happs by itself isn't a cure for scaling out (just for scaling up, which is sufficient in most cases)
16:45:33<mightybyte>So is the state management and sharding independent from the web server parts of happs?
16:45:50<alexj>mightybyte: happs is divided into different repos for a reason. happs-http is your web server layer. happs-state is another layer. the nice thing is that you get a single executable to handle it all.
16:45:56<alexj>yes
16:46:10<mightybyte>Ahhh, the light is coming on.
16:47:28<pulczynsk1>bah, i must leave, but for sure i will check logs tomorrow (as i am evangelist of happs in my company ;)
16:48:19<mightybyte>I had assumed that if I used the happs state, I would be implicitly splitting data into groups of users. But it sounds like that's wrong.
16:49:28<alexj>mightybyte: you might also shard your users
16:49:36<mightybyte>Certainly
16:49:51<mightybyte>But I didn't realize that happs would allow both dimensions of sharding.
16:49:59<alexj>though you may find that you need to have a lot of users to justify bothering.
16:50:26<alexj>mightybyte: just to be clear happs is not production quality for this sort of thing yet so you are playing with fire if you are planning to scale like this today.
16:50:40<mightybyte>Right, I understand.
16:50:51<mightybyte>But that's the goal
16:51:07<mightybyte>right?
16:51:17<alexj>yes.
16:51:27<alexj>the next step is to get multimaster nice.
16:51:29<alexj>then sharding.
16:51:56<mightybyte>Well this is really encouraging.
16:52:28<mightybyte>It sounds like I can forget worrying about a database schema and just design my data structures in haskell.
16:53:02<alexj>thats the idea.
16:53:29<mightybyte>Wow, we need to improve the documentation so that this is more accessible.
16:53:43<alexj>very much. could definitely use help.
16:54:21<mightybyte>So what does happs have today?
16:54:54<alexj>working code for a single non-sharded server. we are stabilizing apis now.
16:55:06<mightybyte>apis?
16:55:44<alexj>how you declare your state so that you get ACID query/update operations on it.
16:56:06<mightybyte>Ok
16:56:12<alexj>check out the AllIn example on the homepage.
16:56:25<alexj>and you'll get a flavor for what is happening.
16:56:58<mightybyte>Hmm, I'm not seeing that. Link?
17:00:13<mightybyte>Oh, is it in the repository in HAppS-HTTP/Examples/AllIn.hs?
17:00:32<alexj>yes
17:00:47<mightybyte>I haven't found a way to access that from the website.
17:00:49<alexj>its labeled as State+HTTP together on the homepage.
17:01:02<alexj>yeah, I need to fix repo directory indexing.
17:01:02<mightybyte>Oh, ok
17:01:38<mightybyte>What is the "$(deriveNewData..." bit?
17:03:46<alexj>until the next version of haskell, we need that because happs uses generic programming interfaces to reduce the amount of boilerplate you need to write.
17:04:12<alexj>"deriving Data" does not yet provide enough stuff so we need templatehaskell to provide the rest.
17:04:32<mightybyte>What's the meaning of the $()?
17:05:08<alexj>it is templatehaskell
17:05:11<mightybyte>Is that from templatehaskell? I'm not familiar with it.
17:05:13<mightybyte>Ok
17:05:16<mightybyte>I'll look that up.
17:05:29<mightybyte>(still learning haskell too)
17:08:05<mightybyte>Is templatehaskell pretty standard?
17:10:12<mightybyte>alexj: I have to go. Thanks for all the help.
