--- Log opened Mon Mar 16 00:00:40 2009
00:10 < mae> elo
00:13 < stepcut> hello
01:19 < mae> stepcut: hi :) messing around with hint
08:46 < stepcut> mae: fun
08:57 < mightybyte> Can happstack have a component with kind * -> *?
09:18 < stepcut> mightybyte: probably not, but I would need to see a more specific example
09:21 < mightybyte> Well, the example that I was debugging the other day...
09:22 < mightybyte> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2419
09:22 < mightybyte> I worked around the GHC panic by not using inferIxSet.
09:23 < stepcut> one moment
09:24 < mightybyte> Assume for ease of explanation that I used inferIxSet.
09:24 < stepcut> I do something similar here, http://src.seereason.com/happstack-extra/src/Happstack/Server/Account.hs
09:24 < mightybyte> (I think my Indexable instance ends up being the same anyway)
09:25 < stepcut> I don't have time to look at it right now, but maybe that will help
09:25 < mightybyte> Ok
11:51 < sm> morning all
11:53 < sm> thanks for the good work moving happs forward!
11:55 < sm> that said, I'm sorry to say that http://tutorial.happstack.com/tutorial/get-and-post explains nothing. Has anyone got a minimal example of  a request handler that prints the value of a query input ?
12:00 < wchogg> Heh...I'll make that the next section I write more examples for.
12:02 < sm> thank you. I say this after half an hour of studying the tutorial, source, api docs, etc.
12:03 < sm> and a previously working HAppS implementation
12:04 < sm> in general I feel I need minimal examples of everything, a tutorial before the tutorial - happstutorial is advanced
12:04 < wchogg> I understand & I apologize for the obfuscation.  I'll try and get some good examples in the darcs head of the tutorial soon.
12:05 < sm> I don't mean to complain wchogg, thanks for your good work
12:05 < wchogg> Right, I'm trying to add those in section by section.  My last set of examples was for the basic MACID stuff.
12:05 < sm> just feedback from the trenches
12:06 < wchogg> You're not complaining.  Complaining would be coming in here and saying "So who's the asshole who won't write a good example?"
12:06 < sm> oh, well I'm complaining in my mind :)
12:07 < sm> can't help it
12:07 < mightybyte> sm: So what precisely are you looking for?
12:07 < sm> a minimal example of  a request handler that prints the value of a query input
12:08 < mightybyte> Input from the query string or post data?
12:08 < sm> either would be fine
12:08 < sm> but I did say query :)
12:08 < mightybyte> Yeah, the query data could come from either the path or the additional data.
12:12 < sm> i.e. here's a minimal request handler: return "HELLO WORLD". Now let's print a query input from the request; add: ...
12:12 < sm> I just reminded myself of another happs tutorial .. let me seek that
12:14 < sm> http://articles.bluishcoder.co.nz/Haskell/NotAHAppSTutorial#retrieving-form-data-using-data-types and http://articles.bluishcoder.co.nz/Haskell/NotAHAppSTutorial#retrieving-form-data-using-functions
12:15 < sm> are those still current, or should I avoid ?
12:19 < wchogg> It's semi-current.  You'll have to change namespaces from HAppS to Happstack and there might be a few minor changes.
12:22 < wchogg> I'd say try them with just the namespace changes, and if they don't work ask here and I'll try to help
12:27 < mightybyte> sm: I also have some working example code at http://softwaresimply.blogspot.com.  It's not exactly minimal, but it demonstrates some concepts fairly simply.
12:29 < sm> mightybyte: thank you, where is that code ?
12:29 < mightybyte> sm: I build it up in a series of posts.
12:47 < mightybyte> sm: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2467
12:48 < mightybyte> sm: That's a pretty minimal example for you.
12:49 < sm> mightybyte: thanks!
12:53 < sm> success
12:53 < mightybyte> The ideas in that example are described in my first blog post on softwaresimply
12:53 < sm> here is my minimal version: withDataFn (look "a") $ \a -> return $ "a is: " ++ a
12:54 < mightybyte> Yeah, that works.
12:55 < mightybyte> I separated it into functions because it's easier to investigate the types that way.
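
A minimal sketch of the handler sm settled on above, written against the happstack-server API of roughly this period (simpleHTTP, nullConf, withDataFn, look, ok); signatures shifted between releases, so treat it as illustrative:

    import Happstack.Server (look, nullConf, ok, simpleHTTP, withDataFn)

    -- Responds to e.g. http://localhost:8000/?a=hello with "a is: hello".
    -- withDataFn runs the RqData computation (look "a") against the request
    -- and only calls the continuation when the query input "a" is present.
    main :: IO ()
    main = simpleHTTP nullConf $
        withDataFn (look "a") $ \a -> ok $ "a is: " ++ a
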
13:37 < sm> shouldn't dir "" work for matching / ?
13:46 < mightybyte> I think a toplevel anyRequest does that.
13:49 < sm> http://site/dir and http://site/dir/ seem to match differently
13:50 < mightybyte> Yes
14:11 < stepcut> sm: they do, and that is a good thing, for some reason I can't remember ;)
14:15 < sm> alrighty.. I just was wondering how to match on (http://site or http://site/) but not http:/site/randomurl
14:17 < stepcut> sm: methodM / methodOnly only match if there is no more path left
14:17 < sm> thanks
14:17 < stepcut> it didn't use to be that way, which tripped me up for a while one day
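
A sketch of the root-matching trick stepcut describes, using the single-handler style of simpleHTTP (some releases of this period took a list of handlers instead); the point is that methodM only succeeds when no path segments remain:

    import Control.Monad (msum)
    import Happstack.Server (Method (GET), dir, methodM, nullConf, simpleHTTP)

    main :: IO ()
    main = simpleHTTP nullConf $ msum
        [ dir "about" $ return "the about page"   -- matches http://site/about
        , methodM GET >> return "the front page"  -- matches http://site and
                                                  -- http://site/ only: methodM
                                                  -- fails if any path remains
        ]
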
15:00 < stepcut> mightybyte: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=2419#a2473 ?
15:11 < mightybyte> stepcut: Is the \' after TestDB the only change?
15:12 < stepcut> no
15:12 < stepcut> it infers the type on KeyedEntry instead of TestEntry
15:13 < stepcut> inferIxSet does not like type synonyms
15:14 < stepcut> someone should file a bug on that
15:16 < mightybyte> Yeah, that looks like what I want.
15:17 < mightybyte> I need to play with it a bit.
15:18 < mightybyte> I'd file one, but I don't know that I understand the problem well enough.
15:18 < stepcut> mightybyte: to actually use the IxSet in a component, you will have to fix it to a specific 'a'
15:19 < mightybyte> Yeah, that's what I was afraid of.
15:19 < stepcut> mightybyte: similar to how I use the Accounts IxSet in this example, http://src.seereason.com/examples/happs-logon-example/Logon/Main.hs
15:20 < mightybyte> I don't want to fix it to one specific 'a', I want to fix it to any 'a' that is an instance of a particular type class.
15:21 < mightybyte> Can that be done?
15:22 < stepcut> mightybyte: eventually you have to fix the type
15:22 < mightybyte> Can I use an existential type?
15:23 < stepcut> mightybyte: doubt it
15:23 < stepcut> mightybyte: the problem really comes down to the fact that you are going to write the data to disk, and read it back later. So, you have to know what type it is that you are reading in, yes?
15:24 < mightybyte> Hmmm
15:24 < stepcut> mightybyte: you can do almost everything polymorphically, except make the Component instance.
15:25 < stepcut> mightybyte: so the functions you pass to mkMethods, etc, can be polymorphic. But, your Component instance must be monomorphic.
15:25 < mightybyte> Yeah
15:27 < mightybyte> And I can't get it to be monomorphic by creating a "container" type: data Container = forall t. (MyClass t) => Container (LogEntry t) ?
15:29 < mightybyte> ...then use IxSet Container as my state?
15:29 < stepcut> mightybyte: Container has to be an instance of Serialize and Version. Seems like maybe that wouldn't be possible, but I am not positive. You will have to try and see.
15:29 < mightybyte> Ok
15:30 < stepcut> mightybyte: in order to deserialize the data, you would have to know the type of the data you are deserializing, but that is all hidden away from the type checker
15:30 < mightybyte> I don't know enough about the compiler internals to know.
15:31 < stepcut> mightybyte: this is pretty unrelated to compiler internals
15:31 < stepcut> mightybyte: the serialization stuff is basically a compact binary form of Read/Show
15:32 < stepcut> so, you can just think about what would happen when you use Read/Show to write something to a file and read it back. The problems you face are essentially the same.
15:33 < stepcut> If I tell you I have a file on the disk that is full of Container, where, data Container = forall a. Container [a], could you successfully read it ?
15:33 < mightybyte> Seems like it should be doable with open types
15:35 < stepcut> mightybyte: since 'a' can be anything, you have to have something that can parse a string representation of any value a
15:36 < mightybyte> But Container is of kind *
15:36 < stepcut> yes, but it can contain, forall a.
15:37 < stepcut> so, to read something of type 'a' you have to have a parser.
15:38 < mightybyte> Ahh, something I'd have to write manually instead of using deriveSerialize
15:41 < stepcut> mightybyte: that could work. Your parser will be partial, and result in runtime errors if you try to parse a type that you don't check for
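
A sketch of that tag-based partial parser, using Read/Show as a stand-in for happstack's Serialize (as stepcut suggests thinking of it); Container, writeTagged, and readTagged are hypothetical names:

    {-# LANGUAGE ExistentialQuantification #-}

    data Container = forall a. Show a => Container [a]

    -- Writing is easy: the existential context lets every element show itself,
    -- and we add a tag recording the concrete element type.
    writeTagged :: String -> Container -> String
    writeTagged tag (Container xs) = show (tag, map show xs)

    -- Reading is where it hurts: the concrete type must be picked from the
    -- tag, so the parser only covers the types enumerated here, and any other
    -- tag is a runtime error -- exactly the partiality warned about above.
    readTagged :: String -> Container
    readTagged s = case read s of
        ("Int",    xs) -> Container (map read xs :: [Int])
        ("String", xs) -> Container (xs :: [String])
        (tag,      _ ) -> error ("readTagged: unknown tag " ++ tag)

    main :: IO ()
    main = case readTagged (writeTagged "Int" (Container [1, 2, 3 :: Int])) of
        Container ys -> print (map show ys)   -- ["1","2","3"]
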
15:41 < mightybyte> My original implementation used a closed type, but I decided to try using an open type to increase modularity.
15:41 < mightybyte> Yeah, I'd rather avoid doing that work.
15:42 < mightybyte> I guess I'll have to go back to a closed type.
15:42 < stepcut> mightybyte: serialization is a cruel mistress
15:42 < mightybyte> :(
15:42 < stepcut> mightybyte: unless you have something like mobile haskell that lets you serialize closures
15:43 < mightybyte> mobile haskell?  Never heard of that.
15:45 < stepcut> mightybyte: it's probably bitrotted now, but it was a way of sending running bits of haskell code from one machine to another.
15:45 < mightybyte> That would be cool.
15:47 < stepcut> mightybyte: yeah, required a modified ghc
16:36 < mae_work> regarding the code referring to utf-8 in gitit: http://github.com/jgm/gitit/commit/fd9eed57f0dd9a0b989428fd3ed9b8bf75e2a33c
16:36 < mae_work> is this still true? We also need to use toString to convert to utf-8 (since happstack doesn't do this).
16:37 < mae_work> I was under the assumption that we now assume and convert-to utf8, in lieu of proper character-set handling
16:37 < stepcut> mae_work: I think we intend to, not sure if we actually do yet. I am pretty sure the guestbook app does it 'manually'
16:39 < mae_work> i thought dsrogers did something like that
16:39 < mae_work> hmm
16:39 < mae_work> where to look
16:41 < stepcut> mae_work: SimpleHTTP
16:41 < mae_work> yeah i looked in there
16:41 < mae_work> lets see here
16:44 < stepcut> mae_work: if it just does an unpack, then it does not support utf-8
16:44 < stepcut> in the look* functions
16:44 < mae_work> is what gitit is doing a desired behavior?
16:44 < mae_work> or is this an edge case
16:52 < mae_work> toString: http://hackage.haskell.org/packages/archive/utf8-string/0.3.4/doc/html/Data-ByteString-Lazy-UTF8.html#v%3AtoString
16:53 < mae_work> their impl is here
16:53 < mae_work> http://github.com/jgm/gitit/blob/fd9eed57f0dd9a0b989428fd3ed9b8bf75e2a33c/Gitit/Server.hs
16:53 < mae_work> so basically, question is, should this be happening in 99.9 percent of the cases?
16:53 < mae_work> (proper utf8 decoding)
16:55 < stepcut> I think so, until we have a system for supporting arbitrary encodings
16:55 < mae_work> lol, looks like this doesn't do anything: toString :: B.ByteString -> String; toString bs = foldr (:) [] bs
16:55 < mae_work> um
16:55 < mae_work> yeah look at this
16:55 < mae_work> http://hackage.haskell.org/packages/archive/utf8-string/0.3.4/doc/html/src/Data-ByteString-Lazy-UTF8.html#toString
16:55 < mae_work> does haskell assume utf8?
16:56 < mae_work> i mean, it doesn't seem to do anything special
16:57 < stepcut> mae_work: that's not your father's foldr
16:57 < mae_work> Data.ByteString.Lazy.UTF8.foldr
16:57 < stepcut> mae_work: it's defined further down and calls uncons, which calls decode
16:57  * stepcut goes to the store
16:58 < mae_work> ok question, why on earth would we want to convert into a string by default
16:58 < mae_work> doesn't that defeat the purpose of lazy bytestring
16:58 < mae_work> (the lazy part at least)
16:58 < stepcut> Strings are lazy...
16:58 < mae_work> and inefficient..
16:59 < mae_work> I'm just talking this  out
16:59 < mae_work> the user should be able to decide whether they want to incur the overhead of converting everything to String first, right?
16:59 < mae_work> they might be able to process a bytestring directly
16:59 < mae_work> i.e. if its binary data
16:59 < mae_work> (which string sucks at)
16:59 < stepcut> mae_work: well, you don't have to. But if you want to do String oriented operations, like toUpper/toLower, length, etc, then you either need to use String, or a library that knows how to do all those things using utf-8 encoded data
17:00 < stepcut> mae_work: that's what lookBS is for
17:00 < mae_work> stepcut: so really what we might want to change here is not lookBS but look
17:00 < mae_work> look does an unpack
17:00 < stepcut> mae_work: lookBS does not.
17:01 < mae_work> which doesn't really do anything graceful in the instance where there is invalid char data
17:01 < mae_work> hence UTF8.toString
17:01 < mae_work> is this sound so far? or do i need to go to the asylum :)
17:02 < stepcut> there is no such thing as 'invalid char data' as far as unpack is concerned.
17:02 < mae_work> what assumptions does unpack make?
17:02 < stepcut> but, if the incoming data is utf-8 encoded, and you use unpack to turn it into a String... that is just a bug :)
17:02 < stepcut> mae_work: that a Char is big enough to hold a byte worth of data
17:03 < mae_work> perhaps we should make look's return type polymorphic
17:03 < stepcut> mae_work: It just turns Word8 into Char. It does not change the 'value' at all.
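
A concrete illustration of that point, assuming the bytestring and utf8-string packages; the two sample bytes are the UTF-8 encoding of 'é' (code point 233):

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.Char8 as C8
    import qualified Data.ByteString.Lazy.UTF8 as UTF8

    main :: IO ()
    main = do
        let bs = L.pack [0xC3, 0xA9]    -- UTF-8 bytes for 'é'
        print (C8.unpack bs)            -- "\195\169": one Char per byte, undecoded
        print (UTF8.toString bs)        -- "\233": properly decoded to one Char
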
17:03 < stepcut> mae_work: polymorphic in what way?
17:03 < mae_work> you specify which encoding you want
17:03 < mae_work> i don't know if a lib like this exists
17:04 < mae_work> but basically, something similar to how regexp handles what you want returned based on the type sig you give it
17:04 < stepcut> so what might the return type for 'look' be ?
17:05 < mae_work> right now it is look :: String -> RqData String
17:05 < mae_work> could be look :: String -> RqData d
17:05 < stepcut> for that type, the implementation would have to be, look _ = return undefined
17:06 < mae_work> type RqData a = ReaderT ([(String,Input)], [(String,Cookie)]) Maybe a
17:06 < paulvisschers> lookRead?
17:06 < mae_work> so what is the elegant way to do this?
17:07 < paulvisschers> (I haven't really read anything, so feel free to ignore if it doesn't make sense)
17:07 < mae_work> how does the regexp library do this?
17:07 < mae_work> you can give diff type sigs to =~
17:07 < stepcut> mae_work: I am not sure what you are looking for. What do you mean by 'encoding'? Do you mean utf-8 vs latin-1 vs jp ?
17:08 < mae_work> stepcut: i just mean that the user specifies, via the return type, what format they want it in.
17:08 < mae_work> but maybe this is unnecessary
17:08 < stepcut> mae_work: format?
17:08 < stepcut> mae_work: how is this different from lookRead?
17:08 < mae_work> stepcut: sorry type
17:09 < mae_work> stepcut: you're right
17:09 < stepcut> mae_work: except, lookRead is broken for anything except ascii because it calls look, which does not decode the ByteString
17:10 < mae_work> stepcut: is there any way we can automatically figure out the correct *original* encoding from the rq data?
17:10 < mae_work> like is there encoding information sent in rqdata?
17:10 < stepcut> mae_work: sometimes, but not always
17:10 < mae_work> stepcut: how would this work?
17:11 < stepcut> mae_work: when you create a form, you can specify the supported encoding/charset
17:11 < stepcut> mae_work: when the form gets POSTed, it might include an encoding
17:11 < stepcut> in the headers
17:11 < mae_work> and if it does not? :) Assume the browsers default encoding?
17:12 < stepcut> mae_work: well, if the browser does not tell you what encoding it is using, then you can't know :)
17:12 < mae_work> stepcut: so assume ascii?
17:12 < stepcut> mae_work: in general, I tell the browser what encoding it needs to use, and then trust that the browser did that
17:12 < mae_work> what is the practical solution
17:13 < mae_work> ok so if we have, encoding right
17:13 < stepcut> the practical solution is to assume utf-8
17:13 < mae_work> encoding :: Maybe String
17:13 < mae_work> when we get Nothing, what do we try :)
17:13 < mae_work> throw an ISE?
17:13 < mae_work> malformed request?
17:13 < mae_work> hehe
17:13 < mae_work> k utf-8
17:13 < mae_work> because it is compatible with ascii
17:13 < stepcut> yes
17:14 < mae_work> lookBS on the other hand
17:14 < stepcut> though not latin-1
17:14 < mae_work> is more low level
17:14 < stepcut> mae_work: sure.. unless what you need is a ByteString. Then it is the right level.
17:14 < mae_work> right :)
17:15 < mae_work> so can we reasonably replace this L.unpack with Data.ByteString.Lazy.UTF8.toString and not break any existing code?
17:15 < mae_work> (for look)
17:16 < stepcut> mae_work: yes, unless people are expecting latin-1 instead of ascii.
17:16 < stepcut> mae_work: but, for the greater good, their code will have to break. Since utf-8 is far more useful than latin-1.
17:16 < mae_work> stepcut: yeah but in that case wouldn't look break right now anyways?
17:17 < stepcut> mae_work: why would it break for latin-1?
17:17 < mae_work> umm
17:17 < mae_work> well i mean
17:17 < mae_work> how does printing / showing work for haskell strings
17:18 < mae_work> does it assume whatever your platforms character set is?
17:18 < mae_work> I guess I don't understand  how L.unpack would be smart enough to decode into latin-1, since it is not ascii (and I thought that haskell strings assumed either ascii or utf-8)
17:19 < mae_work> To be perfectly clear, a String in ghc doesn't prefer any particular encoding?
17:19 < stepcut> L.unpack just converts the type of each 'character' from Word8 to Char, but does not change the number. So, 234 :: Word8 becomes, 234 :: Char.
17:20 < stepcut> I guess, technically, that is broken.
17:20 < mae_work> What is Char
17:20 < mae_work> is that platform dependent?
17:20 < stepcut> but, because everything is single byte, and enough things are broken, if you print it, things will appear to work.
17:20 < mae_work> (when shown)
17:20 < stepcut> Char is always a unicode code point.
17:21 < mae_work> so it would seem crazy to decode it in any other way, than in utf-8 no?
17:21 < stepcut> so, yeah, ascii is the only thing that currently works correctly. I figured that out before but forgot :)
17:21 < mae_work> i mean even if they are expecting latin-1, ghc is assuming utf-8
17:21 < stepcut> ghc does not assume utf-8
17:21 < stepcut> in ghc a Char is always a unicode code point.
17:22 < stepcut> utf-8 is one of many ways to encode a unicode code point. But a Char is not encoded.
17:22 < stepcut> now, ghc does assume that haskell source files are utf-8 encoded, but that is different.
17:22 < mae_work> Char is just ascii
17:22 < stepcut> no
17:22 < mae_work> uh
17:22 < stepcut> ascii is a number between 0 and 127.
17:22 < mae_work> when you say "Char is not encoded" what do you mean then?
17:23 < stepcut> Char is a 32-bit value representing a Unicode codepoint.
17:23 < mae_work> yikes
17:23 < mae_work> so it supports wide characters then?
17:23 < mae_work> i.e utf-16 or whatever
17:24 < stepcut> no
17:24 < stepcut> it only supports unicode code points.
17:24  * mae_work is so confused.
17:24 < stepcut> utf-16 is just a way of encoding unicode code points
17:24 < stepcut> similar to utf-8
17:25 < mae_work> heh, err this: http://www.unicode.org/Public/UNIDATA/
17:25 < mae_work> unicode code points
17:25 < stepcut> mae_work: for a unicode code point, say, '96', there are many possible binary representations (aka, encodings)
17:26 < stepcut> utf-8 is nice, because ascii characters require only 1-byte, and 0 means null.
17:26 < mae_work> so once again.. when ghc wants to show some characters on screen [Char], it doesn't have any particular encoding, because Char is 32-bit wide and can contain any unicode code point, correct?
17:27 < stepcut> in other encodings, 0x96 might be the octets, 00 00 00 96
17:27 < stepcut> yes, System.IO.putStrLn does something mostly useless
17:27 < stepcut> you have to use, System.IO.UTF8.putStrLn
17:27 < mae_work> um.
17:27 < stepcut> or something similar that encodes the String
17:28 < mae_work> but System.IO.putStrLn will work for <= 8bit wide code points
17:28 < stepcut> you could do, System.IO.putStrLn (encode myString) as well
17:28 < mae_work> (work correctly)
17:28 < mae_work> is this right?
17:28 < stepcut> mae_work: I believe it will work if the encoding does not have multibyte sequences
17:29 < stepcut> mae_work: that is why you can read in latin1 as String and print it out and everything looks like it is working ok. (Even though it is not really)
17:30 < mae_work> yeah up until the String includes a multibyte character
17:30 < mae_work> man that is borked
17:30 < mae_work> no explicit encoding for Char
17:30 < stepcut> mae_work: there can be problems even without multibyte
17:31 < stepcut> mae_work: for example, if some characters above 127 in a language support upper and lower case, but the unicode code points of the same value don't, then toLower/toUpper won't do what you expect.
17:32 < stepcut> mae_work: so, pretty much only ascii works, because that is the only part that is the same in unicode and ascii
17:32  * stepcut didn't say that quite right
17:33 < mae_work> stepcut: ok so System.IO.UTF8.putStrLn would be necessary if you used something like L.unpack with a utf-8 string, right? But would it be necessary if you had used Data.ByteString.Lazy.UTF8.toString in the first place?
17:34 < stepcut> wrong!
17:35 < stepcut> if you used, L.unpack, then System.IO.putStrLn would actually output the 'right' thing, because two wrongs do make a right in this case.
17:35 < stepcut> it's like double rot-13 encoding  ;)
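
A sketch of that accidental round trip, assuming the same packages as above; the bytes pass through unchanged, which is why the terminal still shows the right glyphs:

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.Char8 as C8

    -- "Decode" UTF-8 bytes one Char per byte (wrong), then "encode" one byte
    -- per Char (also wrong): the original byte sequence survives intact, and
    -- the UTF-8-configured terminal decodes it correctly itself.
    roundTrip :: L.ByteString -> L.ByteString
    roundTrip = C8.pack . C8.unpack

    -- Any String operation in between (length, toUpper, ...) operates on
    -- bytes rather than characters, though, and gives wrong answers for
    -- multibyte input.
    main :: IO ()
    main = do
        let wire = L.pack [0xC3, 0xA9, 0x0A]   -- "é\n" in UTF-8
        L.putStr (roundTrip wire)              -- same bytes out: terminal shows é
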
17:36 < mae_work> err, yeah because utf-8 maintains ascii compat
17:36 < stepcut> no..
17:36 < mae_work> um
17:36 < mae_work> ok WTF i am so confused
17:36 < mae_work> let me start with the basics
17:36 < lanaer> if it was ascii to begin with, treating it as utf-8 won't cause problems
17:36 < stepcut> ok, so the first thing to understand is 'why does the terminal know what to display?'
17:37 < mae_work> it assumes?
17:37 < stepcut> mae_work: you have a terminal, and you send a sequence of octets (aka, bytes) to it. How does it decide what to do with those bytes?
17:37 < mae_work> it assumes 8-bit ascii, no?
17:37 < stepcut> mae_work: depends on your terminal.
17:37 < lanaer> a configured setting of some sort, usually.
17:37 < mae_work> ok so putStrLn is platform dependent
17:38 < mae_work> right?
17:38 < stepcut> mae_work: probably, but I am not sure if that is relevant.
17:38 < mae_work> heh
17:38 < mae_work> go on.
17:39 < stepcut> mae_work: the terminal expects the octets it receives to be encoded in some way. It looks at the incoming encoded bytes, decodes them, looks up the glyph in a table, and shows the glyph on the screen.
17:40 < stepcut> mae_work: for something like latin-1, the table has 256 glyphs, and the 'encoding' just maps each byte to glyph
17:41 < mae_work> right
17:41 < stepcut> mae_work: other encodings, like de, map those 256 bytes to different glyphs
17:41 < stepcut> so, sending the same byte sequence to the terminal will have different results depending on what encoding the terminal is configured for
17:41 < stepcut> mae_work: for best results, you should send byte sequences that are encoded in the same encoding that the terminal is using ;)
17:42 < mae_work> listening ..
17:42 < stepcut> now, having all these different encodings for different languages sucked, because if you wanted to view a german file, and then a spanish file, and then a japanese file, you had to keep changing your encoding
17:43 < stepcut> so, these days, most people configure their terminals to expect utf-8 encoding
17:44 < stepcut> in utf-8 code points below 127 are encoded as a single byte, and code points above 127 are encoded as multiple bytes
17:44 < lanaer> those first 128 code points being identical to all of ascii
17:44 < stepcut> so if you have a Char in haskell, that contains the code point 96. In the Char it is (probably) going to be four bytes, 00 00 00 96
17:45 < stepcut> but, when you want to print that, you just want to output '96', because that is the utf-8 encoding of 96.
17:45 < mae_work> right
17:45 < mae_work> and what if its above 127
17:46 < stepcut> if you had a character like, 234, then the output sequence is going to be multiple bytes, so even though you have something like, 00 00 00 234 in memory, you need to output something like 128 234
17:46 < stepcut> the terminal recognizes that as the utf-8 encoding of 234 and displays the proper glyph
17:46 < mae_work> would that be [128, 234] in [Char] or [128 234]
17:47 < mae_work> (single list item or multiple)
17:47 < lanaer> single item
17:47 < stepcut> data Char = Char Word32
17:47 < mae_work> k
17:47 < stepcut> so, a single item
17:47 < lanaer> if you converted it to a ByteString using pack/fromString, then it would be 2
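
A quick check of lanaer's point, assuming utf8-string; '\234' is one Char (one code point) but two bytes once UTF-8 encoded:

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.UTF8 as UTF8

    main :: IO ()
    main = do
        print (length "\234")                      -- 1: one code point in [Char]
        print (L.unpack (UTF8.fromString "\234"))  -- [195,170]: two UTF-8 bytes
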
17:48 < mae_work> is the whole 127 thing because ascii doesn't want to assume signed vs unsigned?
17:48 < mae_work> it just ignores that bit?
17:48 < stepcut> mae_work: way back in the day, computers only had 7-bits
17:48 < lanaer> ascii just uses 7 bits instead of 8, but I don't know the historical reasons for that
17:48 < lanaer> oh
17:48 < stepcut> "Historically, ASCII developed from telegraphic codes. Its first commercial use was as a seven-bit teleprinter code promoted by Bell data services."
17:49 < stepcut> anyway, the System.IO.putStrLn is dumb, when it sees Char (0 0 0 234), it just outputs 234. Which makes the terminal unhappy, since that is not valid utf-8.
17:50 < lanaer> which is why the UTF8 module has its own putStrLn
17:50 < mae_work> so anything < 128 encodes into a single ascii compatible byte, and anything >= 128 means "this is byte one of two byte sequence" ?
17:50 < mae_work> for utf-8 encoding
17:50 < lanaer> byte 1 of a 2 or more byte sequence
17:50 < mae_work> how does the decoding figure that out?
17:50 < mae_work> the length
17:51 < stepcut> now, let's say you have a non-utf-8 aware haskell program. You read in the utf-8 encoded data, (128 234), which you incorrectly treat as two characters [128,234]. Then later you print out the characters 128 followed by 234. Now the terminal is happy because you actually outputted valid utf-8
17:51 < lanaer> similar type of decision on the second byte. some range of values indicates "there's more to this sequence"
17:52 < mae_work> stepcut: so System.IO.putStrLn just truncates the word32 to a char8, regardless.
17:52 < stepcut> mae_work: my fake encoding of utf-8 is mostly wrong in these examples, but easier to understand :) http://en.wikipedia.org/wiki/UTF-8
17:52 < lanaer> the terminal is happy, but any processing you do (checking the length of it, for example) in your code will be incorrect
17:52 < stepcut> mae_work: I believe so -- or something equally useless.
17:52 < stepcut> lanaer: exactly, length, toUpper/toLower, etc, will all be problematic
17:53 < stepcut> calling toUpper might accidentally convert half the octets of a multibyte sequence
17:53 < mae_work> ic.
17:53 < lanaer> or not convert some accented character that it should have
17:54 < stepcut> mae_work: so, a unicode code point is a fixed number representing a character. But, there are many ways (aka, encodings) of representing that code point.
17:54 < mae_work> stepcut: so again, like i said before, haskell Char has no internal concept of what encoding it is
17:54 < mae_work> it just is
17:54 < mae_work> but System.IO.putStrLn assumes it is ascii
17:55 < lanaer> (a fun hobby of mine is to take any programming exercise involving strings in a language I'm learning (currently haskell, of course), and then tweak it so that I can feed non-ASCII text to it without breaking anything)
17:55 < stepcut> mae_work: right, it is just a unicode code point. And System.IO.putStrLn assumes ascii.
17:55 < stepcut> lanaer: the guestbook application works properly :)
17:55 < mae_work> stepcut: but Char really _could_ be ascii
17:55 < lanaer> stepcut: I've gotten my toy happstack apps to work correctly too :)
17:56 < stepcut> mae_work: no, it is always unicode code points. But sometimes it only contains the subset of unicode code points that are isomorphic to ascii
17:56 < lanaer> Char is really just a number, and pretends that encodings don't exist. the numbers it uses are the unicode code points, but do keep in mind that those are different from utf-8
17:57 < mae_work> stepcut: well, once again, Char is arbitrary, it could be any encoding that you want, but putStrLn assumes ascii
17:58 < mae_work> well actually
17:58 < mae_work> i take that back
17:58 < mae_work> putStrLn could be using latin-1
17:58 < mae_work> right?
17:58 < mae_work> err printing assuming latin-1
17:58 < lanaer> it could be
17:59  * lanaer needs to go home.
17:59 < lanaer> later guys
17:59 < mae_work> later
18:00 < stepcut> mae_work: putStrLn does not really assume anything about the encoding. It just truncates the Char to 8-bits and prints it.
18:00 < stepcut> mae_work: if you change your terminal from latin-1 to de, it does not affect putStrLn in any way.
18:01 < mae_work> gotcha
18:01 < mae_work> so uh, then to bring it full circle with our, ahem problem.
18:01 < mae_work> we just stick whatever encoding the browser sent us into Char currently
18:02 < mae_work> whether that be latin-1 or utf-8 or latin-2 etc etc
18:02 < stepcut> mae_work: now, System.IO.UTF8.putStrLn *does* assume things about the encoding. It assumes that your String is really a unicode code points, and that you want to display them on a terminal via the utf-8 encoding.
18:03 < stepcut> mae_work: the only *valid* thing to stick in a Char is a unicode code point. If the browser sends a byte sequence encoded as de, latin-1, utf-8, etc, then we should be converting it into unicode code points before putting it in a String. Having a String with anything but unicode code points in it is a recipe for pain.
18:05 < stepcut> mae_work: imagine, you don't decode it, you just convert each incoming byte into a separate Char. Now you use a database binding which does do encoding. It will take the 'encoded' bytes in the String and encode them a second time before writing them to the database. Now someone else reads that database and gets garbage, unless they decode things twice.
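
A sketch of that double-encoding failure, with a hypothetical encodeForDb standing in for a well-behaved database binding:

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.Char8 as C8
    import qualified Data.ByteString.Lazy.UTF8 as UTF8

    -- a well-behaved binding encodes the code points it is given
    encodeForDb :: String -> L.ByteString
    encodeForDb = UTF8.fromString

    main :: IO ()
    main = do
        let wire  = UTF8.fromString "é"   -- bytes off the network: [195,169]
        let wrong = C8.unpack wire        -- skipped the decode: "\195\169"
        print (L.unpack (encodeForDb wrong))
        -- [195,131,194,169]: the already-encoded bytes were encoded again,
        -- so a reader decoding once gets garbage
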
18:05 < mae_work> um
18:06 < mae_work> but again, you're assuming that when the database encodes it, it assumes unicode code points :)
18:06 < mae_work> are you saying this is a defacto sort of paradigm?
18:06 < mae_work> either ascii or utf-8?
18:07 < stepcut> yes. The haskell standard specifies that a Char is a unicode code point. The database author is therefore correct to assume that he must encode the Char before sending it to the database. The encoding does not have to be utf-8, it can be whatever the database specifies.
18:08 < stepcut> internally, the database can store things however it wants. But the string in the incoming database request must be encoded in some format. That encoding can usually be configured.
18:10 < stepcut> mae_work: I think mysql only supported latin1 until mysql 5.x
18:10 < mae_work> when you say unicode code point
18:10 < mae_work> this does not necessarily translate into utf8
18:10 < mae_work> this is my point of confusion
18:11 < mae_work> because Char could be an octet like 0 0 128 234
18:11 < stepcut> yes, a codepoint is just a number. There is no binary representation associated with a code point.
18:11 < mae_work> yeah ok this is valid "utf8" but is it not also valid utf16?
18:11 < stepcut> utf-8 can be used to encode any unicode point. The bits will be different than if you used the utf-16 encoding though.
18:12 < mae_work> ok so once again, when you say Char has to be a unicode code point, this does not necessarily equate with utf-8
18:12 < mae_work> this is where i am confused
18:13 < mae_work> a code point is just a 32 bit wide integer
18:13 < mae_work> utf-8 is an encoding of that
18:13 < stepcut> mae_work: a code point is a number. A mathematical concept. bits are not even part of the picture.
18:14 < mae_work> right
18:14 < stepcut> utf-8 is a way to present that number in binary. There are many other ways to represent the same code point (UCS-2, UCS-4, utf-32, utf-16, etc)
18:15 < mae_work> yeah i understand, so what you're saying is, "unicode code point" could mean very different things depending on which encoding you use
18:16 < mae_work> but utf-8 is the accepted / blessed way going forward?
18:16 < mae_work> is this an acceptable summation?
18:17 < stepcut> hrm, sort of the opposite. A unicode code point is the fixed thing that always means the same thing. The encodings are different ways of representing the fixed concept.
18:17 < mae_work> (by very different things, I mean in terms of what is actually stored in Char)
18:17 < mae_work> stepcut: ok i think we are violently agreeing but just not understanding one another
18:17 < mae_work> stepcut: let me narrow my focus, in your database example, any right-minded database vendor would assume that you encoded in utf-8
18:18 < mae_work> true or false?
18:18 < mae_work> s/assume/default/g
18:18 < stepcut> mae_work: well, mysql assumed/defaulted to latin1 for a very long time (and still may unless you compile with --charset=utf-8)
18:19 < mae_work> heh
18:19 < stepcut> mae_work: other vendors may not make any assumptions, and simply require you to specify it somewhere.
18:20 < stepcut> mae_work: but, in general, if everyone always assumed utf-8, that would make the world nicer I think
18:20 < mae_work> stepcut: [Char] represents a string of unicode code points, BUT may be actually stored in one of many different unicode encodings. true or false?
18:21 < stepcut> mae_work: let me check the spec for a moment
18:22 < stepcut> mae_work: according to the spec, 'The character type Char is an enumeration whose values represent Unicode (or equivalently ISO 10646) characters.', and 'To convert a Char to or from the corresponding Int value defined by Unicode, use toEnum and fromEnum from the Enum class respectively (or equivalently ord and chr).'
18:22 < stepcut> well, according to the haddock comments in Data.Char
18:23 < mae_work> : )
18:23 < stepcut> so I think that means that Char could internally be stored as any valid unicode encoding, or even an encoding of its own invention
18:23 < mae_work> so i think it actually assumes unicode codepoints, independent of any encoding
18:23 < stepcut> it just has to provide valid toEnum/fromEnum instances
18:24 < stepcut> mae_work: internally, GHC probably uses an encoding that makes toEnum/fromEnum be 'id'
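
The spec language above, made concrete with ord and chr (which specialize fromEnum and toEnum to Char):

    import Data.Char (chr, ord)

    main :: IO ()
    main = do
        print (ord 'é')                -- 233: the code point itself, no encoding involved
        print (chr 233)                -- '\233'
        print (ord (maxBound :: Char)) -- 1114111 (0x10FFFF), the top of Unicode
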
18:25 < mae_work> stepcut: so this function: http://hackage.haskell.org/packages/archive/utf8-string/0.3.4/doc/html/src/Data-ByteString-Lazy-UTF8.html#toString
18:25 < mae_work> turns it into pure unicode code points (not utf-8 encoded)
18:25 < mae_work> from my understanding
18:26 < stepcut> yes
18:26 < mae_work> which putStrLn still mishandles
18:26 < stepcut> yes
18:26 < mae_work> but at least this is in line with the haskell specs
18:26 < stepcut> yes
18:27 < mae_work> so where the heck is System.IO.NotBroken.putStrLn :)
18:27 < stepcut> System.IO.UTF8.putStrLn
18:27 < mae_work> UTF8 isn't it, because it assumes a utf8 (encoded) stream
18:28 < mae_work> utf8 != unicode code point in terms of storage
18:28 < stepcut> System.IO.UTF8.putStrLn is not broken... It does one thing and it does it well.
18:29 < stepcut> if you wish to output 'de' or 'fr' encoded strings, then you need to use iconv to convert the String to an appropriately encoded stream of bytes
18:30 < mae_work> its not broken
18:30 < stepcut> also, hugs and jhc's version of putStrLn behave differently than ghc
18:30 < mae_work> but it makes the wrong assumptions (not in line with the spec)
18:31 < stepcut> what is the wrong assumption?
18:31 < mae_work> utf8
18:31 < mle> unicode is 21bit not 32bit
18:31 < stepcut> mle: right
18:31 < stepcut> mae_work: what is wrong about that?
18:32 < mae_work> unicode point == utf-8 value does not hold true in all cases, true or false?
18:32 < stepcut> mae_work: that is a type error.
18:32 < mle> code points (the scalar values represented by a utf-8 stream) do not map directly to human-conceptual-characters.
18:33 < mle> riastradh had a rant about this the other day in #haskell... went on at some length
18:33 < stepcut> mae_work: System.IO.UTF8.putStrLn, takes a String of unicode code points, converts them to a utf-8 encoded byte-sequence, and sends the byte sequence to stdout...
18:33 < mae_work> unicodePoint :: Integer, right?
18:33 < mle> reading the full standard is necessary to understand the terminology really.
18:33 < stepcut> mle: yes, it's wacky, wacky stuff.
18:34 < mae_work> stepcut: ahh ok i think i understand now
18:35 < mae_work> stepcut: System.IO.UTF8.putStrLn takes a [Char] where each Char represents a unicode point, and then encodes into utf8 before sending to stdout
18:36 < stepcut> mae_work: unicode code points are isomorphic to the integers between 0x0 and 0x10FFFF... but they are not integers per se.
18:36 < stepcut> mae_work: yes, exactly.
18:36 < mae_work> stepcut: ok clear now
18:37 < stepcut> it's a more efficient version of, Data.ByteString.print (UTF8.encode (myString :: String))
18:37 < stepcut> or, perhaps it is exactly, putStrLn = Data.ByteString.print . UTF8.encode
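
In terms of functions that do exist in the bytestring and utf8-string packages, the equivalence stepcut is reaching for is roughly this (a sketch, not the library's actual source):

    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Lazy.UTF8 as UTF8
    import System.IO (stdout)

    -- encode the code points as UTF-8, then write the raw bytes
    utf8PutStrLn :: String -> IO ()
    utf8PutStrLn s = L.hPut stdout (UTF8.fromString (s ++ "\n"))

    main :: IO ()
    main = utf8PutStrLn "é ∀ 日本語"
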
18:38 < mae_work> stepcut: so we need to implement look to find out what encoding the browser says the value is in; if that information is not available, assume utf8; then decode accordingly into a proper [Char]
18:38 < stepcut> mae_work: yes, that would be the most correct way to do things.
18:38 < mae_work> great
18:39 < mae_work> but in the meantime, assuming utf8 is still more correct than letting the browser give us filth
18:39 < stepcut> mae_work: for a happstack developer *today*, the most sensible thing to do is to make sure that everything is utf-8.
18:40 < stepcut> right, having happstack assume utf-8 is the most sensible approach for now I think. It is exactly what most people want.
18:41 < stepcut> I can't imagine why you wouldn't want to (or at least, be able to) use utf-8 instead of some other specialized encoding (fr, de, etc)
18:41 < stepcut> there are (perhaps) reasons to in some situations, but in the web world, the developer can tell the browser what encoding to use for the data it is sending, and also what encoding the posted form data should be in.
18:42 < stepcut> also, because we export lookBS, people can implement their own solutions easily if they need something besides utf-8
18:43 < stepcut> we need to (1) use 'decode' on incoming data (in the look* functions) and (2) use encode on the outgoing data
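
A minimal sketch of step (1), building a decoding look on top of lookBS; the name lookUTF8 is illustrative, and RqData is the type quoted earlier in the conversation (a ReaderT over Maybe, hence a Functor):

    import qualified Data.ByteString.Lazy.UTF8 as UTF8
    import Happstack.Server (RqData, lookBS)

    -- like look, but decodes the raw bytes as UTF-8 instead of unpacking
    -- one byte per Char
    lookUTF8 :: String -> RqData String
    lookUTF8 name = fmap UTF8.toString (lookBS name)
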
18:44 < stepcut> HSP already encodes as utf-8 and sets charset=utf-8 in the content-type
18:44 < stepcut> not sure about Text.XHtml, etc
18:47 < mae_work> stepcut: thats mostly handled at the ToMessage level
18:47 < mae_work> right now
18:47 < mae_work> ToMessage class that is
18:49 < stepcut> yep
18:50 < mae_work> stepcut: any objection to using http://hackage.haskell.org/packages/archive/utf8-string/0.3.4/doc/html/src/Data-ByteString-Lazy-UTF8.html#toString instead of decode? The advantage seems to be that it does all the decoding directly from a bytestring (probably performs slightly better).
18:50 < stepcut> nope. In fact, that is the function I meant, I just got the names mixed up
18:55 < mae_work> stepcut: for run.sh / run.bat I have been thinking that maybe I should pin the package happstack-0.3 for the guestbook app
18:55 < mae_work> what do you think ?
18:55 < mae_work> and the same with the cabal file
19:00 < mae_work> stepcut: OK, I pushed the patch
19:04 < mae_work> stepcut: btw, while we're at it, what of lookCookie?
19:25 < mae_work> When we read the Cookies header, what is the defined encoding of this?
19:25 < mae_work> AFAIK all http headers have to be ascii
19:26  * stepcut ponders
19:27 < mae_work> i believe this is true
19:27 < mae_work> before i took over happstack i was going to create an http server from scratch
19:27 < mae_work> using happy
19:27 < mae_work> (for the parser)
19:27 < mae_work> http://www.faqs.org/rfcs/rfc2616.html
19:28  * stepcut puts the groceries away and *then* ponders
19:28 < mae_work> look in that doc at "2.2 Basic Rules"
19:29 < mae_work> since anything that is not 1-127 is "undefined" for the headers in the standard, i guess it is ok to accept whatever it gives us
19:29 < mae_work> alternatively, we could say that if a char is out of the range of 1-127 that we send a 400 Bad Request
19:32 < stepcut> I know that form submission data will sometimes contain entity references, depending on what do you/don't set in the form for enctype and friends.
19:32 < mae_work> not talking about form data
19:32 < stepcut> I am not clear that it is safe to always decode entity references though
19:32 < mae_work> talking about cookies
19:32 < stepcut> right
19:33 < stepcut> I'm just saying that form-data has issues as well
19:34 < mae_work> entity references as in &amp; ?
19:35 < stepcut> yep
19:36 < stepcut> which, aren't really part of HTTP even...
19:36 < mae_work> http://www.xml.com/axml/notes/NoExtInAttr.html
19:36 < mae_work> the only case where you might end up with this
19:36 < mae_work> is if you put it in a value attribute for an input
19:36 < mae_work> but even then, i think this is invalid
19:36 < mae_work> decoding these encourages bad behavior (in general).
19:37 < mae_work> <input type="hidden" name="foo" value="bar&&&"> is perfectly valid AFAIK
19:38 < mae_work> and textareas will end up getting their contents converted from entity refs before it hits the webserver
19:44 < mae_work> hmm
19:44 < mae_work> ok so it seems that
19:44 < mae_work> to get html 4.01 to validate
19:44 < mae_work> character entities are needed in attributes
19:45 < mae_work> <input type="hidden" name="&amp;foo" value="bar"> was valid whereas <input type="hidden" name="&foo" value="bar"> was not
19:45 < mae_work> but still, I don't believe the web browser will literally send &amp;foo as the name, it will send just &foo
19:45 < mae_work> (in the post or get request)
19:46 < mae_work> so we shouldn't have to worry about entity decoding
19:46 < mae_work> at least not in the general sense
19:47 < mae_work> but ok, back to the matter at hand, with Cookies, what to do when > 127
19:47 < mae_work> 400 Bad Request ?
19:47 < mae_work> or practice tolerance and accept whatever
19:47 < mae_work> at the risk of having funky behavior when you begin to print or access the item with the wrong encoding assumptions
19:52 < mae_work> hmm
19:53 < mae_work> gitit encodes the cookies it creates as utf-8
19:53 < mae_work> is this one of those grey areas in the spec?
19:53 < mae_work> headers can have whatever encoding?
19:53 < mae_work> : )
19:54 < mae_work> http://hc.apache.org/httpclient-3.x/cookies.html
19:54 < mae_work> useful
19:56 < mae_work> bbl
19:59 < stepcut> mae: the browser *will* send entities
20:39 < mae> stepcut: in the post parameters?
20:52 < stepcut> mae: http://www.hpaste.org/fastcgi/hpaste.fcgi/view?id=2489#a2489
20:53 < stepcut> mae: try it and see
21:48 < mae> yeah, but why would you use application/x-www-form-urlencoded
21:51 < mae> : D
21:59 < stepcut> mae: because that is the default if you don't specify an enctype
21:59 < stepcut> also, I think it happens with form-data as well
21:59 < stepcut> I think the real key is that the content type is iso-8859-1 (aka, latin 1), but there are non-latin-1 characters in the input, so it has to do something with them
22:00 < mae> hmm
22:00 < mae> well
22:00 < mae> we require utf-8 now
22:01 < stepcut> :)
--- Log closed Tue Mar 17 00:00:41 2009