in test.check by

Currently the default char generator only covers the range 0 to 255. Java chars can range from \u0000 to \uFFFF. If this is something of interest, I will add a patch as I need to do this anyway.
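To make the request concrete, here is a minimal sketch of a generator covering the whole Java char range. The name `char-full-range` is hypothetical and not part of test.check; it assumes a uniform pick over all 16-bit values is acceptable, which means surrogates and unassigned values are included.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical name: `char-full-range` is not part of test.check.
;; Picks a 16-bit value uniformly from 0x0000 to 0xFFFF (inclusive)
;; and converts it to a Java char. Surrogates are NOT filtered out.
(def char-full-range
  (gen/fmap char (gen/choose 0 0xFFFF)))

;; Draw a few sample values.
(gen/sample char-full-range 5)
```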

6 Answers

_Comment made by: gfredericks_

There's definitely a need for this, but I don't think the details of a solution are obvious -- in particular what distribution such a generator ought to have.

My hazy understanding of Unicode is that a great many (the majority, I think) of the code points are not assigned to any particular character, so if you picked code points at random you would mostly get unprintable stuff.
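One way to check this on a given JVM is to count how many code points `java.lang.Character/isDefined` reports as defined. This is only a rough sketch: the helper names are hypothetical, the result depends on the Unicode version shipped with the JVM, and "defined" is not the same thing as "printable".

```clojure
;; Hypothetical helpers, using only java.lang.Character.
;; Counts the code points in [lo, hi] that the running JVM treats as defined.
(defn defined-count
  [lo hi]
  (count (filter #(Character/isDefined (int %)) (range lo (inc hi)))))

;; Fraction of code points in [lo, hi] that are defined.
(defn defined-ratio
  [lo hi]
  (/ (double (defined-count lo hi)) (inc (- hi lo))))

(defined-ratio 0 0xFFFF)    ; Basic Multilingual Plane only
(defined-ratio 0 0x10FFFF)  ; every Unicode code point
```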

I spent a while on this issue when I implemented [string-from-regex](https://github.com/gfredericks/test.chuck#string-from-regex) in test.chuck (using a uniform distribution), and you can see the results by doing `(gen/sample (com.gfredericks.test.chuck.generators/string-from-regex #".*"))`:


("" "

Comment made by: m0smith

You make some great points. I will also review the Java Character class as it seems to have some Unicode information encoded that could be put to good use.


Comment made by: m0smith

;;
;; Unicode support for test.check
;;
;; Unicode support is divided into two sections: char based and code-point/int based.
;;
;; Ranges and choices
;;   Ranges are a vector of range defs.
;;   A range def is either
;;     - a single character, or
;;     - a pair (vector) of the start and end of a range.
;;
;; choices is a generator that chooses from a vector of ranges. For example,
;;   (choices [1 2 [100 200]])
;; would return 1, 2, and the numbers from 100 to 200. The members of the range pair, 100 and 200 in this
;; example, can be anything accepted by choose.
;;
;;
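The `choices` generator described above could be sketched roughly as follows. This is not an existing test.check function; the sketch assumes a range def is either a single integer value or a `[start end]` vector with inclusive bounds, and it builds on the existing `gen/one-of`, `gen/choose`, and `gen/return`.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Sketch of the proposed `choices`; NOT part of test.check.
;; Each range def becomes its own generator, and gen/one-of picks
;; among those generators with equal probability.
(defn choices
  [ranges]
  (gen/one-of
   (mapv (fn [r]
           (if (vector? r)
             (let [[start end] r]
               (gen/choose start end)) ; inclusive on both ends
             (gen/return r)))          ; single value
         ranges)))

(gen/sample (choices [1 2 [100 200]]) 10)
```

Because `gen/one-of` chooses among the range defs with equal probability, each range is equally likely no matter how many values it contains: in this example 1, 2, and the whole block 100 to 200 each get about a third of the draws.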
;; The char based Unicode support mirrors the normal char and string generators
;;
| Standard Generator | Unicode Generator | Generates |
| :-- | :-- | :-- |
| char | uchar | valid Unicode characters (char) from \u0000 to \uFFFF. |
| char-ascii | uchar-alpha | letter Unicode characters. |
| | uchar-numeric | digit Unicode characters |
| char-alphanumeric | uchar-alphanumeric | letter and digit Unicode characters |
| string | ustring | Unicode strings consisting of only chars |
| string-alphanumeric | ustring-alphanumeric | Unicode alphanumeric strings. |
| | ustring-choices | Unicode strings in the given ranges. |
| namespace | unamespace | Unicode strings suitable for use as a Clojure namespace |
| keyword | ukeyword | Unicode strings suitable for use as a Clojure keyword |
| keyword-ns | ukeyword-ns | Unicode strings suitable for use as a Clojure keyword with optional namespace |
| symbol | usymbol | Unicode strings suitable for use as a Clojure symbol |
| symbol-ns | usymbol-ns | Unicode strings suitable for use as a Clojure symbol with optional namespace |

;; Code-point or int based characters

| Standard Generator | Unicode Generator | Description |
| :-- | :-- | :-- |
| string | ustring-from-code-point | Generates Unicode strings consisting of any valid code point. |
| char | code-point | Generates a valid Unicode code point. |
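The two code-point based generators might look something like the sketch below. Neither exists in test.check. The sketch assumes a uniform distribution over 0 to 0x10FFFF and, as a further assumption, skips the surrogate block 0xD800 to 0xDFFF, since a lone surrogate does not encode a character on its own.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Sketches of the proposed generators; NOT part of test.check.

;; An int code point in 0..0x10FFFF, rejecting surrogates (an assumption).
(def code-point
  (gen/such-that #(not (<= 0xD800 % 0xDFFF))
                 (gen/choose 0 0x10FFFF)))

;; A string built from a vector of code points. Code points above
;; 0xFFFF are encoded as a surrogate pair by StringBuilder.
(def ustring-from-code-point
  (gen/fmap (fn [cps]
              (let [sb (StringBuilder.)]
                (doseq [cp cps]
                  (.appendCodePoint sb (int cp)))
                (str sb)))
            (gen/vector code-point)))

(gen/sample code-point 5)
(gen/sample ustring-from-code-point 5)
```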


Comment made by: gfredericks

Are you thinking that these generators will generally have uniform distributions, and that the problem of mostly-unprintable-values is not a big enough problem to do anything special about?

Should the second group of generators include analogs for keyword, symbol, etc. as well?

I think anything that involves dozens of new generators I'll be inclined to put in a separate namespace.


Comment made by: m0smith

I listed all the new generators I want to build. Basically, I want each of the normal string-based generators to have a Unicode counterpart that behaves similarly. For example, keyword and symbol get ukeyword and usymbol for Unicode keywords and symbols.

Adding the apply-to from TCHECK-99 will make it easier for people to create Unicode string generators.

I expect the Unicode versions of the functions to have a very similar distribution to the current versions. The exception is the ones based on "choices", which distribute evenly across the ranges, regardless of the size of each range.

Reference: https://clojure.atlassian.net/browse/TCHECK-97 (reported by m0smith)
...