Share your thoughts in the 2024 State of Clojure Survey!

Welcome! Please see the About page for a little more info on how this works.

+52 votes
in Clojure by
closed by

Android ART runs compile time verification on bytecode and fails on any usage of the locking macro. The error looks like that seen in CLJ-1829 (in that case clojure.core.server/stop-server calls the locking macro):

10-16 14:49:26.801 2008-2008/? E/AndroidRuntime: java.lang.VerifyError: Rejecting class clojure.core.server$stop_server because it failed compile-time verification (declaration of 'clojure.core.server$stop_server' appears in /data/app/com.clojure_on_android-1/base.apk)

Cause: From discussion on an Android issue (https://code.google.com/p/android/issues/detail?id=80823), it seems this is due to more strictly enforcing the "structural locking" provisions in the JVM spec (https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html#jvms-2.11.10).

It appears that the mixture of monitor-enter, monitor-exit, and the try/finally block in locking create paths that ART is flagging as not having balanced monitorenter/monitorexit bytecode. Particularly, monitorenter and monitorexit themselves can throw (on a null locking object). The Java bytecode does some tricky exception table handling to cover these cases which (afaict) is not possible to do without modifying the Clojure compiler.

Approach: One possible approach is to have locking invoke the passed body in the context of a synchronized block in Java. This avoids the issue of the tricky bytecode by handing it over to Java but has unknown performance impacts.

Patch: clj-1472-3.patch

See also: Examination of the bytecode as compared to a java synchronized block shows up a number of differences:
https://gist.github.com/AdamClements/2ae6c4919964b71eb470

Screened by: Alex Miller - I'm marking this screened as I think it is a viable approach that fixes the issue and due to the infrequency of its use, I'm not really that concerned about it being a performance problem. I will flag that I think another way to handle this would be to make locking a special form with compiler support, but I'm not sure whether that's worth doing or not, so I will leave that to Rich to decide.

closed with the note: fixed

41 Answers

0 votes
by
 
Best answer

A fix for this issue was released in 1.10.2-alpha1.

0 votes
by

Comment made by: adamclements

After using this a little more, I've found that moving this outside the try block breaks nREPL.

Looking at the bytecode, the monitorenter for the locking in clojure.tools.nrepl.middleware.session/session-out and in a few other places ends up in an entirely different method definition and we now get a JVM IllegalMonitorStateException as well as an ART verification error for this function.

0 votes
by

Comment made by: jafingerhut

Adam, I cannot comment on whether your patch is of interest or not, but it is true that no patch will be committed to Clojure if the author has not signed a Contributor Agreement, which can now be done on-line at https://clojure.org/contributing

0 votes
by

Comment made by: adamclements

Uploaded a new patch (and signed the contributor agreement). This passes both the JVM and ART bytecode verification, The extra try/catch around the monitor exit is optional (verification passes with or without it) but where the java version retries monitor-exit indefinitely and shows the deadlock at the right time, without catching errors in the monitor-exit an undetermined monitor-enter in the future might fail, not showing up the actual bug.

It's not very pretty, but without finer grained control of the generated bytecode, this is the best I could do.

0 votes
by

Comment made by: adamclements

Have just tested with Lollipop, and this patch might no longer be sufficient.

Getting in touch with the ART guys to see if they can shed a little more light and verify whether it will work on the current master branch of AOSP

0 votes
by

Comment made by: adamclements

Bug filed with AOSP project, hopefully they can shed some light on whether it is our problem and if so how we can fix it.

https://code.google.com/p/android/issues/detail?id=80823

0 votes
by

Comment made by: adamclements

I have uploaded an alternative implementation of the locking macro (0001-CLJ-1472-Locking-macro-without-explicit-monitor-ente.patch) which cheats a little - the synchronized block is actually implemented in Java and so guarantees compatibility. This is at the cost of a little extra indirection and the naming/location could probably be better.

But it does fix the bug and work on all versions of android, android + art and the jvm. Would this approach be acceptable?

0 votes
by

Comment made by: hiredman

I have yet to see any evidence that the bytecode clojure is generating in some way violates the jvm spec, so I suspect the issue is clojure requires a jvm to run, and android doesn't provide a jvm, just something that looks like one if you don't tread outside the beaten path.

0 votes
by

Comment made by: hiredman

given the structured locking verbiage in https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html#jvms-2.11.10, (locking nil) may generate bytecode whose runtime behavior violates structured locking. the first patch on this issue can cause the compiler to emit monitorenter/exit instructions in different methods, which definitely violates structured locking

0 votes
by

Comment made by: adamclements

Yes, the first patch was definitely wrong, I left it for some context to the conversation, but it's probably best to just remove it for clarity.

For anyone following this conversation who doesn't want to decompile and observe the bytecode, here's a gist with the difference between a java synchronized block and clojure locking https://gist.github.com/AdamClements/2ae6c4919964b71eb470

I'm finding it hard to work out where the deviation from the spec occurs too, though I can see the difference with the Java version, if anything, the Clojure version looks closer to what's described in the spec than the Java one!

If someone with more knowledge than me on the subject could engage on the AOSP bug https://code.google.com/p/android/issues/detail?id=80823 then perhaps we could settle this as an android bug which is too focussed on the java implementation rather than the JVM spec, or perhaps they'll find something that's wrong with the Clojure implementation. I have uploaded the original clojure behaviour and asked them for some more explanation on why it is failing.

0 votes
by

Comment made by: adamclements

The response from the ART guys about what they think we're violating is:

The section on "Structured locking" contains the following:

"(link: ...) implementations (link: ...) are permitted but not required to enforce
both of the following two rules guaranteeing structured locking. (link: ...)"

ART currently enforces both rules at verification time, including

"At no point during a method invocation may the number of monitor exits
performed by T on M since the method invocation exceed the number of
monitor entries performed by T on M since the method invocation."

0 votes
by

Comment made by: adamclements

If for example instruction https://gist.github.com/AdamClements/2ae6c4919964b71eb470#file-test_locks-class-L24 or the monitor-enter itself on the next line were to fail, couldn't it could end up in the finally clause and attempt to release the lock even though it has never been captured?

I think this violates the structured locking rules in the jvm spec you linked to.

0 votes
by

Comment made by: hiredman

an interesting question about structured locking, is does the specification refer to the static bytecode or the runtime behavior of the bytecode. given the bytecode linked(https://gist.github.com/AdamClements/2ae6c4919964b71eb470#file-test_locks-class-L24), the static bytecode has the same number of entries and exits, but the dynamic behavior may be different. I wonder which one the art guys claim to be enforcing at verification time (it seems like it would have to be the static bytecode, not the dynamic properties, but then they shouldn't be failing to verify this). looking at the google code issue, the comment https://code.google.com/p/android/issues/detail?id=80823#c6 was made by the same dev as https://code.google.com/p/android/issues/detail?id=80823#c3, so I sort of suspect there is some miscommunication going on. It is not clear in what context the dev is replying in, since in the previous comment you mention splitting monitor-enter and exit across methods. I think things would be much clearer if all patches, specialized clojure android builds, etc, were gotten rid of, then with a vanilla clojure jar you get a javap dump of what fails to verify, then just take that over to the android issue tracker and ask "hey, this fails to verify, why?"

0 votes
by

Comment made by: adamclements

Yeah, I shouldn't have confused it with the patched versions. The gist and the currently uploaded version use the vanilla clojure version of the locking macro now though.

I think the issue comes from the exception table and the instructions that covers. If line 24 can throw for example, you would end up at runtime with a monitor-exit, having never encountered a monitor-enter.

0 votes
by

Comment made by: gshayban

According to Marcus Lagergren JRockit and Hotspot both account for lock releases being in a different method... Perhaps Android has a different (wrong) interpretation?

...