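Recently some production users reported that the app occasionally crashes right after launch. Oddly, when we pulled the app logs from the affected devices we found no exception stack trace at all, which made the problem hard to locate. Without a stack trace, all we could do was look for patterns. Across a large number of logs we noticed that, shortly before the process died, there were always log entries related to operations on one particular ContentProvider component. With some guesswork and experimentation we finally reproduced the crash locally and captured the key system log, shown below: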
2021-03-30 16:28:35.661 1091-1459/? I/ActivityManager: Killing 20972:com.test.demo1/u0a71 (adj 100): depends on provider com.test.demo2/.provider.SharedProvider in dying proc com.test.demo2 (adj 0)
2021-03-30 16:28:35.668 1091-1459/? I/ActivityManager: Killing 22561:com.test.demo2/u0a1222 (adj 0): timeout publishing content providers
2021-03-30 16:28:35.674 1091-1459/? D/ActivityManager: proc ProcessRecord{2522dce 22561:com.test.demo2/u0a1222} already removed. so we skip next process.
2021-03-30 16:28:35.676 1091-3793/? E/ActivityManager: Timeout waiting for provider com.test.demo2/11222 for provider com.test.demo2.SharedProviderAuthority providerRunning=false caller=com.test.demo1/10071
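When its process starts, com.test.demo1 queries a ContentProvider implemented by com.test.demo2. The component is named SharedProvider and its authority is com.test.demo2.SharedProviderAuthority. com.test.demo1 is the app that crashes. The log tells us the following: com.test.demo2 (pid 22561) was killed with the reason "timeout publishing content providers", and com.test.demo1 (pid 20972) was killed with the reason "depends on provider com.test.demo2/.provider.SharedProvider in dying proc com.test.demo2".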
In other words, the timeout while com.test.demo2 was publishing its ContentProvider got not only com.test.demo2 itself killed, but also its caller, com.test.demo1.
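Since the only evidence is in the system log, it can help to scan logcat output for these kill records automatically. The following is a minimal sketch in plain Java, assuming the line format seen in the log excerpt above; KillLogParser and KillRecord are hypothetical names, not Android classes, and the regular expression is only known to fit the lines shown here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Android): extracts process-kill records from logcat lines. */
final class KillLogParser {

    /** One "Killing <pid>:<process>/<uid> (adj <n>): <reason>" record. */
    static final class KillRecord {
        final int pid;
        final String processName;
        final String reason;

        KillRecord(int pid, String processName, String reason) {
            this.pid = pid;
            this.processName = processName;
            this.reason = reason;
        }
    }

    // Assumption: this matches the ActivityManager line format shown in the log excerpt.
    private static final Pattern KILLING = Pattern.compile(
            "ActivityManager: Killing (\\d+):([^/\\s]+)/\\S+ \\(adj -?\\d+\\): (.*)$");

    /** Returns the kill record contained in the line, or empty if the line is not a kill record. */
    static Optional<KillRecord> parse(String line) {
        Matcher m = KILLING.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(
                new KillRecord(Integer.parseInt(m.group(1)), m.group(2), m.group(3)));
    }

    /** True when the process was killed only because a provider it depended on was dying. */
    static boolean isCollateralKill(KillRecord record) {
        return record.reason.startsWith("depends on provider ");
    }
}
```

Feeding the four log lines above through parse() yields two records; isCollateralKill() is true only for the com.test.demo1 record, which is exactly the crash that leaves no stack trace in the app's own log.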
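This goes against what we normally expect. In general the calling process and the called process are independent of each other, and a crash or similar failure in the called process should not affect the caller's logic. That is also how com.test.demo1 was designed: first try to read the data from com.test.demo2's SharedProvider; if data comes back, display it; if not, display default data. Whether the demo2 process is alive or has crashed, we do not want it to affect the calling demo1 process.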
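When analyzing this kind of problem, we can use the key log lines to locate the code that triggers it, and then work upward from that code layer by layer, which usually gets results with far less effort. Clearly "depends on provider" is the most important phrase in the log. Searching the Android Framework source for it leads to the following code, which lives in ActivityManagerService.java: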
private final boolean removeDyingProviderLocked(ProcessRecord proc,
        ContentProviderRecord cpr, boolean always) {
    ...
    for (int i = cpr.connections.size() - 1; i >= 0; i--) {
        ...
        ProcessRecord capp = conn.client;
        conn.dead = true;
        // The key is the condition conn.stableCount > 0
        if (conn.stableCount > 0) {
            // Third-party app processes are almost never persistent, so they
            // normally satisfy this condition and reach the kill logic
            if (!capp.isPersistent() && capp.thread != null
                    && capp.pid != 0
                    && capp.pid != MY_PID) {
                capp.kill("depends on provider "
                        + cpr.name.flattenToShortString()
                        + " in dying proc " + (proc != null ? proc.processName : "??")
                        + " (adj " + (proc != null ? proc.setAdj : "??") + ")",
                        ApplicationExitInfo.REASON_DEPENDENCY_DIED,
                        ApplicationExitInfo.SUBREASON_UNKNOWN,
                        true);
            }
            ...
        }
        ...
    }
    ...
}
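Judging by its name, removeDyingProviderLocked is what AMS uses to remove the Provider records of a dying process, and while removing them it decides, based on a few conditions, whether to kill the callers too. From here the analysis splits in two directions: when removeDyingProviderLocked(ProcessRecord proc, ContentProviderRecord cpr, boolean always) gets called, and under what conditions conn.stableCount is greater than 0.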
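To make the kill condition concrete, here is a simplified model in plain Java. It is only a sketch of the check shown in the excerpt above: DyingProviderModel and Connection are hypothetical stand-ins, not the real AMS classes, and the model ignores the thread, pid, and MY_PID checks.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the client-kill check; not the real AMS code. */
final class DyingProviderModel {

    /** Stand-in for one client's connection to a provider. */
    static final class Connection {
        final String clientProcess;
        final boolean clientPersistent;
        final int stableCount;
        final int unstableCount;

        Connection(String clientProcess, boolean clientPersistent,
                int stableCount, int unstableCount) {
            this.clientProcess = clientProcess;
            this.clientPersistent = clientPersistent;
            this.stableCount = stableCount;
            this.unstableCount = unstableCount;
        }
    }

    /** Returns the client processes that would be killed when the provider's host process dies. */
    static List<String> clientsToKill(List<Connection> connections) {
        List<String> killed = new ArrayList<>();
        for (int i = connections.size() - 1; i >= 0; i--) {
            Connection conn = connections.get(i);
            // As in the excerpt: only a stable reference ties the client's fate
            // to the provider, and persistent processes are spared.
            if (conn.stableCount > 0 && !conn.clientPersistent) {
                killed.add(conn.clientProcess);
            }
        }
        return killed;
    }
}
```

In this model a client that holds only unstable references survives the provider's death, which is the distinction the rest of the analysis turns on.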
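Working out when removeDyingProviderLocked(ProcessRecord proc, ContentProviderRecord cpr, boolean always) is called really means working out how a ContentProvider is published and queried, that is, how the provider process and the calling process each interact with system_server. There is a lot to cover here; readers unfamiliar with how ContentProvider works can read this article: 理解ContentProvider原理 (Understanding How ContentProvider Works).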
We borrow the overview diagram from that article to continue the analysis.
The key to the demo1 crash lies in the interaction between system_server and the provider process.
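AMS first calls getContentProviderImpl() to try to obtain the target provider. If the ContentProvider has not been published yet (that is, its host process has not started), AMS calls startProcessLocked() to start the server process, which in our example is com.test.demo2.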
private ContentProviderHolder getContentProviderImpl(IApplicationThread caller,
        String name, IBinder token, int callingUid, String callingPackage, String callingTag,
        boolean stable, int userId) {
    ...
    // If the provider is not already being launched, then get it
    // started.
    if (i >= N) {
        final long origId = Binder.clearCallingIdentity();
        try {
            ...
            if (proc != null && proc.thread != null && !proc.killed) {
                if (DEBUG_PROVIDER) Slog.d(TAG_PROVIDER,
                        "Installing in existing process " + proc);
                if (!proc.pubProviders.containsKey(cpi.name)) {
                    checkTime(startTime, "getContentProviderImpl: scheduling install");
                    proc.pubProviders.put(cpi.name, cpr);
                    try {
                        proc.thread.scheduleInstallProvider(cpi);
                    } catch (RemoteException e) {
                    }
                }
            } else {
                checkTime(startTime, "getContentProviderImpl: before start process");
                proc = startProcessLocked(cpi.processName,
                        cpr.appInfo, false, 0,
                        new HostingRecord("content provider",
                                new ComponentName(cpi.applicationInfo.packageName, cpi.name)),
                        ZYGOTE_POLICY_FLAG_EMPTY, false, false, false);
                checkTime(startTime, "getContentProviderImpl: after start process");
                if (proc == null) {
                    Slog.w(TAG, "Unable to launch app "
                            + cpi.applicationInfo.packageName + "/"
                            + cpi.applicationInfo.uid + " for provider "
                            + name + ": process is bad");
                    return null;
                }
            }
            cpr.launchingApp = proc;
            mLaunchingProviders.add(cpr);
        } finally {
            Binder.restoreCallingIdentity(origId);
        }
    }
    ...
}
When the server process (com.test.demo2) starts, it attaches to AMS, which runs attachApplicationLocked(@NonNull IApplicationThread thread, int pid, int callingUid, long startSeq). The key code is:
static final int CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG = 57;

private boolean attachApplicationLocked(@NonNull IApplicationThread thread,
        int pid, int callingUid, long startSeq) {
    // ...
    if (providers != null && checkAppInLaunchingProvidersLocked(app)) {
        Message msg = mHandler.obtainMessage(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG);
        msg.obj = app;
        mHandler.sendMessageDelayed(msg,
                ContentResolver.CONTENT_PROVIDER_PUBLISH_TIMEOUT_MILLIS);
    }
    // ...
}
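When the com.test.demo2 process attaches, AMS checks whether the process has ContentProviders to publish (those declared in its AndroidManifest.xml) and, if so, posts a delayed message to its Handler. The message is handled as follows: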
case CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG: {
    ProcessRecord app = (ProcessRecord)msg.obj;
    synchronized (ActivityManagerService.this) {
        processContentProviderPublishTimedOutLocked(app);
    }
} break;

private final void processContentProviderPublishTimedOutLocked(ProcessRecord app) {
    cleanupAppInLaunchingProvidersLocked(app, true);
    mProcessList.removeProcessLocked(app, false, true,
            ApplicationExitInfo.REASON_INITIALIZATION_FAILURE,
            ApplicationExitInfo.SUBREASON_UNKNOWN,
            "timeout publishing content providers");
}

final boolean cleanUpApplicationRecordLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart, int index, boolean replacingPid) {
    ...
    // Remove published content providers.
    for (int i = app.pubProviders.size() - 1; i >= 0; i--) {
        ContentProviderRecord cpr = app.pubProviders.valueAt(i);
        if (cpr.proc != app) {
            // If the hosting process record isn't really us, bail out
            continue;
        }
        final boolean alwaysRemove = app.bad || !allowRestart;
        final boolean inLaunching = removeDyingProviderLocked(app, cpr, alwaysRemove);
        ...
    }
    ...
}
AMS$MainHandler.handleMessage()
-> AMS.processContentProviderPublishTimedOutLocked()
-> ProcessList.removeProcessLocked()
-> AMS.cleanUpApplicationRecordLocked()
-> AMS.removeDyingProviderLocked()
After this chain of calls, execution finally reaches AMS.removeDyingProviderLocked().
Searching the codebase for CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG also turns up calls to removeMessages(), in two places:
final boolean cleanUpApplicationRecordLocked(ProcessRecord app,
        boolean restarting, boolean allowRestart, int index, boolean replacingPid) {
    ...
    if (restart && allowRestart && !app.isolated) {
        // We have components that still need to be running in the
        // process, so re-launch it.
        if (index < 0) {
            ProcessList.remove(app.pid);
        }
        // Remove provider publish timeout because we will start a new timeout when the
        // restarted process is attaching (if the process contains launching providers).
        mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, app);
        mProcessList.addProcessNameLocked(app);
        app.pendingStart = false;
        mProcessList.startProcessLocked(app,
                new HostingRecord("restart", app.processName),
                ZYGOTE_POLICY_FLAG_EMPTY);
        return true;
    }
    ...
}

public final void publishContentProviders(IApplicationThread caller,
        List<ContentProviderHolder> providers) {
    ...
    if (wasInLaunchingProviders) {
        mHandler.removeMessages(CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG, r);
    }
    ...
}
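At this point the path to removeDyingProviderLocked(ProcessRecord proc, ContentProviderRecord cpr, boolean always) is clear. When system_server starts a process that has ContentProviders to publish, it posts a timeout message with a 10 s delay. If the target process publishes its ContentProviders within those ten seconds, system_server removes the message. If it does not, system_server handles the message, which eventually leads to removeDyingProviderLocked().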
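The arm-then-cancel pattern described here can be sketched as a small deadline table. This is an illustrative model only, not the AMS implementation: PublishWatchdog and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the provider publish timeout; the real logic lives in AMS. */
final class PublishWatchdog {
    // The 10 s window described in the text.
    static final long TIMEOUT_MILLIS = 10_000;

    // Process name -> deadline in milliseconds.
    private final Map<String, Long> deadlines = new HashMap<>();

    /** A process with pending providers attached: arm the timeout (like sendMessageDelayed). */
    void onProcessAttached(String process, long nowMillis) {
        deadlines.put(process, nowMillis + TIMEOUT_MILLIS);
    }

    /** The process published its providers in time: cancel the timeout (like removeMessages). */
    void onProvidersPublished(String process) {
        deadlines.remove(process);
    }

    /** Returns and forgets every process whose deadline has passed. */
    List<String> collectTimedOut(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis >= entry.getValue()) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }
}
```

A process that publishes at the 3 s mark never shows up in collectTimedOut(), while one that is still unpublished at the 10 s mark does; in the real system that second case is where the kill path toward removeDyingProviderLocked() begins.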
However, reaching removeDyingProviderLocked() does not necessarily get the calling process killed; the condition conn.stableCount > 0 must also hold. So next we look at how conn.stableCount is assigned.
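The value of conn.stableCount is governed by ContentProvider's reference counting; for a detailed analysis see: ContentProvider引用计数 (ContentProvider Reference Counting). The key is the table below.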
Now look at how com.test.demo1 calls the ContentProvider.
com.test.demo1 operates on com.test.demo2's SharedProvider through ContentResolver's call() method, which is implemented as follows:
public final @Nullable Bundle call(@NonNull String authority, @NonNull String method,
        @Nullable String arg, @Nullable Bundle extras) {
    Preconditions.checkNotNull(authority, "authority");
    Preconditions.checkNotNull(method, "method");

    try {
        if (mWrapped != null) return mWrapped.call(authority, method, arg, extras);
    } catch (RemoteException e) {
        return null;
    }

    // Key point: stableCount + 1
    IContentProvider provider = acquireProvider(authority);
    if (provider == null) {
        // provider is null, so throw an exception
        throw new IllegalArgumentException("Unknown authority " + authority);
    }
    try {
        final Bundle res = provider.call(mPackageName, authority, method, arg, extras);
        Bundle.setDefusable(res, true);
        return res;
    } catch (RemoteException e) {
        // Arbitrary and not worth documenting, as Activity
        // Manager will kill this process shortly anyway.
        return null;
    } finally {
        releaseProvider(provider);
    }
}
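Clearly, once the client calls call(), acquireProvider() takes a stable reference, so stableCount has been incremented but not yet decremented while the server's ContentProvider is still unpublished. If the server then takes more than ten seconds to publish the Provider component, system_server kills the client as well.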
But if we replace call() with the more common query(), the problem does not occur. Why? Let's look at the implementation of query():
public final @Nullable Cursor query(final @RequiresPermission.Read @NonNull Uri uri,
        @Nullable String[] projection, @Nullable Bundle queryArgs,
        @Nullable CancellationSignal cancellationSignal) {
    Preconditions.checkNotNull(uri, "uri");

    try {
        if (mWrapped != null) {
            return mWrapped.query(uri, projection, queryArgs, cancellationSignal);
        }
    } catch (RemoteException e) {
        return null;
    }

    IContentProvider unstableProvider = acquireUnstableProvider(uri);
    if (unstableProvider == null) {
        return null;
    }
    IContentProvider stableProvider = null;
    Cursor qCursor = null;
    try {
        long startTime = SystemClock.uptimeMillis();

        ICancellationSignal remoteCancellationSignal = null;
        if (cancellationSignal != null) {
            cancellationSignal.throwIfCanceled();
            remoteCancellationSignal = unstableProvider.createCancellationSignal();
            cancellationSignal.setRemote(remoteCancellationSignal);
        }
        try {
            qCursor = unstableProvider.query(mPackageName, uri, projection,
                    queryArgs, remoteCancellationSignal);
        } catch (DeadObjectException e) {
            // The remote process has died... but we only hold an unstable
            // reference though, so we might recover!!! Let's try!!!!
            // This is exciting!!1!!1!!!!1
            unstableProviderDied(unstableProvider);
            stableProvider = acquireProvider(uri);
            if (stableProvider == null) {
                return null;
            }
            qCursor = stableProvider.query(
                    mPackageName, uri, projection, queryArgs, remoteCancellationSignal);
        }
        if (qCursor == null) {
            return null;
        }

        // Force query execution. Might fail and throw a runtime exception here.
        qCursor.getCount();
        long durationMillis = SystemClock.uptimeMillis() - startTime;
        maybeLogQueryToEventLog(durationMillis, uri, projection, queryArgs);

        // Wrap the cursor object into CursorWrapperInner object.
        final IContentProvider provider = (stableProvider != null) ? stableProvider
                : acquireProvider(uri);
        final CursorWrapperInner wrapper = new CursorWrapperInner(qCursor, provider);
        stableProvider = null;
        qCursor = null;
        return wrapper;
    } catch (RemoteException e) {
        // Arbitrary and not worth documenting, as Activity
        // Manager will kill this process shortly anyway.
        return null;
    } finally {
        if (qCursor != null) {
            qCursor.close();
        }
        if (cancellationSignal != null) {
            cancellationSignal.setRemote(null);
        }
        if (unstableProvider != null) {
            releaseUnstableProvider(unstableProvider);
        }
        if (stableProvider != null) {
            releaseProvider(stableProvider);
        }
    }
}
The reason is plain from the code: query() calls acquireUnstableProvider(), which does not increase stableCount, so even if the server takes more than 10 s to publish the Provider, the client is not killed.
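The difference between the two acquisition paths can be reduced to a pair of counters. The sketch below is a hypothetical model, assuming only what the excerpts show (a stable and an unstable count per connection, and a kill check on stableCount > 0); ProviderRefModel is not an Android class.

```java
/** Hypothetical reference-count model for one client-to-provider connection. */
final class ProviderRefModel {
    private int stableCount;
    private int unstableCount;

    /** What call() does through acquireProvider(). */
    void acquireStable() {
        stableCount++;
    }

    /** What query() does first, through acquireUnstableProvider(). */
    void acquireUnstable() {
        unstableCount++;
    }

    /** Counterpart of releaseProvider(). */
    void releaseStable() {
        if (stableCount == 0) {
            throw new IllegalStateException("no stable reference held");
        }
        stableCount--;
    }

    /** Counterpart of releaseUnstableProvider(). */
    void releaseUnstable() {
        if (unstableCount == 0) {
            throw new IllegalStateException("no unstable reference held");
        }
        unstableCount--;
    }

    /** Whether the client would be killed if the provider's process died right now. */
    boolean clientDiesWithProvider() {
        return stableCount > 0;
    }
}
```

While only acquireUnstable() has been called, clientDiesWithProvider() stays false; between acquireStable() and releaseStable() it is true, which is the window in which a publish timeout takes the caller down too.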
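We have now found the cause of the crashes seen by production users. To summarize: the demo1 process queried demo2's ContentProvider through ContentResolver.call(); the demo2 process started slowly and had not published the ContentProvider after ten seconds; AMS therefore killed the demo2 process and took the demo1 process down with it.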
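Based on how the methods in ContentResolver are implemented, I have listed the ones that can get the calling process killed: acquireProvider(), getStreamTypes(), canonicalize(), uncanonicalize(), refresh(), insert(), bulkInsert(), delete(), update(), call(), and acquireContentProviderClient(). (Some of these are not public, but I have listed them anyway.)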
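With the cause identified, how do we fix it? Option 1: stop using ContentResolver.call() and use query() directly. This is crude but simple, and it does meet the needs of the current business scenario. Still, it treats the symptom rather than the cause: what if we must use call() some day? Besides, call() is not the only method with this problem; as listed in 3.1, update() and others have it too.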
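Think again about where the problem comes from. One factor is that the demo1 process used call() to start the demo2 process; the other is that the demo2 process started too slowly. We can only change the first. The startup speed of the demo2 process is out of our hands, and even demo2 itself can hardly control it, because process startup speed depends heavily on the state of the device at that moment.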
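Since starting the demo2 process with call() can cause a crash, could we start it with query() first, check the result, and only invoke call() if the returned Cursor is not null? That would remove the crash risk while still letting us call any method. That is roughly the idea, although calling query() just so that we can call call() afterwards feels redundant. There is a more elegant way: acquireUnstableContentProviderClient(). It returns a ContentProviderClient object, and by checking whether that object is null we can decide whether to go on and invoke call().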
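We took this production bug as an opportunity to go through the ContentProvider logic carefully and to audit how the various business teams in the app use ContentProvider, so that similar problems do not recur. The audit turned up all kinds of usage: code that operates on the returned Cursor without checking it for null, code with no try/catch protection, and code that directly calls the methods listed in 3.1 that can get the caller killed. Nothing had gone wrong before because ContentProvider is used sparingly and a server process taking more than ten seconds to start is rare; unless the Provider is used at large scale, the problem is hard to notice.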
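Everything above concerns pitfalls on the calling (client) side. On the provider (server) side there is one pitfall to watch for: ContentProvider.onCreate() is called before Application.onCreate(), while an app's base components are usually initialized in Application.onCreate(). So never use base components in ContentProvider.onCreate(), and preferably not in query() or the other methods either. Moreover, if a crash happens inside ContentProvider.onCreate(), even a hot fix cannot repair it, because the hot-fix component has not been initialized yet.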
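The ordering hazard can be illustrated with a small plain-Java simulation. It only mimics the callback order described above; StartupOrderDemo and its members are hypothetical and none of them are Android classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the start-up order; these are not Android classes. */
final class StartupOrderDemo {
    private static final List<String> events = new ArrayList<>();

    // Stands in for a base component that is initialized in Application.onCreate().
    private static Object baseComponent;

    /** Simulated ContentProvider.onCreate(): the base component is not ready yet. */
    static void providerOnCreate() {
        events.add("Provider.onCreate baseComponentReady=" + (baseComponent != null));
    }

    /** Simulated Application.onCreate(): this is where base components get initialized. */
    static void applicationOnCreate() {
        baseComponent = new Object();
        events.add("Application.onCreate");
    }

    /** Runs the callbacks in the order described in the text and returns the event log. */
    static List<String> simulateProcessStart() {
        events.clear();
        baseComponent = null;
        providerOnCreate();
        applicationOnCreate();
        return new ArrayList<>(events);
    }
}
```

In this simulation the provider callback always observes an uninitialized base component, which is why touching such components from ContentProvider.onCreate() tends to fail on a cold start.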
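To avoid unexpected crashes caused by ContentProvider, it has to be used in a disciplined way that accounts for the failure cases that can arise. From the analysis in 3.3 we know that code operating on a ContentProvider needs at least a null check plus try/catch protection, and that before calling any of the stable methods it should first use acquireUnstableContentProviderClient() to try to bring up the process hosting the ContentProvider. The code looks like this: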
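private void queryProvider() {
    ContentProviderClient targetProviderClient = null;
    try {
        ContentResolver contentResolver = getContentResolver();
        String targetProviderAuthority = "com.test.demo2.SharedProviderAuthority";
        // Bring up the provider's process through an unstable reference first
        targetProviderClient =
                contentResolver.acquireUnstableContentProviderClient(targetProviderAuthority);
        if (targetProviderClient == null) {
            Log.e(TAG, "targetProviderClient is null, return");
            return;
        }
        Bundle bundle = contentResolver.call(targetProviderAuthority, "xxx", null, null);
        if (bundle == null) {
            Log.e(TAG, "bundle is null, return");
            return;
        }
        // Business logic goes here
    } catch (Exception e) {
        // e.getMessage() may be null, so log the exception itself
        Log.e(TAG, "queryProvider failed", e);
    } finally {
        if (targetProviderClient != null) {
            // Release the unstable reference once we are done
            targetProviderClient.close();
        }
    }
}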
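Finally, one small question remains: why was there no exception stack trace in our process's logs? The reason is simple: our process never threw an exception. It was killed only because the ContentProvider component it was calling timed out while loading.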