從源碼角度看traces.txt是如何生成的

NO IMAGE

簡介

原文請看我的博客:從源碼角度看traces.txt是如何生成的

traces.txt 位於安卓系統下/data/anr目錄下,當系統中有應用出現ANR時,framework會在彈出Dialog的同時dump出當前各線程的堆棧情況,方便開發者分析出ANR的root cause。

ANR是Application Not Responsing 的簡稱,簡而言之,就是安卓系統內置提示用戶應用界面沒有反應的機制,是用來避免應用界面一直卡頓,增加系統用戶友好度的一種方式。

造成ANR的原因很多,都是因為在主線程執行任務太久阻塞了界面更新導致的,主要有以下幾類:

  1. Broadcast Timeout: 前臺廣播執行超過10s, 後臺廣播執行超過60s (要注意的是,只有串行的廣播才有超時機制,並行的廣播並不會超時,也就是說,如果廣播是動態註冊的,直接調用sendBroadcast觸發,如果主線程Looper中排在後面的Message不會觸發超時機制,那麼即時這個廣播是前臺廣播,系統也永遠不會彈出框提示用戶超時了)
  2. Service Timeout: 前臺服務之星超過20s, 後臺服務之星超過200s
  3. Provider Timeout: 內容提供者,publish超過10s
  4. Input Timeout: 按鍵觸摸事件分發超過5s

能夠造成ANR的前提是任務是在主線程上執行的,執行什麼樣的任務主要有以下幾點:

  1. 執行耗時任務過久,如文件讀取或存儲,網絡訪問獲取文件太耗時
  2. 線程被阻塞過久,或者乾脆出現了死鎖
  3. 線程飢餓,如Binder線程總共16個,Binder主線程佔有一個,剩餘的15個工作線程都被佔滿
  4. CPU飢餓,負載值過大,雖然代碼正常額但任務一直沒有來得及執行

那麼回到traces.txt文件,它到底包含了什麼信息可以幫助開發者找到ANR問題的根源呢。

在這篇文章,我先從一份traces.txt的日誌實例開始解析,然後再通過追蹤源碼來解釋traces.txt是如何生成的。

我是基於Android 5.0原生代碼進行分析的,最新的一些代碼邏輯放到了AppErrors類中了,有些小差別,但是大體的邏輯和步驟沒有變化,不會影響閱讀。

日誌實例

// 觸發ANR的應用pid及觸發時間
----- pid 30307 at 2017-09-03 14:51:14 -----
Cmd line: com.example.androidyue.bitmapdemo
JNI: CheckJNI is off; workarounds are off; pins=0; globals=272
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
// JAVA 態線程狀態
// 線程名稱; 優先級; 線程id; 線程狀態
"main" prio=5 tid=1 TIMED_WAIT
// 線程組; 線程suspend計數; 線程debug suspend計數; 同輩線程對象; 當前線程對象
| group="main" sCount=1 dsCount=0 obj=0x416eaf18 self=0x416d8650
// 線程id; 進程優先級; 調度者優先級; 調度者組名; 線程緩存句柄
| sysTid=30307 nice=0 sched=0/0 cgrp=apps handle=1074565528
// native線程狀態; 調度者狀態; 執行用戶代碼的總時間; 執行系統代碼的總時間; 任務cpu使用數量
| state=S schedstat=( 0 0 0 ) utm=5 stm=4 core=3
// 線程調用棧
at java.lang.VMThread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:1044)
at java.lang.Thread.sleep(Thread.java:1026)
at com.example.androidyue.bitmapdemo.MainActivity$1.run(MainActivity.java:27)
at android.app.Activity.runOnUiThread(Activity.java:4794)
at com.example.androidyue.bitmapdemo.MainActivity.onResume(MainActivity.java:33)
at android.app.Instrumentation.callActivityOnResume(Instrumentation.java:1282)
at android.app.Activity.performResume(Activity.java:5405)
// Native態線程狀態
"Binder_5" prio=5 tid=32 NATIVE
| group="main" sCount=1 dsCount=0 obj=0x41dd6570 self=0x678fa458
| sysTid=29473 nice=0 sched=0/0 cgrp=apps handle=1732666432
| state=S schedstat=( 578000 10600000 3 ) utm=0 stm=0 core=0
#00  pc 00016fe4  /system/lib/libc.so (__ioctl+8)
#01  pc 0002a97d  /system/lib/libc.so (ioctl+16)
#02  pc 00016ba1  /system/lib/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+132)
#03  pc 00017363  /system/lib/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+154)
#04  pc 0001b15d  /system/lib/libbinder.so
#05  pc 00011267  /system/lib/libutils.so (android::Thread::_threadLoop(void*)+114)
#06  pc 0004679f  /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+66)
#07  pc 00010dcd  /system/lib/libutils.so
#08  pc 0000e3d8  /system/lib/libc.so (__thread_entry+72)
#09  pc 0000dac4  /system/lib/libc.so (pthread_create+160)
at dalvik.system.NativeStart.run(Native Method)

關鍵詞對照

關鍵詞解釋
prio優先級
tid線程id
group組名
sCount掛起次數
dsCount調試下 掛起計數
obj同輩線程對象
self當前線程對象
sysTid線程id
nice進程優先級
sched調度者優先級
cgrp調度者組名
handle線程緩存句柄
statenative線程狀態
schedstat調度者狀態
utm用戶態CPU時間
stm內核態CPU時間
core運行所在核

線程狀態對照

JAVA狀態CPP狀態解釋
TERMINATEDZOMBIE線程死亡
RUNNABLERUNNING/RUNNABLE可運行或正在運行
TIMED_WAITINGTIMED_WAIT執行了帶有超時參數的wait, sleep或join函數
BLOCKEDMONITOR線程阻塞,等待獲取對象鎖
WAITINGWAIT執行了無超時參數的wait
NEWINITIALIZING新建,正在初始化中
NEWSTARTING新建,正在啟動
RUNNABLENATIVE正在執行JNI本地函數
WAITINGVMWAIT正在等待VM資源
RUNNABLESUSPENDED線程暫停,通常由於GC或者DEBUG被掛起

源碼分析 (虛擬機)

我將從代碼的最底層開始分析,再慢慢到觸發點的頂端,先從cpp開始

Thread::DumpState

thread.cc

void Thread::DumpState(std::ostream& os, const Thread* thread, pid_t tid) {
// 組名
std::string group_name;
// 優先級
int priority;
// 是否是daemon線程
bool is_daemon = false;
// 拿到當前線程對象指針
Thread* self = Thread::Current();
if (thread != nullptr) {
...
}
if (gAborting == 0 && self != nullptr && thread != nullptr && thread->tlsPtr_.opeer != nullptr) {
ScopedObjectAccessUnchecked soa(self);
priority = soa.DecodeField(WellKnownClasses::java_lang_Thread_priority)
->GetInt(thread->tlsPtr_.opeer);
is_daemon = soa.DecodeField(WellKnownClasses::java_lang_Thread_daemon)
->GetBoolean(thread->tlsPtr_.opeer);
...
if (thread_group != nullptr) {
...
group_name = (group_name_string != nullptr) ? group_name_string->ToModifiedUtf8() : "<null>";
}
} else {
priority = GetNativePriority();
}
std::string scheduler_group_name(GetSchedulerGroupName(tid));
if (scheduler_group_name.empty()) {
scheduler_group_name = "default";
}
// 這裡開始輸出線程狀態
if (thread != nullptr) {
// 線程名輸出
os << '"' << *thread->tlsPtr_.name << '"';
if (is_daemon) {
os << " daemon";
}
// 優先級; ID; 狀態
os << " prio=" << priority
<< " tid=" << thread->GetThreadId()
<< " " << thread->GetState();
if (thread->IsStillStarting()) {
os << " (still starting up)";
}
os << "\n";
} else {
os << '"' << ::art::GetThreadName(tid) << '"'
<< " prio=" << priority
<< " (not attached)\n";
}
// 線程組; 線程suspend計數; 線程debug suspend計數; 同輩線程對象; 當前線程對象
if (thread != nullptr) {
MutexLock mu(self, *Locks::thread_suspend_count_lock_);
os << "  | group=\"" << group_name << "\""
<< " sCount=" << thread->tls32_.suspend_count
<< " dsCount=" << thread->tls32_.debug_suspend_count
<< " obj=" << reinterpret_cast<void*>(thread->tlsPtr_.opeer)
<< " self=" << reinterpret_cast<const void*>(thread) << "\n";
}
// 線程id; 進程優先級; 調度者組名
os << "  | sysTid=" << tid
<< " nice=" << getpriority(PRIO_PROCESS, tid)
<< " cgrp=" << scheduler_group_name;
// 調度者優先級; 緩存句柄
if (thread != nullptr) {
int policy;
sched_param sp;
CHECK_PTHREAD_CALL(pthread_getschedparam, (thread->tlsPtr_.pthread_self, &policy, &sp),
__FUNCTION__);
os << " sched=" << policy << "/" << sp.sched_priority
<< " handle=" << reinterpret_cast<void*>(thread->tlsPtr_.pthread_self);
}
os << "\n";
// Grab the scheduler stats for this thread.
std::string scheduler_stats;
if (ReadFileToString(StringPrintf("/proc/self/task/%d/schedstat", tid), &scheduler_stats)) {
scheduler_stats.resize(scheduler_stats.size() - 1);  // Lose the trailing '\n'.
} else {
scheduler_stats = "0 0 0";
}
char native_thread_state = '?';
int utime = 0;
int stime = 0;
int task_cpu = 0;
GetTaskStats(tid, &native_thread_state, &utime, &stime, &task_cpu);
// native線程狀態; 調度者狀態; 執行用戶代碼的總時間; 執行系統代碼的總時間; 任務cpu使用數量
os << "  | state=" << native_thread_state
<< " schedstat=( " << scheduler_stats << " )"
<< " utm=" << utime
<< " stm=" << stime
<< " core=" << task_cpu
<< " HZ=" << sysconf(_SC_CLK_TCK) << "\n";
if (thread != nullptr) {
// dump 線程當前棧
os << "  | stack=" << reinterpret_cast<void*>(thread->tlsPtr_.stack_begin) << "-"
<< reinterpret_cast<void*>(thread->tlsPtr_.stack_end) << " stackSize="
<< PrettySize(thread->tlsPtr_.stack_size) << "\n";
// dump 線程鎖信息
os << "  | held mutexes=";
for (size_t i = 0; i < kLockLevelCount; ++i) {
if (i != kMonitorLock) {
BaseMutex* mutex = thread->GetHeldMutex(static_cast<LockLevel>(i));
if (mutex != nullptr) {
os << " \"" << mutex->GetName() << "\"";
if (mutex->IsReaderWriterMutex()) {
ReaderWriterMutex* rw_mutex = down_cast<ReaderWriterMutex*>(mutex);
if (rw_mutex->GetExclusiveOwnerTid() == static_cast<uint64_t>(tid)) {
os << "(exclusive held)";
} else {
os << "(shared held)";
}
}
}
}
}
os << "\n";
}
}

GetTaskStats

utils.cc

void GetTaskStats(pid_t tid, char* state, int* utime, int* stime, int* task_cpu) {
*utime = *stime = *task_cpu = 0;
std::string stats;
// 從stat節點獲取數據
if (!ReadFileToString(StringPrintf("/proc/self/task/%d/stat", tid), &stats)) {
return;
}
// Skip the command, which may contain spaces.
stats = stats.substr(stats.find(')') + 2);
// Extract the three fields we care about.
std::vector<std::string> fields;
Split(stats, ' ', &fields);
*state = fields[0][0];
*utime = strtoull(fields[11].c_str(), nullptr, 10);
*stime = strtoull(fields[12].c_str(), nullptr, 10);
*task_cpu = strtoull(fields[36].c_str(), nullptr, 10);
}

DumpUnattachedThread

thread_list.cc

// 讀取該線程中的所有狀態信息:JAVA態,Native態,Kernel態
static void DumpUnattachedThread(std::ostream& os, pid_t tid) NO_THREAD_SAFETY_ANALYSIS {
// [接上]
Thread::DumpState(os, nullptr, tid);
DumpKernelStack(os, tid, "  kernel: ", false);
if (false) {
DumpNativeStack(os, tid, "  native: ");
}
os << "\n";
}
void ThreadList::DumpUnattachedThreads(std::ostream& os) {
DIR* d = opendir("/proc/self/task");
if (!d) {
return;
}
Thread* self = Thread::Current();
dirent* e;
// 獲取當前進程中所有線程,dump這些線程的信息
while ((e = readdir(d)) != nullptr) {
char* end;
pid_t tid = strtol(e->d_name, &end, 10);
if (!*end) {
bool contains;
{
MutexLock mu(self, *Locks::thread_list_lock_);
contains = Contains(tid);
}
if (!contains) {
// [接上]
DumpUnattachedThread(os, tid);
}
}
}
closedir(d);
}
bool ThreadList::Contains(Thread* thread) {
return find(list_.begin(), list_.end(), thread) != list_.end();
}

DumpForSigQuit

thread_list.cc

void ThreadList::DumpForSigQuit(std::ostream& os) {
...
DumpUnattachedThreads(os);
}

runtime.cc

void Runtime::DumpForSigQuit(std::ostream& os) {
GetClassLinker()->DumpForSigQuit(os);
GetInternTable()->DumpForSigQuit(os);
GetJavaVM()->DumpForSigQuit(os);
GetHeap()->DumpForSigQuit(os);
TrackedAllocators::Dump(os);
os << "\n";
// [接上]
// dump 線程列表
thread_list_->DumpForSigQuit(os);
BaseMutex::DumpAll(os);
}

HandleSigQuit

singal_catcher.cc

void SignalCatcher::HandleSigQuit() {
Runtime* runtime = Runtime::Current();
std::ostringstream os;
// 輸出進程id以及當前dump時間
os << "\n"
<< "----- pid " << getpid() << " at " << GetIsoDate() << " -----\n";
DumpCmdLine(os);
// 輸出系統信息
std::string fingerprint = runtime->GetFingerprint();
os << "Build fingerprint: '" << (fingerprint.empty() ? "unknown" : fingerprint) << "'\n";
os << "ABI: '" << GetInstructionSetString(runtime->GetInstructionSet()) << "'\n";
os << "Build type: " << (kIsDebugBuild ? "debug" : "optimized") << "\n";
// [接上]
runtime->DumpForSigQuit(os);
...
os << "----- end " << getpid() << " -----\n";
Output(os.str());
}

SignalCatcher::Run

void* SignalCatcher::Run(void* arg) {
...
// Set up mask with signals we want to handle.
SignalSet signals;
signals.Add(SIGQUIT);
signals.Add(SIGUSR1);
while (true) {
// 等待信號
int signal_number = signal_catcher->WaitForSignal(self, signals);
...
switch (signal_number) {
case SIGQUIT:
// [接上]
signal_catcher->HandleSigQuit();
break;
case SIGUSR1:
signal_catcher->HandleSigUsr1();
break;
default:
LOG(ERROR) << "Unexpected signal %d" << signal_number;
break;
}
}
}
int SignalCatcher::WaitForSignal(Thread* self, SignalSet& signals) {
...
int signal_number = signals.Wait();
...
return signal_number;
}

源碼分析 (framework)

可見,android framework通過發送信號3來通知虛擬機來進行dump到traces.txt文件中的

dumpStackTraces

ActivityManagerService.java

private static void dumpStackTraces(String tracesPath, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 當trace寫完後,監聽並觸發onEvent方法
FileObserver observer = new FileObserver(tracesPath, FileObserver.CLOSE_WRITE) {
@Override
public synchronized void onEvent(int event, String path) {
notify();
}
};
try {
// 開始監聽
observer.startWatching();
// 首先收集firstPids鏈表裡的進程信息
if (firstPids != null) {
try {
int num = firstPids.size();
for (int i = 0; i < num; i++) {
synchronized (observer) {
// 發送信號3來觸發
Process.sendSignal(firstPids.get(i), Process.SIGNAL_QUIT);
observer.wait(200);  // Wait for write-close, give up after 200msec
}
}
} catch (InterruptedException e) {
Slog.wtf(TAG, e);
}
}
// 收集native進程信息
...
// 最後收集CPU信息
...
} finally {
observer.stopWatching();
}
}
public static File dumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
// 獲取trace文件路徑
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
...
File tracesFile = new File(tracesPath);
try {
...
FileUtils.setPermissions(tracesDir.getPath(), 0775, -1, -1);  // drwxrwxr-x
if (clearTraces && tracesFile.exists()) tracesFile.delete();
tracesFile.createNewFile();
FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1); // -rw-rw-rw-
} catch (IOException e) {
Slog.w(TAG, "Unable to prepare ANR traces file: " + tracesPath, e);
return null;
}
// [接上]
dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
return tracesFile;
}

appNotResponding

ActivityManagerService.java

final void appNotResponding(ProcessRecord app, ActivityRecord activity,
ActivityRecord parent, boolean aboveSystem, final String annotation) {
ArrayList<Integer> firstPids = new ArrayList<Integer>(5);
SparseArray<Boolean> lastPids = new SparseArray<Boolean>(20);
...
final ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(true);
// [接上]
File tracesFile = dumpStackTraces(true, firstPids, processCpuTracker, lastPids,
NATIVE_STACKS_OF_INTEREST);
...
Slog.e(TAG, info.toString());
if (tracesFile == null) {
// There is no trace file, so dump (only) the alleged culprit's threads to the log
Process.sendSignal(app.pid, Process.SIGNAL_QUIT);
}
addErrorToDropBox("anr", app, app.processName, activity, parent, annotation,
cpuInfo, tracesFile, null);
...
}

觸發ANR的方式上面我已經列舉過了,這裡我以觸發Broadcast Timeout時進行分析

AppNotResponding.run

BroadcastQueue.java

private final class AppNotResponding implements Runnable {
private final ProcessRecord mApp;
private final String mAnnotation;
public AppNotResponding(ProcessRecord app, String annotation) {
mApp = app;
mAnnotation = annotation;
}
@Override
public void run() {
// [接上]
mService.appNotResponding(mApp, null, null, false, mAnnotation);
}
}

broadcastTimeoutLocked

BroadcastQueue.java

final void broadcastTimeoutLocked(boolean fromMsg) {
...
String anrMessage = null;
...
if (app != null) {
anrMessage = "Broadcast of " + r.intent.toString();
}
...
if (anrMessage != null) {
// [接上]
mHandler.post(new AppNotResponding(app, anrMessage));
}
}

BroadcastHandler.handleMessage

private final class BroadcastHandler extends Handler {
public BroadcastHandler(Looper looper) {
super(looper, null, true);
}
@Override
public void handleMessage(Message msg) {
switch (msg.what) {
...
case BROADCAST_TIMEOUT_MSG: {
synchronized (mService) {
// [接上]
broadcastTimeoutLocked(true);
}
} break;
...
}
}
};

setBroadcastTimeoutLocked

BroadcastQueue.java

final void setBroadcastTimeoutLocked(long timeoutTime) {
if (! mPendingBroadcastTimeoutMessage) {
Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
mHandler.sendMessageAtTime(msg, timeoutTime);
mPendingBroadcastTimeoutMessage = true;
}
}
final void processNextBroadcast(boolean fromMsg) {
if (! mPendingBroadcastTimeoutMessage) {
long timeoutTime = r.receiverTime + mTimeoutPeriod;
// [接上]
setBroadcastTimeoutLocked(timeoutTime);
}
}

總結

下面總結一下觸發廣播ANR後輸出traces.txt的過程:

  1. 有序廣播在註冊後發送
  2. AMS設置廣播超時時間,通過Handler進行延遲設定
  3. 廣播超時觸發超時機制
  4. Framework層發送信號給虛擬機
  5. 虛擬機dump出特定進程所有線程信息並存儲到traces.txt文件中

相關文章

微信小程序中的swiper用法

總結微信小程序中的一些知識

從源碼角度看CPU相關日誌

從源碼角度看AMS.startProcessLocked