Fraud investigation of employees’ behaviors on imbalanced transaction data using Python, SQL, R and RapidMiner
Extracted ~10 TB of data and performed text vectorization with a word2vec model
Explored effects of bootstrap resampling, synthetic minority oversampling (SMOTE) and cluster centroid sampling algorithms on logistic regression and decision tree models (publication under review by International Journal of Managing Information Technology)
Implemented variable clustering and principal component analysis (PCA) to reduce data dimensionality
Displayed the topology of the high-dimensional data using topological data analysis (TDA)
Created new features based on results from TDA and clustering analysis that improved accuracy by 0.05
Explored the characteristics of fraudulent employees using k-means clustering analysis
Measured the effectiveness of the proposed model using the Kolmogorov-Smirnov test, lift charts and cumulative gains charts.
Customer usage pattern and anomalous activity detection using Python and SQL
Analyzed over 3 GB of workforce management data to identify customer usage patterns and anomalous activities using advanced SQL querying and Python.
Performed exploratory data analysis on numerical, categorical and time series data using Matplotlib and Seaborn.
Constructed a multivariate time series clustering model (DTW with hierarchical clustering) on 120 features to group customers and segment the market, and described the cluster centroids using DBA (DTW Barycenter Averaging).
Trained both anomaly detection models (isolation forest, robust PCA) and regression models (linear models, random forest) to build an alert system for customers’ anomalous activities.
Built Tableau dashboards describing customers’ usage activities and reported analysis results weekly to the team and manager.
Assembled an Apache Flume service on a private cloud infrastructure to ingest 100+ TB/day of log files from cloud computing applications and dispatch tagged log events to downstream services, all in real time
Used Kafka as a primary data storage solution to hold short-term log data, reducing data retrieval latency from 3 s to 0.5 s
Used Hadoop Distributed File System as a secondary data storage solution to persist PB-level historical log data
Built an alerting module using Spark Streaming to catch error logs from Kafka and notify the internal management system of cloud computing applications, achieving a total latency of less than 1 s
Built an analysis module using Spark Streaming to consume and aggregate log data by severity level and generate daily health reports for all cloud computing applications, improving efficiency 5x.
Crawled 114,687 images using Python, then built and trained a Faster R-CNN on the Wider Face dataset using PyTorch to detect and extract 94% of faces.
Trained and applied an SRGAN model using PyTorch to upscale and standardize image resolution from less than 100x100 to 256x256.
Implemented, trained and applied an 8-layer StyleGAN in PyTorch to generate cartoon faces in dynamic styles according to input images and commands
Added effects by mixing the styles of the generated images through control of the model's latent spaces
Implemented a diffusion model (DDPM) and trained it on 10,881 face images.
Constructed GPU infrastructure and implemented parallel computation to speed up run-time performance 12x.
Established an NLP text summarization system with TensorFlow in an environment built on Nvidia Docker.
Analyzed and processed 2 TB of text summarization datasets from THUCnews, LCSTS, CSL, news headlines and contexts, and judicial summaries using NLTK
Benchmarked the performance of Pointer-Generator, WoBERT, Nezha, and T5 on the above datasets to obtain proper titles/abstracts, guidelines and summaries.
Applied BERT and WoBERT to improve prediction accuracy by 5.6% and 4.5% respectively
Expedited the inference of these transformer-based models by 1.55x via Turbo Transformer
Accelerated the modeling run-time performance by 8x with parallel computation using CUDA on GPUs.
Applied the BERT-of-Theseus method to distill WoBERT, shrinking model size to 50%.