Useful packages in Python
Dec 5, 2022 00:00 · 395 words · 2 minute read
A roundup of lesser-known but frequently used Python packages and other software.
PYTHON
pynvml
Provides a Python interface to GPU management and monitoring functions. This is a wrapper around the NVML library. For information about the NVML library, see the NVML developer page: http://developer.nvidia.com/nvidia-management-library-nvml. As of version 11.0.0, the NVML wrappers used in pynvml are identical to those published through nvidia-ml-py. Note that the project's README can be run with 'python -m doctest -v README.txt', although the results are system dependent.
Usage:
You can use the lower level nvml bindings
>>> from pynvml import *
>>> nvmlInit()
>>> print("Driver Version:", nvmlSystemGetDriverVersion())
Driver Version: 410.00
>>> deviceCount = nvmlDeviceGetCount()
>>> for i in range(deviceCount):
...     handle = nvmlDeviceGetHandleByIndex(i)
...     print("Device", i, ":", nvmlDeviceGetName(handle))
...
Device 0 : Tesla V100
>>> nvmlShutdown()
Or the higher-level nvidia_smi API:
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
# Query free and total memory for each GPU
nvsmi.DeviceQuery('memory.free, memory.total')
# List all supported query fields
print(nvsmi.DeviceQuery('--help-query-gpu'), end='\n')
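As a hedged sketch of how the low-level bindings can be combined, the helper below reports per-GPU memory in MiB. The names `gpu_memory_report` and `bytes_to_mib` are illustrative and not part of pynvml. The import is guarded so the snippet also runs, returning an empty list, on machines without pynvml or without an NVIDIA driver.

```python
try:
    import pynvml
except ImportError:
    pynvml = None  # pynvml is not installed


def bytes_to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / (1024 ** 2)


def gpu_memory_report():
    """Return one dict per GPU, or an empty list if NVML is unavailable."""
    if pynvml is None:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no driver or no NVML shared library on this machine
    try:
        report = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older releases return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # fields are in bytes
            report.append({
                "index": i,
                "name": name,
                "free_mib": bytes_to_mib(mem.free),
                "total_mib": bytes_to_mib(mem.total),
            })
        return report
    finally:
        pynvml.nvmlShutdown()


for gpu in gpu_memory_report():
    print(gpu)
```

Wrapping the queries in try/finally keeps `nvmlShutdown()` paired with `nvmlInit()` even if one of the calls fails.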
TQDM
tqdm derives from the Arabic word taqaddum (تقدّم) which can mean “progress,” and is an abbreviation for “I love you so much” in Spanish (te quiero demasiado). Instantly make your loops show a smart progress meter - just wrap any iterable with tqdm(iterable), and you’re done!
Usage:
from tqdm import tqdm
for i in tqdm(range(10000)):
    ...
Change the bar color with the colour argument, e.g. tqdm(range(10000), colour='red').
https://tqdm.github.io/docs/tqdm/#update
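The linked `update` method is for cases where progress does not advance one item per loop iteration. A minimal sketch, assuming work is processed in chunks: `sum_in_chunks` is a made-up helper, and the import is guarded so the code still runs (without a bar) where tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:
    tqdm = None  # tqdm is not installed; run without a progress bar


def sum_in_chunks(values, chunk_size=100):
    """Sum `values` chunk by chunk, advancing the bar by the chunk length."""
    total = 0
    bar = tqdm(total=len(values), colour="red") if tqdm else None
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        if bar is not None:
            bar.update(len(chunk))  # manual increment instead of wrapping an iterable
    if bar is not None:
        bar.close()
    return total


print(sum_in_chunks(list(range(1000))))
```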
PARAMIKO
Paramiko is a pure-Python (2.7, 3.4+) implementation of the SSHv2 protocol, providing both client and server functionality. It provides the foundation for the high-level SSH library Fabric, which is what we recommend you use for common client use-cases such as running remote shell commands or transferring files.
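For reference, running a remote command with Paramiko's client directly might look like the sketch below. `run_remote` is an illustrative helper, not a Paramiko function, and the host and user in the usage note are placeholders. It relies on hosts already present in the local known_hosts file, so unknown hosts are rejected.

```python
try:
    import paramiko
except ImportError:
    paramiko = None  # paramiko is not installed


def run_remote(host, user, command, key_path=None, timeout=10.0):
    """Run `command` on `host` over SSH; return (exit_status, stdout_text)."""
    if paramiko is None:
        raise RuntimeError("paramiko is required for this sketch")
    client = paramiko.SSHClient()
    client.load_system_host_keys()  # trust only hosts already in known_hosts
    try:
        client.connect(host, username=user, key_filename=key_path, timeout=timeout)
        _stdin, stdout, _stderr = client.exec_command(command)
        output = stdout.read().decode()
        status = stdout.channel.recv_exit_status()
        return status, output
    finally:
        client.close()
```

A call would then look like `run_remote("gpu-box.example.com", "alice", "nvidia-smi")`, where both the host name and the user name are hypothetical.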
viztracer
https://github.com/gaogaotiantian/viztracer
VizTracer is a low-overhead logging/debugging/profiling tool that can trace and visualize your python code execution.
The front-end UI is powered by Perfetto. Use “AWSD” to zoom/navigate. More help can be found in “Support - Controls”.
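A minimal sketch of tracing a single call inline, assuming the context-manager form of `VizTracer`. The helper `traced` and the output file name `result.json` are illustrative choices; the import is guarded so the function simply runs untraced where viztracer is missing.

```python
try:
    from viztracer import VizTracer
except ImportError:
    VizTracer = None  # viztracer is not installed


def fib(n):
    """Deliberately naive recursion so the trace has something to show."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def traced(func, *args, output_file="result.json"):
    """Call func(*args), writing a trace to output_file when VizTracer is available."""
    if VizTracer is None:
        return func(*args)
    with VizTracer(output_file=output_file):
        return func(*args)


print(traced(fib, 10))
```

The resulting file can then be opened with the `vizviewer` command that ships with the package.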
CDX tools
https://github.com/cocrawler/cdx_toolkit/
Related links
skeptric - Searching 100 Billion Webpages Pages With Capture Index
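A short sketch of querying the Common Crawl capture index with cdx_toolkit, following the pattern in the project's README. `list_captures` is an illustrative wrapper; the call needs network access, so treat it as a starting point rather than a tested recipe.

```python
try:
    import cdx_toolkit
except ImportError:
    cdx_toolkit = None  # cdx_toolkit is not installed


def list_captures(url_pattern, limit=5, source="cc"):
    """Return (timestamp, status, url) tuples for captures matching url_pattern."""
    if cdx_toolkit is None:
        raise RuntimeError("cdx_toolkit is required for this sketch")
    cdx = cdx_toolkit.CDXFetcher(source=source)  # 'cc' selects Common Crawl
    return [
        (obj["timestamp"], obj["status"], obj["url"])
        for obj in cdx.iter(url_pattern, limit=limit)
    ]
```

For example, `list_captures("commoncrawl.org/*", limit=3)` would fetch up to three capture records for that URL pattern.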
PySpark
https://github.com/commoncrawl/cc-pyspark
Works with the columnar URL index.
The columnar index files can be used to filter for Chinese-language text.
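The filtering idea can be sketched in plain Python. This assumes the index's `content_languages` column holds comma-separated ISO-639-3 codes with the dominant language first; the sample rows and helper names are made up for illustration.

```python
def primary_language(content_languages):
    """Return the first code of a comma-separated language list, or None."""
    if not content_languages:
        return None
    return content_languages.split(",")[0].strip()


def keep_chinese(rows):
    """Keep rows whose primary detected language is Chinese ('zho')."""
    return [
        row for row in rows
        if primary_language(row.get("content_languages")) == "zho"
    ]


# Hypothetical rows shaped like records from the columnar index.
sample_rows = [
    {"url": "https://example.cn/a", "content_languages": "zho"},
    {"url": "https://example.com/b", "content_languages": "eng,zho"},
    {"url": "https://example.org/c", "content_languages": None},
]

print([row["url"] for row in keep_chinese(sample_rows)])
# → ['https://example.cn/a']
```

In a Spark SQL query against the index table, the same condition would be expressed as a filter on that column rather than a Python loop.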
Markdownify
Converts HTML to Markdown.
https://github.com/matthewwithanm/python-markdownify
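A minimal usage sketch, assuming the `markdownify` function and its `heading_style` option as documented in the project README. `html_to_markdown` is an illustrative wrapper name.

```python
try:
    from markdownify import markdownify as md
except ImportError:
    md = None  # markdownify is not installed


def html_to_markdown(html):
    """Convert an HTML fragment to Markdown with '#'-style headings."""
    if md is None:
        raise RuntimeError("markdownify is required for this sketch")
    return md(html, heading_style="ATX").strip()


if md is not None:
    print(html_to_markdown('<b>Yay</b> <a href="http://github.com">GitHub</a>'))
```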
Deduplicate
Data deduplication.
https://github.com/google-research/deduplicate-text-datasets
https://mrjob.readthedocs.io/ (MapReduce)
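To make the idea of deduplication concrete, here is a much simpler technique than the suffix-array approach used by the repository above: exact-match deduplication by hashing whitespace- and case-normalized text. It is a sketch for small in-memory lists, not a substitute for those tools.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def dedupe_exact(docs):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["Hello  world", "hello world", "Another page"]
print(dedupe_exact(docs))
# → ['Hello  world', 'Another page']
```

Storing digests rather than full strings keeps memory use bounded per document, which is also what makes this pattern easy to move into a MapReduce job keyed on the digest.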
NETWORK
shadowsocks-local
proxychains4
DISPLAY
x11vnc
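Common Crawl
There are 8 types of files in total.
Among them, WARC files hold the raw crawl responses including the HTML, WAT files hold extracted metadata, WET files hold the extracted plain text, and the URL index and columnar URL index are index files.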
Type | File List | #Files | Total Size Compressed (TiB) |
---|---|---|---|
Segments | CC-MAIN-2022-49/segment.paths.gz | 100 | |
WARC files | CC-MAIN-2022-49/warc.paths.gz | 88000 | 92.59 |
WAT files | CC-MAIN-2022-49/wat.paths.gz | 88000 | 22.89 |
WET files | CC-MAIN-2022-49/wet.paths.gz | 88000 | 9.58 |
Robots.txt files | CC-MAIN-2022-49/robotstxt.paths.gz | 88000 | 0.15 |
Non-200 responses files | CC-MAIN-2022-49/non200responses.paths.gz | 88000 | 2.43 |
URL index files | CC-MAIN-2022-49/cc-index.paths.gz | 302 | 0.25 |
Columnar URL index files | CC-MAIN-2022-49/cc-index-table.paths.gz | 900 | 0.28 |
https://github.com/cocrawler/cdx_toolkit/blob/main/examples/iter-and-warc.py
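Each `*.paths.gz` listing is a gzip-compressed text file with one relative path per line. The sketch below turns such a listing into download URLs, assuming the `https://data.commoncrawl.org/` prefix used for HTTP access. The sample listing is made up so the code runs offline; `paths_to_urls` is an illustrative name.

```python
import gzip
import io

BASE_URL = "https://data.commoncrawl.org/"


def paths_to_urls(gz_bytes, limit=None):
    """Decode a gzip-compressed path listing into full download URLs."""
    urls = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            path = line.strip()
            if not path:
                continue  # skip blank lines
            urls.append(BASE_URL + path)
            if limit is not None and len(urls) >= limit:
                break
    return urls


# Made-up listing standing in for the contents of a real paths file.
sample = gzip.compress(b"crawl-data/EXAMPLE/a.warc.gz\ncrawl-data/EXAMPLE/b.warc.gz\n")
print(paths_to_urls(sample, limit=1))
# → ['https://data.commoncrawl.org/crawl-data/EXAMPLE/a.warc.gz']
```

With a real listing, the bytes would come from downloading a file such as `crawl-data/CC-MAIN-2022-49/warc.paths.gz` under the same prefix.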